Edits: (21/07/23) clarification that dependence is not equivalent to (linear) correlation, thanks to Benedikt Ehinger for catching this. (25/11/24) Fixed row ordering issue in joint probability table; and added reference to causal discovery blog post.
Full disclosure: I am not an expert in causal inference. At the outset of writing this blog post, I am very much a rookie in this field, having thought a bit about it and read a few books and papers. However, I have come to realize that the best way to learn a subject — deeply, pedantically learn it — is through trying to teach it. Putting oneself in the role of teacher induces a sort of dread, of appearing incompetent, unprepared, or unknowledgeable about the subject you are teaching. This dread provides ample motivation to dig into the details and get things right — at least that's how it works for me.
So here goes nothing 1.
What is causal inference?
"Inference" refers to any assertion about the general from the specific, or in other words, a statement which generalizes to a population from a sample of that population. A causal inference asserts that some class of event generally causes another class of event to occur. That's frustratingly abstract. Some concrete examples of causal inferences are:
- Pressing a gas pedal causes a car to accelerate
- Smoking tobacco causes lung cancer
- Eating beans causes flatulence
- Depolarization of the neuronal membrane causes an action potential
You may agree with some or all of these statements, but you might also go "hmmm". It is sometimes true, for instance, that pressure on a gas pedal causes a car to accelerate, but only under certain conditions: the tank must contain fuel, the engine must be firing, and the drive shaft must be engaged. Smoking tobacco increases the probability of developing lung cancer, but there are many people who smoke their entire lives without ever developing it. Despite the popular tune, not everyone farts when they eat beans. And, for those neuroscientists out there: a depolarization of a neuronal membrane is typically necessary for it to generate an action potential, but only if it surpasses the threshold potential.
These are important considerations, and tell us a bit about the nature of causal inferences.
Firstly, these assertions are conditional, meaning they assume that some set of conditions (e.g., "engine must be firing") is met. You can be more or less pedantic about these conditions. We would typically assume, for instance, that in making the gas pedal assertion, you are talking about a functionally intact car, and not one that has recently collided with a brick wall. We might also assume that the fuel in the tank is the correct type of fuel for the engine. Much of this is implied, and it can become tiresome to enumerate all the conditional assumptions one has to make in order to support a causal assertion, but it is nonetheless important to consider these whenever confusion might arise.
Secondly, these assertions are probabilistic. This means that event \(A\) doesn't always cause event \(B\) to occur, but rather increases the probability that it will. This can be formulated as a conditional probability:
$$\mathbb{P}(B|A)>\mathbb{P}(B|¬A)$$
the left side of which can be read as "the probability of B occurring given that A has occurred", and the right side as "the probability of B occurring given that A has not occurred" 2.
The question of what would have happened had event \(A\) not occurred is referred to as a counterfactual, and is a central concept in causal analysis.
For the gas pedal example, the difference \(\mathbb{P}(B|A)-\mathbb{P}(B|¬A)\) will be close to one, indicating that pressing the gas pedal (event \(A\)) will almost certainly accelerate the car (event \(B\)), while acceleration is unlikely to occur in its absence.
For the lung cancer assertion, however, this difference is somewhere between 0 and 1. We can make a rough estimate of this: the lifetime chance of getting lung cancer (before age 80) has been estimated at 1 in 7 (or 14%) for smokers, and 1 in 100 (or 1%) for non-smokers. So the difference for this causal assertion is:
$$\mathbb{P}(B|A)-\mathbb{P}(B|¬A)=0.14-0.01=0.13$$
In practice, we are usually interested in random variables, rather than discrete events. Random variables represent properties or behaviours of a particular population, and are characterized by probability distributions. To formulate our causal inferences to deal with random variables, we can use expected values (denoted \(\mathbb{E}\)), rather than probabilities. This gives us an effect size 3:
$$\Delta\mathbb{E} =\mathbb{E}(B|A)-\mathbb{E}(B|¬A)$$
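To make this concrete, here is a minimal sketch (in Python, with entirely made-up variable names and probabilities) of estimating such an effect size from binary observational data, where the expected values reduce to conditional means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data: A = ate beans (1/0), B = flatulence (1/0).
# The "true" effect is baked in: P(B|A=1) = 0.7, P(B|A=0) = 0.2.
n = 10_000
A = rng.binomial(1, 0.4, size=n)
B = rng.binomial(1, np.where(A == 1, 0.7, 0.2))

# Estimate the conditional probabilities (equal to conditional means, since B
# is binary) and take their difference.
p_B_given_A = B[A == 1].mean()
p_B_given_notA = B[A == 0].mean()
delta_E = p_B_given_A - p_B_given_notA
print(f"P(B|A)={p_B_given_A:.3f}, P(B|¬A)={p_B_given_notA:.3f}, ΔE≈{delta_E:.3f}")
```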
Confounding
However, just observing that \(\Delta\mathbb{E}>0\) is not sufficient to support an inference that \(A\) causes \(B\) (denoted \(A \rightarrow B\)). That is because of confounding. You might observe that you fart whenever you eat beans, but you may also always eat beans with tomato sauce, while sitting down. How can you be sure it's not the sauce and/or the sitting that are causing flatulence?
The classic approach to this is to conduct a randomized controlled experiment. You would give one randomly assigned group beans, tomato sauce, and chairs, and another group the same minus the beans. Here, you have controlled for the confounders and manipulated only the presence of beans 4.
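Why does randomization help? A toy simulation (again, entirely made-up probabilities) shows that random assignment balances the confounder across groups, so the simple difference in outcome rates recovers the causal effect of the beans alone:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Randomized assignment: beans or no beans, independent of everything else.
beans = rng.binomial(1, 0.5, size=n)

# A potential confounder (tomato sauce) that we do not control...
sauce = rng.binomial(1, 0.6, size=n)

# ...and an outcome that, in this toy world, depends on both.
p_flatulence = 0.1 + 0.4 * beans + 0.2 * sauce
flatulence = rng.binomial(1, p_flatulence)

# Because of randomization, sauce is (on average) balanced across groups, so
# the simple difference in means estimates the causal effect of beans.
effect = flatulence[beans == 1].mean() - flatulence[beans == 0].mean()
print(f"Sauce rate (beans vs no beans): "
      f"{sauce[beans == 1].mean():.2f} vs {sauce[beans == 0].mean():.2f}")
print(f"Estimated causal effect of beans: {effect:.2f}  (true value: 0.40)")
```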
This is straightforward to do with beans and flatulence, or indeed with cars and gas pedals. With the relationship between tobacco smoking and cancer, however, it's very difficult to perform an experiment that lasts the lifetimes of all participants, and very unethical to manipulate groups such that one is administered tobacco throughout life and the other not.
This difficulty is inherent in many important questions we would like causal answers to: the factors leading to Alzheimer's-type dementia, the relationship between alcohol dependence and depression, whether cannabis consumption causes schizophrenia, etc. These are typically probabilistic phenomena that occur over long time scales, can only be observed rather than manipulated, and entail numerous potential confounders.
At a different scale of observation, another important causal question is whether we can infer about causal relationships between neurons or regions of the human brain. Since there are an estimated 86 billion neurons and on the order of \(10^{15}\) connections between them, this question is plagued by an extreme overabundance of confounders. There is no conceivable experiment that can help us disentangle these confounders when we attempt to perform causal inference on individual neuronal connections (but see this blog post for a discussion).
Critical to this discussion is the concept of dependence. Generally speaking, two variables are dependent if there exists a non-zero correlation between them; and thus they are independent if not 5. Dependence can arise due to a causal relationship, but also a non-causal one (such as having a common third variable that causes both). When referring to dependence between two variables such that one causes the other (either directly or via a "chain" of intermediary events), I will here use the term causal dependence. In what follows, we will be using these terms fairly extensively, so it's important to appreciate the distinction.
The "do" operator
Datasets based on passive observation, rather than experimental manipulation, are commonly called observational data. As we've seen, for such a dataset, the difference \(\mathbb{E}(B|A)-\mathbb{E}(B|¬A)\) is insufficient evidence for inferring the causal effect \(A \rightarrow B\), due to the presence of uncontrolled confounders.
To address this, we can introduce a new function called the do operator, denoted \(do(A)\) 6. This operator signifies that we (or some imaginary experimenter) have intervened to cause event \(A\) to occur. In our beans experiment above, we intervened to give beans to one group, so our inference is based on a test of:
$$\Delta\mathbb{E}=\mathbb{E}(B|do(A))-\mathbb{E}(B|¬do(A))>0$$
This formulation extends naturally to experimental data, but — crucially — is also useful to analyze purely observational data (where no actual intervention has been done), given a few important constraints which we will explore below.
Causal graphs: Visualizing the logic
In a seminal paper called Causal Diagrams for Empirical Research, Judea Pearl introduced the idea of using graph representations of hypothesized relationships between variables to help simplify and support causal analyses. The basic idea is simple: represent variables as nodes and direct causal relationships as directed edges (i.e., arrows). Such a graph is a visual representation of a causal model.
Here is what our beans example would look like as a graph:
Importantly, there is as much information in absent edges as in present ones. Absent edges represent the assumption that two variables are causally independent of one another. Even prior to our experiment, we will probably be happy to assume that eating beans, eating sauce, and sitting in a chair do not cause each other.
Two important properties of the above graph representation are (1) it is directed (arrows indicate the direction of the causal relationship for each edge), and (2) it is acyclic; in other words, it does not have any cycles, or in yet other words, there is no way to follow the edges such that you end up where you started. Such a graph is thus referred to as a directed acyclic graph (or DAG). Causal graph representations must have these properties in order for many of the tricks described below to be valid.
DAGs often use familial terms (think family tree) to refer to relationships between nodes. Thus, in the edge \(A \rightarrow B\), \(A\) is a parent of \(B\) and \(B\) is a child of \(A\). In the set of edges \(A \rightarrow B \rightarrow C\), \(A\) is an ancestor of both \(B\) and \(C\), which are descendants of \(A\). In the graph \(B \leftarrow A \rightarrow C\), \(B\) and \(C\) are siblings. Hopefully you get the picture.
Graph junctions: Chains, forks, and colliders
Graph junctions in DAGs refer to the patterns of input and output edges for a given node (or set of nodes). Let's consider the gas pedal example. Our simple causal relationship was \(A \rightarrow B\), where \(A\) is a gas pedal press event, and \(B\) is an acceleration event. A more detailed schematic of a modern throttle system looks something like this, however:
Ignoring the sensor feedback loop, and the additional steps linking throttle valve state to acceleration of the car, we still have a lot of additional intermediate variables to possibly consider. The diagram above can be expressed as: \(A \rightarrow X_1 \rightarrow X_2 \rightarrow X_3 \rightarrow X_4 \rightarrow B \). Or, as a graph:
This configuration is called a causal chain, referring to the series of "links" between each subsequent event. Two important properties of causal chains are:
- We can (usually) remain confident that there is a causal relationship to any node from any ancestor (going up the chain). In other words, our initial model \(A \rightarrow B\) is still true. This is due to transitivity.
- If we can observe \(X_4\) (the state of the throttle valve), we don't need to know anything about its ancestors to predict \(B\) (vehicle acceleration). In other words, \(B\) is independent of \(A\), after accounting for (or manipulating) \(X_4\). This is called conditional independence.
The term conditional is important here: when we say "conditional on \(X\)," we are implying either that (a) we have manipulated \(X\) to fix its value — \(do(X=x)\) — or (b) we have accounted for or "regressed out" the covariance of \(X\) with the nodes of interest.
The notation for saying "\(A\) is independent of \(B\) conditional on \(X\)" is:
$$A \! \perp \!\!\! \perp \! B | X$$
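To make conditional independence concrete, here is a small simulated example of a chain (shortened to a single intermediate node, \(A \rightarrow X \rightarrow B\), with made-up linear-Gaussian relationships). The marginal correlation between \(A\) and \(B\) is clearly non-zero, but it vanishes once the linear contribution of \(X\) is "regressed out" of both:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Chain A -> X -> B, with independent Gaussian noise at each node.
A = rng.normal(size=n)
X = 0.8 * A + rng.normal(size=n)
B = 0.8 * X + rng.normal(size=n)

def residualize(v, on):
    """Remove the least-squares linear contribution of `on` from `v`."""
    slope, intercept = np.polyfit(on, v, 1)
    return v - (slope * on + intercept)

print("corr(A, B):      ", round(np.corrcoef(A, B)[0, 1], 3))   # clearly non-zero
print("corr(A, B | X):  ", round(np.corrcoef(residualize(A, X),
                                              residualize(B, X))[0, 1], 3))  # ~0
```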
Another common type of graph junction is called a fork. Forks occur when one event causes two or more subsequent (causally independent) events. Continuing the above example, when the throttle opens, this causes both \(B\) (acceleration) and \(X_5\) (the throttle sensor to change its output to the control unit):
The fork above can be written as: \(X_5 \leftarrow X_4 \rightarrow B\). Or, as a graph:
Despite being causally independent, the child nodes of a causal fork (\(B\) and \(X_5\)) are likely to be dependent (correlated) because they have a common causal factor (the throttle sensor will be strongly correlated with vehicle acceleration despite having no direct causal relationship). However, this correlation would not be present if we fixed the value of the throttle 7; in other words, they are independent, conditional on \(X_4\). This can be expressed as equal probabilities:
$$B \! \perp \!\!\! \perp \! X_5 | X_4 \implies \mathbb{P}(B|X_4,X_5) = \mathbb{P}(B|X_4)$$ $$B \! \perp \!\!\! \perp \! X_5 | X_4 \implies \mathbb{P}(X_5|X_4,B) = \mathbb{P}(X_5|X_4)$$
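We can check these equalities numerically with a simulated (binary, made-up) version of the fork: marginally, the sensor reading \(X_5\) is informative about acceleration \(B\), but once \(X_4\) is known, it adds nothing:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Binary fork: X4 (throttle open) -> B (acceleration), X4 -> X5 (sensor reading).
X4 = rng.binomial(1, 0.5, size=n)
B  = rng.binomial(1, np.where(X4 == 1, 0.90, 0.10))
X5 = rng.binomial(1, np.where(X4 == 1, 0.95, 0.05))

# Marginally, B and X5 are strongly associated...
print("P(B=1 | X5=1) =", round(B[X5 == 1].mean(), 3))
print("P(B=1 | X5=0) =", round(B[X5 == 0].mean(), 3))

# ...but conditional on X4, X5 tells us nothing extra about B.
sel = X4 == 1
print("P(B=1 | X4=1, X5=1) =", round(B[sel & (X5 == 1)].mean(), 3))
print("P(B=1 | X4=1, X5=0) =", round(B[sel & (X5 == 0)].mean(), 3))
```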
A third type of junction is called a collider. This is when an event is caused by two or more antecedent events; or in other words, when a node has two or more parents. To stick with the throttle system example, suppose we added a cruise control system \(X_6\), that automatically adjusts the output of the throttle control unit (\(X_2\)):
In this system, both the gas pedal and the cruise control system cause changes to the control unit's output (and subsequent events). This relationship can be written as \(X_1 \rightarrow X_2 \leftarrow X_6\). Or, as a graph:
Colliders have a fairly unintuitive property: despite \(X_1\) and \(X_6\) being causally independent, if we fix the value of \(X_2\), they become dependent (correlated) 8. For example, if we observe that the output of the control unit is 50% throttle, but we are not pressing the gas pedal, we can deduce that the cruise control is active. This is called conditional dependence, and \(X_1\) and \(X_6\) are said to be dependent conditional on \(X_2\).
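This "explaining away" effect is easy to demonstrate with a toy simulation (made-up linear relationships): \(X_1\) and \(X_6\) are generated independently, but within a narrow band of control-unit output \(X_2\), they become strongly (negatively) correlated:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

X1 = rng.normal(size=n)                   # gas pedal input
X6 = rng.normal(size=n)                   # cruise control input (independent of X1)
X2 = X1 + X6 + 0.1 * rng.normal(size=n)   # control unit output (the collider)

print("corr(X1, X6):           ", round(np.corrcoef(X1, X6)[0, 1], 3))   # ~0

# "Condition" on the collider by selecting a narrow band of X2 values.
band = np.abs(X2 - 1.0) < 0.1
print("corr(X1, X6 | X2 ≈ 1.0):", round(np.corrcoef(X1[band], X6[band])[0, 1], 3))  # strongly negative
```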
Probabilistic reasoning on DAGs
A causal model expressed as a DAG where nodes are events or continuous random variables is also called a causal Bayesian network. Bayesian networks allow us to:
- Assign joint probabilities (or probability distributions) to the model, and prior probabilities to its nodes, based on its causal structure and the conditional independence relationships it imposes
- Simulate interventions using the do-operator, and adjust the causal graph accordingly to predict the change in its probability distributions
Let's consider these in turn.
Joint probability on a Bayesian network refers to the probability of all variables having specific values; i.e., probabilities of discrete states of the system. This is done using the chain rule of probability, that the joint probability of a (topologically ordered) set of variables is:
$$\mathbb{P}(X_1,...,X_n)=\mathbb{P}(X_1)\mathbb{P}(X_2|X_1)\mathbb{P}(X_3|X_1,X_2) \cdot ... \cdot \mathbb{P}(X_n|X_1,...,X_{n-1})$$
In words, this means that the joint probability is the product of the marginal probability of the first node, with the probabilities of all other nodes, conditional on their ancestors. Determining these conditional probabilities can get quite complicated, but we can use what we learned above about conditional independence relationships to simplify this calculation. Specifically, we know that conditioning on a node's parent renders it independent of all other ancestors. For our chain example:
The full chain of ancestors for each node can be reduced to a single parent, e.g.: \(\mathbb{P}(X_4|A,X_1,X_2,X_3) = \mathbb{P}(X_4|X_3)\). This reduces our joint probability for a causal model to 9:
$$\mathbb{P}(X_1,...,X_n)=\prod_{k=1}^{n}\mathbb{P}(X_k|Pa(X_k))$$
where \(Pa(X_k)\) refers to the set of parent nodes of \(X_k\). For the chain example this is:
$$\mathbb{P}(A) \cdot \mathbb{P}(X_1|A) \cdot \mathbb{P}(X_2|X_1) \cdot \mathbb{P}(X_3|X_2) \cdot \mathbb{P}(X_4|X_3) \cdot \mathbb{P}(B|X_4)$$
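As a toy illustration of this factorization (all probabilities invented), here is how one might compute the probability of a single configuration of the chain, multiplying each node's probability conditioned only on its parent:

```python
# Invented conditional probability tables for the chain A -> X1 -> ... -> X4 -> B,
# each entry giving (P(node=1 | parent=0), P(node=1 | parent=1)).
p_A = 0.3
p_given_parent = {
    "X1": (0.05, 0.95),
    "X2": (0.05, 0.95),
    "X3": (0.05, 0.95),
    "X4": (0.05, 0.95),
    "B":  (0.02, 0.90),
}

def joint(config):
    """P(A, X1, ..., X4, B) for one configuration, using the parent factorization."""
    prob = p_A if config["A"] == 1 else 1 - p_A
    parent = "A"
    for node in ["X1", "X2", "X3", "X4", "B"]:
        p1 = p_given_parent[node][config[parent]]
        prob *= p1 if config[node] == 1 else 1 - p1
        parent = node
    return prob

# Probability that every event in the chain occurs:
print(joint({"A": 1, "X1": 1, "X2": 1, "X3": 1, "X4": 1, "B": 1}))
```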
To get more concrete, let's circle back to our smoking and lung cancer example. As a causal graph, we can posit a model such as:
The joint probability for this graph is:
$$\mathbb{P}(X) \cdot \mathbb{P}(Y) \cdot \mathbb{P}(R) \cdot \mathbb{P}(A|Y) \cdot \mathbb{P}(S|A,R) \cdot \mathbb{P}(T|S) \cdot \mathbb{P}(L|X,T)$$
Suppose, for simplicity, that each of these variables is binary (1=true, 0=false), and that we can estimate their base rates (prior probabilities) by sampling the population (i.e., using an observational approach). As a complete contrivance, let's say we find that the probability of a random person carrying gene X (conferring vulnerability to lung cancer) is 20%, that of gene Y (conferring predisposition to addictive behaviours) is 30%, and that of being exposed to peer pressure is 75%. In math, \(\mathbb{P}(X=1)=0.2\), \(\mathbb{P}(Y=1)=0.3\), and \(\mathbb{P}(R=1)=0.75\).
The conditional probabilities can be shown as tables:
\(Y\) | \(\mathbb{P}(A|Y)\) | \(\mathbb{P}(A)\) |
---|---|---|
1 | 0.60 | 0.18 |
0 | 0.25 | 0.175 |
Sum: | | 0.355 |
With the marginal probability computed as:
$$\mathbb{P}(A)=\mathbb{P}(Y=1)\mathbb{P}(A|Y=1)+\mathbb{P}(Y=0)\mathbb{P}(A|Y=0)=0.3 \cdot 0.6 + 0.7 \cdot 0.25 = 0.355$$
\(A\) | \(R\) | \(\mathbb{P}(S|A,R)\) | \(\mathbb{P}(S)\) |
---|---|---|---|
1 | 1 | 0.60 | 0.160 |
1 | 0 | 0.35 | 0.031 |
0 | 1 | 0.25 | 0.121 |
0 | 0 | 0.06 | 0.010 |
Sum: | | | 0.321 |
With marginal probability: \(\mathbb{P}(S)=\mathbb{P}(A=1)\mathbb{P}(R=1)\mathbb{P}(S|A=1,R=1)+...=0.321\)
\(S\) | \(\mathbb{P}(T|S)\) | \(\mathbb{P}(T)\) |
---|---|---|
1 | 0.90 | 0.289 |
0 | 0.05 | 0.034 |
Sum: | | 0.323 |
With marginal probability: \(\mathbb{P}(T)=0.323\)
\(X\) | \(T\) | \(\mathbb{P}(L|X,T)\) | \(\mathbb{P}(L)\) |
---|---|---|---|
1 | 1 | 0.20 | 0.013 |
1 | 0 | 0.01 | 0.001 |
0 | 1 | 0.09 | 0.023 |
0 | 0 | 0.005 | 0.003 |
Sum: | | | 0.040 |
With marginal probability: \(\mathbb{P}(L)=0.040\)
Thus, this causal model predicts a lifetime risk of lung cancer of about 4%, which is fairly close to empirical findings.
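This calculation is easy to reproduce by brute-force enumeration: write down the conditional probability tables above, take the product dictated by the factorization, and sum over all configurations of the remaining variables. A minimal sketch:

```python
from itertools import product

# Conditional probability tables from the text (each entry is P(node = 1 | parents)).
p_X, p_Y, p_R = 0.2, 0.3, 0.75
p_A = {1: 0.60, 0: 0.25}                                        # P(A=1 | Y)
p_S = {(1, 1): 0.60, (1, 0): 0.35, (0, 1): 0.25, (0, 0): 0.06}  # P(S=1 | A, R)
p_T = {1: 0.90, 0: 0.05}                                        # P(T=1 | S)
p_L = {(1, 1): 0.20, (1, 0): 0.01, (0, 1): 0.09, (0, 0): 0.005} # P(L=1 | X, T)

def bern(p, value):
    """P(node = value) for a binary node with P(node = 1) = p."""
    return p if value == 1 else 1 - p

# Marginalize: sum the joint P(X, Y, R, A, S, T, L=1) over all configurations.
p_lung_cancer = sum(
    bern(p_X, X) * bern(p_Y, Y) * bern(p_R, R)
    * bern(p_A[Y], A) * bern(p_S[(A, R)], S)
    * bern(p_T[S], T) * p_L[(X, T)]
    for X, Y, R, A, S, T in product([0, 1], repeat=6)
)
print(f"P(L=1) ≈ {p_lung_cancer:.3f}")   # ≈ 0.040
```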
We can also use this model to simulate an intervention. Using the do-operator, we can, for instance, create a design that is unethical in practice: force individuals to either smoke over a lifetime (\(do(S=1)\)), or not (\(do(S=0)\)).
Suppose our research question is: "what is the effect of smoking on the lifetime rate of lung cancer"? This can be written as \(\mathbb{P}(L|do(S=1))\). By setting \(S\) to a constant value, we are essentially removing all edges into \(S\). For both possible interventions, the marginal probability of \(T\) adjusts to:
$$\mathbb{P}(T|do(S=0))=0.05$$ $$\mathbb{P}(T|do(S=1))=0.9$$
The adjusted probabilities for \(L\) are:
\(X\) | \(T\) | \(\mathbb{P}(L|X,T)\) | \(\mathbb{P}(L|do(S=0))\) | \(\mathbb{P}(L|do(S=1))\) |
---|---|---|---|---|
1 | 1 | 0.20 | 0.0020 | 0.0360 |
1 | 0 | 0.01 | 0.0019 | 0.0002 |
0 | 1 | 0.09 | 0.0036 | 0.0648 |
0 | 0 | 0.005 | 0.0038 | 0.0004 |
Sum: | | | 0.0113 | 0.1014 |
So, our model predicts that smoking results in a lifetime risk of about 10% for lung cancer. For non-smokers, the risk reduces to about 1.1%, which yields a risk ratio of roughly \(0.101/0.011 \approx 9\). This is roughly in line (across sexes) with the lifetime numbers in this study.
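The intervention itself is also straightforward to compute: under \(do(S=s)\) the factor \(\mathbb{P}(S|A,R)\) is dropped and \(S\) is clamped, so only \(X\), \(T\), and \(L\) still matter. A self-contained sketch, using only the relevant tables from above:

```python
from itertools import product

p_X = 0.2
p_T = {1: 0.90, 0: 0.05}                                        # P(T=1 | S)
p_L = {(1, 1): 0.20, (1, 0): 0.01, (0, 1): 0.09, (0, 0): 0.005} # P(L=1 | X, T)

def bern(p, value):
    return p if value == 1 else 1 - p

def p_L_do_S(s):
    """P(L=1 | do(S=s)): S is clamped, so its own causes (Y, A, R) drop out."""
    return sum(bern(p_X, X) * bern(p_T[s], T) * p_L[(X, T)]
               for X, T in product([0, 1], repeat=2))

risk_off, risk_on = p_L_do_S(0), p_L_do_S(1)
print(f"P(L=1 | do(S=0)) ≈ {risk_off:.3f}")       # ≈ 0.011
print(f"P(L=1 | do(S=1)) ≈ {risk_on:.3f}")        # ≈ 0.101
print(f"risk ratio ≈ {risk_on / risk_off:.1f}")   # ≈ 9
```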
Note that, if we wanted to perform statistical inference on this outcome, it would depend on the size of the sample we used to generate our probabilities. Say that our sample had size \(n=500\), then we have two expected values (indicating how many of this sample are expected to develop lung cancer in their lifetimes, given our intervention) of:
$$\mathbb{E}(L|do(S=0)) = n \cdot \mathbb{P}(L|do(S=0)) = 500 \cdot 0.0113 = 5.65$$ $$\mathbb{E}(L|do(S=1)) = n \cdot \mathbb{P}(L|do(S=1)) = 500 \cdot 0.1014 = 50.7$$
We can now use a chi-square test or Fisher's exact test to test whether our observed difference \(\Delta\mathbb{E}=\mathbb{E}(L|do(S=1))-\mathbb{E}(L|do(S=0))\) is expected under the null hypothesis \(H_0: \Delta\mathbb{E}=0\). This is our contingency table, after rounding these expected values to integers:
\(do(S)\) | \(L=0\) | \(L=1\) |
---|---|---|
\(S=0\) | 494 | 6 |
\(S=1\) | 449 | 51 |
For our data, the test is significant (\(p<0.001\)), indicating that this causal effect is likely generalizable.
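For completeness, here is how such a test might be run with SciPy, using the rounded counts from the contingency table above:

```python
from scipy import stats

# Contingency table from above (rows: do(S=0), do(S=1); columns: L=0, L=1).
table = [[494, 6],
         [449, 51]]

odds_ratio, p_value = stats.fisher_exact(table)
print(f"Fisher's exact test: odds ratio = {odds_ratio:.1f}, p = {p_value:.2e}")  # p far below 0.001
```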
Dealing with unobserved data
In this lung cancer example, we observed all the variables of interest, and assumed that there were no unmeasured confounders (variables that commonly influence two or more of the model variables). If this is true, the model is called causally sufficient 10. One way to formally represent causal sufficiency is to add a new node for each node in the model, representing the unmeasured influences (or "noise") for its variable. If each of these noise nodes has only one (outgoing) edge, we are explicitly asserting causal independence between each, and excluding the existence of confounders. Our lung cancer model would then look like this:
I used circles instead of images here, with (following Pearl) open circles to represent unobserved variables and dashed arrows to represent their causal influence on target nodes. Adding error nodes tends to make the graph cluttered, so they can often be omitted with the assumption simply stated that they are independent (or that the model is causally sufficient).
If a confounder is known (or suspected) but unmeasured, it should be added to the graph in order to make it causally sufficient. For our smoking model, let's consider the possibility that cultural acceptance, \(C\), of cigarette smoking (which we have not measured) can influence the probabilities of being peer pressured, of deciding to smoke, and also (via second-hand smoke, \(H\), in social venues) the level of tar in one's lungs. Our updated graph would look like this:
Since we have no data on \(C\) or \(H\), we can't include them in our conditional probability calculations. What now?
Well, our goal here is to demonstrate that these confounders don't matter for determining an answer to our research question. In other words, for our lung cancer example, we want to show that the two unobserved variables, \(C\) and \(H\), are independent of our variable of interest, \(L\), conditional on one (or more) of our observed variables.
The following sections delve deeper into this goal.
D-separation
If we put all the variables in our gas pedal example together in a single graph, we get this 11:
We can use this causal model to look for a path between pairs of variables, and assess whether other nodes block that path. By "path" we are talking about any set of edges connecting two nodes, regardless of their direction. A path, in other words, is allowed to go against the direction of the arrows. When we say that a node (or set of nodes) \(Z\) "blocks" a path between two nodes \(A\) and \(B\), we mean that these variables become independent, conditional on \(Z\). We can use the properties of chains, forks, and colliders, described above, to determine this.
Say, for instance, that we wanted to test the hypothesis that the throttle valve drive (\(X_3\)) is necessary for the causal link between depressing the gas pedal (\(A\)) and acceleration (\(B\)). In other words, we want to test whether \(A\) is independent of \(B\), conditional on \(X_3\). To do this, we have to show that all paths between \(A\) and \(B\) are blocked by \(X_3\). This is straightforward to assess visually, since the only path between \(A\) and \(B\) (shown in green above) is a causal chain that contains \(X_3\).
A node, or more generally a set of nodes \(Z\), that blocks all paths between two other nodes \(A\) and \(B\) is said to d-separate these nodes 12. In other words, if \(A\) and \(B\) are d-separated by \(Z\), they are independent, conditional on \(Z\). If two variables are not d-separated by \(Z\), they are said to be d-connected.
Some useful rules about d-connectedness:
- Two nodes \(A\) and \(B\) are unconditionally d-connected if there is a path between them that does not contain a collider (two arrows meeting). In the graph below, \(A\) and \(B\) are unconditionally d-connected, whereas \(U\) and \(V\) are not:
- Two nodes \(A\) and \(B\) are conditionally d-connected, conditional on a third node (or set of nodes) \(Z\), if there is a path \(L\) between them such that each node \(X_i \in L\): (a) is not a collider, and (b) is not a member of \(Z\); except in the case of the third rule below. If this is not the case, then \(Z\) is said to d-separate \(A\) and \(B\). In the graphs below, \(A\) and \(B\) are d-connected, conditional on \(Z\), whereas \(U\) and \(V\) are not:
- \(A\) and \(B\) are also conditionally d-connected if a node \(X_i \in L\) is a collider, and either it or one of its descendants is a member of \(Z\). This one's tricky, but refers to the counterintuitive property of colliders described above; i.e., that the (causally independent) causes of a common effect become dependent (correlated) when conditioning on that effect. In the graph below, I've changed the direction of edge \(Z \rightarrow V\) to \(V \rightarrow Z\). This makes \(Z\) a collider in the path \(U \rightarrow Z \leftarrow V\), meaning that \(U\) and \(V\) are now d-connected, conditional on \(Z\):
The "descendant" part of rule 3 needs further elaboration. It implies that in the graph below, \(U\) and \(V\) are still d-connected, because \(Z\) is a descendant of collider \(B\):
Why is this? Essentially, because by conditioning on \(Z\), we are also constraining the value of \(B\), which causes \(Z\). That partial information about \(B\) is enough to make \(U\) and \(V\) conditionally dependent, just as conditioning on \(B\) itself would.
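A quick simulation (same flavour as the collider example earlier, with made-up linear relationships) shows that conditioning on the descendant \(Z\) is indeed enough to induce a dependence between \(U\) and \(V\):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

U = rng.normal(size=n)
V = rng.normal(size=n)                    # independent of U
B = U + V + 0.1 * rng.normal(size=n)      # collider
Z = B + 0.1 * rng.normal(size=n)          # descendant of the collider

band = np.abs(Z - 1.0) < 0.1              # "condition" on Z by selecting a narrow range
print("corr(U, V):        ", round(np.corrcoef(U, V)[0, 1], 3))              # ~0
print("corr(U, V | Z ≈ 1):", round(np.corrcoef(U[band], V[band])[0, 1], 3))  # clearly negative
```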
Using D-separation to deal with confounders
We can apply the concept of d-separation to the problem of determining whether our unobserved variables are confounders. Returning to our lung cancer example:
We want to determine whether our inference about the causal effect of smoking \(S\) on lung cancer \(L\) is confounded by cultural acceptance \(C\). This is the same as asking whether \(C\) and \(L\) are d-separated, conditional on \(S\). Conditioning on \(S\), we can see that the paths \(C \rightarrow S \rightarrow T \rightarrow L\) (shown in green) and \(C \rightarrow R \rightarrow S \rightarrow T \rightarrow L\) (shown in red) are blocked, but the path \(C \rightarrow H \rightarrow T \rightarrow L\) (shown in blue) is not.
In a nutshell, this means that we cannot infer about the causal effect of smoking on lung cancer, given the unmeasured confounders of cultural acceptance and second-hand smoke. This is because the probability of \(T\) now depends on both \(S\) and \(H\), and the latter cannot be determined.
Eek.
Notably, if the route via second-hand smoke did not exist (e.g., because smoking in public spaces is banned), then \(C\) is d-separated from \(L\), and we could happily ignore it in our causal inference \(S \rightarrow L\). Sadly, this is not the case.
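If you prefer to check d-separation programmatically rather than by eye, the networkx library provides a helper for exactly this (a small sketch; in recent networkx releases the function is named nx.is_d_separator, while older releases call it nx.d_separated):

```python
import networkx as nx

# The expanded smoking model, including the unobserved C (cultural acceptance)
# and H (second-hand smoke).
G = nx.DiGraph([
    ("Y", "A"), ("A", "S"), ("R", "S"), ("S", "T"), ("T", "L"), ("X", "L"),
    ("C", "R"), ("C", "S"), ("C", "H"), ("H", "T"),
])

print(nx.is_d_separator(G, {"C"}, {"L"}, {"S"}))       # False: the C -> H -> T -> L path is open
print(nx.is_d_separator(G, {"C"}, {"L"}, {"S", "H"}))  # True: conditioning on H as well blocks it

# If the second-hand smoke route did not exist, S alone would d-separate C from L:
G.remove_edge("H", "T")
print(nx.is_d_separator(G, {"C"}, {"L"}, {"S"}))       # True
```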
So what are our options? Here are two:
- If we are still in the planning phase of the research design, this model could inform us of the importance of including second-hand smoke as a measured variable, while also informing us that, at least for the \(do(S=s)\) intervention, cultural acceptance — and all other ancestors of \(S\) — do not need to be measured.
- Depending on the size of our dataset, we could decide to analyze a subset of it, such that \(C\) is constant (in other words, only choose localities where smoking is or is not accepted). This allows us to factor out the influence of \(H\), without determining it directly 13.
Wrapping up
Okay, deep breath. Hopefully some of you have not only made it to this point, but have found some inspiration about how causal inference might be applicable to your own research interests. It is not a trivial topic, but neither should it be considered voodoo, especially after practicing a bit with graph representations of your data/model!
This is only the tip of a large iceberg, however. There is much more to learn, including:
- The examples above take a predetermined causal graph as a starting point, but what if we do not have a causal structure in mind, and/or would like to determine the most likely causal graph given our observational dataset? This is the goal of causal discovery approaches (now the subject of another blog post).
- Many datasets consist of time series (or dynamical) data. These can be represented as graphs whose causal structure may be fixed, but whose variables are random processes that fluctuate over time. The influence of variable \(A\) on \(B\) may have an inherent delay (or lag). Here is a taster of methods that exist to infer causality in such systems.
- Many systems, such as the throttle example above, actually include cycles (e.g., feedback loops where sensors cause changes in their own ancestors). How do we model these systems? The answer lies partially in the time series approaches introduced in the preceding bullet point, but it's an important question. Methods have been developed to deal with cyclical causal models (see this post).
These are topics I hope to tackle in future blog posts.
Some further useful reading, and resources that have helped me with this post (see also the footnotes below):
- Arthur Mello, Towards Data Science Medium post (starting point for numerous other posts on TDS)
- Judea Pearl's original paper
- Pearl and MacKenzie's The Book of Why, a more generally accessible book explaining the concepts of causal inference and its importance
- Free online course (with R code) from Leslie Myint at Macalester College
1. That being said, I implore anyone reading this to get in touch with any criticism or doubts about the veracity of the contents of this blog post. I will iteratively update the document and make the necessary apologies. My hope is that this really becomes a useful inroad to an important and complex field, and my worry is that I introduce erroneous or misleading information to anyone attempting to follow this inroad. ↩︎
2. The symbol "\(¬\)" signifies negation. ↩︎
3. Note that the expectation is a generalization of the discrete event formulation. For example, considering that the variable HasCancer has a binary distribution (can be either "yes" or "no"), its expected value over a sample of size \(n\) is \(\mathbb{E}(X)=np\), where \(n\) is the sample size, and \(p\) is the probability. See https://online.stat.psu.edu/stat500/lesson/3/3.2/3.2.2. ↩︎
4. To my knowledge, this important experiment has never been conducted... ↩︎
5. This refers to correlation in its broad sense, as any statistical relationship between two variables. In its common use, correlation is a form of dependence between two variables where one is a monotonic function of the other. There are many forms of dependence where the function mapping one variable to the other is not monotonic, such as the function of a circle (\(x^2+y^2=r^2\)). The probabilistic reasoning in this post still holds for such forms of dependency. Thanks to Benedikt Ehinger for pointing this out! ↩︎
6. In his original text, Judea Pearl introduces the do operator using the notation \(\check{A}\). I find "do" used as a function to be more intuitive, and it has become common notation. ↩︎
7. We haven't yet modelled "noise", but assume that the unmeasured noise factors for \(B\) and \(X_5\) are independent. ↩︎
8. The logic here is identical to that of the Monty Hall problem. See this article for elaboration. ↩︎
9. The pi symbol, \(\prod\), is the product operator, indicating a multiplication over all members of a sequence. As a second point here, the first term in this product, \(\mathbb{P}(A)\), is unconditional, because it has no parents; i.e., \(Pa(A)=\emptyset\), where \(\emptyset\) denotes the empty set. Conditioning on an empty set is technically undefined, but the lack of conditioning variables is typically understood to imply unconditionality. See this thread for nerdy discussion. ↩︎
10. For a deeper (philosophical) dive into the concept of causal sufficiency, see this article. ↩︎
11. To avoid introducing a cycle (and violating the DAG assumption) I haven't closed the feedback loop between the throttle sensor \(X_5\) and the control unit \(X_2\). ↩︎
12. The "d" here stands for "directional". See this explanation by Judea Pearl. ↩︎
13. This is because our marginal probability \(\mathbb{P}(H)\) becomes a constant, which can be conveniently dropped when testing the hypothesis \(H_0: \Delta\mathbb{E}=0\) (i.e., it appears on both sides of the comparison). ↩︎