Causal inference: An introduction
Published on 2023-07-17
by Andrew Reid

Edits: (21/07/23) clarification that dependence is not equivalent to (linear) correlation, thanks to Benedikt Ehinger for catching this. (25/11/24) Fixed row ordering issue in joint probability table; and added reference to causal discovery blog post.

Full disclosure: I am not an expert in causal inference. At the outset of writing this blog post, I am very much a rookie in this field, having thought a bit about it and read a few books and papers. However, I have come to realize that the best way to learn a subject — deeply, pedantically learn it — is through trying to teach it. Putting oneself in the role of teacher induces a sort of dread, of appearing incompetent, unprepared, or unknowledgeable about the subject you are teaching. This dread provides ample motivation to dig into the details and get things right — at least that's how it works for me.

So here goes nothing [1].

What is causal inference?

"Inference" refers to any assertion about the general from the specific, or in other words, a statement which generalizes to a population from a sample of that population. A causal inference asserts that some class of event generally causes another class of event to occur. That's frustratingly abstract. Some concrete examples of causal inferences are:

  • Pressing a gas pedal causes a car to accelerate
  • Smoking tobacco causes lung cancer
  • Eating beans causes flatulence
  • Depolarization of the neuronal membrane causes an action potential

You may agree with some or all of these statements, but you might also go "hmmm". It is sometimes true, for instance, that pressure on a gas pedal causes a car to accelerate, but only under certain conditions: the tank must contain fuel, the engine must be firing, and the drive shaft must be engaged. Smoking tobacco increases the probability of developing lung cancer, but many people smoke their entire lives without developing it. Despite the popular tune, not everyone farts when they eat beans. And, for those neuroscientists out there: a depolarization of a neuronal membrane is typically necessary for it to generate an action potential, but only if it surpasses the threshold potential.

These are important considerations, and tell us a bit about the nature of causal inferences.

Firstly, these assertions are conditional, meaning they assume that some set of conditions (e.g., "engine must be firing") is met. You can be more or less pedantic about these conditions. We would typically assume, for instance, that in making the gas pedal assertion, you are talking about a functionally intact car, and not one that has recently collided with a brick wall. We might also assume that the fuel in the tank is the correct type of fuel for the engine. Much of this is implied, and it can become tiresome to enumerate all the conditional assumptions one has to make in order to support a causal assertion, but it is nonetheless important to consider these whenever confusion might arise.

Secondly, these assertions are probabilistic. This means that event \(A\) doesn't always cause event \(B\) to occur, but rather increases the probability that it will. This can be formulated as a conditional probability:

$$\mathbb{P}(B|A)>\mathbb{P}(B|¬A)$$

the left side of which can be read as "the probability of B occurring given that A has occurred", and the right side as "the probability of B occurring given that A has not occurred" [2].

The question of what would have happened had event \(A\) not occurred is referred to as a counterfactual, and is a central concept in causal analysis.

For the gas pedal example, the difference \(\mathbb{P}(B|A)-\mathbb{P}(B|¬A)\) will be close to one, indicating that pressing the gas pedal (event \(A\)) will almost certainly accelerate the car (event \(B\)), while acceleration is unlikely to occur in its absence.

For the lung cancer assertion, however, this difference is somewhere between 0 and 1. We can make a rough estimate of this: the lifetime chance of getting lung cancer (before age 80) has been estimated at 1 in 7 (or 14%) for smokers, and 1 in 100 (or 1%) for non-smokers. So the difference for this causal assertion is:

$$\mathbb{P}(B|A)-\mathbb{P}(B|¬A)=0.14-0.01=0.13$$

In general, we are interested in random variables, rather than discrete events. Random variables represent properties or behaviours of a particular population, and are characterized by probability distributions. To formulate our causal inferences in terms of random variables, we can use expected values (denoted \(\mathbb{E}\)), rather than probabilities. This gives us an effect size [3]:

$$\Delta\mathbb{E} =\mathbb{E}(B|A)-\mathbb{E}(B|¬A)$$
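As a minimal sketch of what this looks like in practice, here is a numpy simulation that estimates this effect size for the smoking example, using the rates quoted above; the 25% smoking prevalence is a made-up figure for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Simulate lifetime smoking status (A) and lung cancer (B) using the
# rates quoted above: P(B=1|A=1) = 0.14, P(B=1|A=0) = 0.01.
A = rng.random(n) < 0.25            # assumption: ~25% of the population smokes
B = np.where(A, rng.random(n) < 0.14, rng.random(n) < 0.01)

# For binary variables, the conditional expectation is just the
# conditional probability, so this effect size is a risk difference.
delta = B[A].mean() - B[~A].mean()
print(f"Estimated E(B|A) - E(B|¬A) = {delta:.3f}")   # ≈ 0.13
```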

Confounding

However, just observing that \(\Delta\mathbb{E}>0\) is not sufficient to support an inference that \(A\) causes \(B\) (denoted \(A \rightarrow B\)). That is because of confounding. You might observe that you fart whenever you eat beans, but you may also always eat beans with tomato sauce, while sitting down. How can you be sure it's not the sauce and/or the sitting that are causing flatulence?

The classic approach to this is to conduct a randomized controlled experiment. You would give one randomly assigned group beans, tomato sauce, and chairs, and another group the same minus the beans. Here, you have controlled for the confounders and manipulated only the presence of beans [4].

[Figure: schematic of the imaginary experiment, with a "beans" group and a "no beans" group; the presence of flatulence is depicted by two women, one holding her nose and the other sniffing the air happily.]
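Here is a small simulated version of this experiment. The "ground truth" below (only the beans cause flatulence, with the stated probabilities) is, of course, a complete contrivance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical ground truth (an assumption for this sketch): only beans
# cause flatulence; sauce and chairs do nothing.
def flatulence(beans, rng):
    return rng.random(beans.size) < np.where(beans, 0.8, 0.1)

# Randomized experiment: beans are assigned by coin flip; sauce and
# chairs are given to everyone, so they cannot differ between groups.
beans = rng.random(n) < 0.5
y = flatulence(beans, rng)

effect = y[beans].mean() - y[~beans].mean()
print(f"Estimated causal effect of beans: {effect:.2f}")  # ≈ 0.7
```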

This is straightforward to do with beans and flatulence, or indeed with cars and gas pedals. With the relationship between tobacco smoking and cancer, however, it's very difficult to perform an experiment that lasts the lifetimes of all participants, and very unethical to manipulate groups such that one is administered tobacco throughout life and the other not.

This difficulty is inherent in many important questions we would like causal answers to: the factors leading to Alzheimer's-type dementia, the relationship between alcohol dependence and depression, whether cannabis consumption causes schizophrenia, etc. These are typically probabilistic phenomena that occur over long time scales, can only be observed rather than manipulated, and entail numerous potential confounders.

At a different scale of observation, another important causal question is whether we can infer causal relationships between neurons or regions of the human brain. Since there are an estimated 86 billion neurons and on the order of \(10^{15}\) connections between them, this question is plagued by an extreme overabundance of confounders. There is no conceivable experiment that can help us disentangle these confounders when we attempt to perform causal inference on individual neuronal connections (but see this blog post for a discussion).

Critical to this discussion is the concept of dependence. Generally speaking, two variables are dependent if there exists a non-zero correlation between them, and independent if not [5]. Dependence can arise from a causal relationship, but also from a non-causal one (such as having a common third variable that causes both). When referring to dependence between two variables such that one causes the other (either directly or via a "chain" of intermediary events), I will here use the term causal dependence. In what follows, we will be using these terms fairly extensively, so it's important to appreciate the distinction.

The "do" operator

Datasets based on passive observation, rather than experimental manipulation, are commonly called observational data. As we've seen, for such a dataset, the difference \(\mathbb{E}(B|A)-\mathbb{E}(B|¬A)\) is insufficient evidence for inferring the causal effect \(A \rightarrow B\), due to the presence of uncontrolled confounders.

To address this, we can introduce a new function called the do operator, denoted \(do(A)\) [6]. This operator signifies that we (or some imaginary experimenter) have intervened to cause event \(A\) to occur. In our beans experiment above, we intervened to give beans to one group, so our inference is based on a test of:

$$\Delta\mathbb{E}=\mathbb{E}(B|do(A))-\mathbb{E}(B|¬do(A))>0$$

This formulation extends naturally to experimental data, but — crucially — is also useful to analyze purely observational data (where no actual intervention has been done), given a few important constraints which we will explore below.

Causal graphs: Visualizing the logic

In a seminal paper called Causal Diagrams for Empirical Research, Judea Pearl introduced the idea of using graph representations of hypothesized relationships between variables to help simplify and support causal analyses. The basic idea is simple: represent variables as nodes and direct causal relationships as directed edges (i.e., arrows). Such a graph is a visual representation of a causal model.

Here is what our beans example would look like as a graph:

[Figure: causal graph with edges beans → flatulence, sauce → flatulence, and chair → flatulence.]

Importantly, there is as much information in absent edges as in present ones. Absent edges represent the assumption that two variables are causally independent of one another. Even prior to our experiment, we will probably be happy to assume that eating beans, eating sauce, and sitting in a chair do not cause each other.

Two important properties of the above graph representation are that (1) it is directed (arrows indicate the direction of the causal relationship for each edge), and (2) it is acyclic; in other words, there is no way to follow the edges such that you end up where you started. Such a graph is referred to as a directed acyclic graph (or DAG). Causal graph representations must have these properties in order for many of the tricks described below to be valid.

DAGs often use familial terms (think family tree) to refer to relationships between nodes. Thus, in the edge \(A \rightarrow B\), \(A\) is a parent of \(B\) and \(B\) is a child of \(A\). In the set of edges \(A \rightarrow B \rightarrow C\), \(A\) is an ancestor of both \(B\) and \(C\), which are descendants of \(A\). In the graph \(B \leftarrow A \rightarrow C\), \(B\) and \(C\) are siblings. Hopefully you get the picture.
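As an aside for those who like to tinker: the Python networkx library makes it easy to build such graphs and query these familial relationships. A minimal sketch of the beans model:

```python
import networkx as nx

# Beans model: three causally independent exposures, one outcome.
G = nx.DiGraph()
G.add_edges_from([
    ("beans", "flatulence"),
    ("sauce", "flatulence"),
    ("chair", "flatulence"),
])

print(nx.is_directed_acyclic_graph(G))     # True: a valid DAG
print(nx.ancestors(G, "flatulence"))       # {'beans', 'sauce', 'chair'}
print(list(G.predecessors("flatulence")))  # parents of the outcome node
```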

Graph junctions: Chains, forks, and colliders

Graph junctions in DAGs refer to the patterns of input and output edges for a given node (or set of nodes). Let's consider the gas pedal example. Our simple causal relationship was \(A \rightarrow B\), where \(A\) is a gas pedal press event, and \(B\) is an acceleration event. A more detailed schematic of a modern throttle system looks something like this, however:

[Figure: schematic of a modern throttle system, with components labelled A, B, X1, X2, X3, and X4.]

Ignoring the sensor feedback loop, and the additional steps linking throttle valve state to acceleration of the car, we still have a lot of additional intermediate variables to possibly consider. The diagram above can be expressed as: \(A \rightarrow X_1 \rightarrow X_2 \rightarrow X_3 \rightarrow X_4 \rightarrow B \). Or, as a graph:

This configuration is called a causal chain, referring to the series of "links" between each subsequent event. Two important properties of causal chains are:

  1. We can (usually) remain confident that there is a causal relationship between any node and each of its ancestors, however far up the chain. In other words, our initial model \(A \rightarrow B\) is still true. This is due to transitivity.
  2. If we can observe \(X_4\) (the state of the throttle valve), we don't need to know anything about its ancestors to predict \(B\) (vehicle acceleration). In other words, \(B\) is independent of \(A\), after accounting for (or manipulating) \(X_4\). This is called conditional independence.

The term conditional is important here: when we say "conditional on \(X\)," we are implying either that (a) we have manipulated \(X\) to fix its value — \(do(X=x)\) — or (b) we have accounted for or "regressed out" the covariance of \(X\) with the nodes of interest.

The notation for saying "\(A\) is independent of \(B\) conditional on \(X\)" is:

$$A \! \perp \!\!\! \perp \! B | X$$
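To see conditional independence in action, here is a small simulation of a linear-Gaussian chain \(A \rightarrow X \rightarrow B\) (a simplified stand-in for the throttle chain, with arbitrary coefficients). Conditioning on \(X\) is done by regressing it out of both variables and correlating the residuals (a partial correlation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Linear-Gaussian chain A -> X -> B, with arbitrary coefficients.
A = rng.normal(size=n)
X = 2.0 * A + rng.normal(size=n)       # X is caused by A
B = 1.5 * X + rng.normal(size=n)       # B is caused by X only

print(np.corrcoef(A, B)[0, 1])         # strongly correlated (~0.86)

# Condition on X by regressing it out of both A and B and
# correlating the residuals (a partial correlation).
resid = lambda v: v - np.polyval(np.polyfit(X, v, 1), X)
print(np.corrcoef(resid(A), resid(B))[0, 1])   # ≈ 0: A ⫫ B | X
```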

Another common type of graph junction is called a fork. Forks occur when one event causes two or more subsequent (causally independent) events. Continuing the above example, when the throttle opens, this causes both \(B\) (acceleration) and \(X_5\) (a change in the throttle sensor's output to the control unit):

[Figure: closeup of the throttle schematic, adding a new component X5, the throttle valve sensor, to demonstrate a fork configuration.]

The fork above can be written as: \(X_5 \leftarrow X_4 \rightarrow B\). Or, as a graph:

Despite being causally independent, the child nodes of a causal fork (\(B\) and \(X_5\)) are likely to be dependent (correlated) because they have a common causal factor (the throttle sensor will be strongly correlated with vehicle acceleration despite having no direct causal relationship). However, this correlation would not be present if we fixed the value of the throttle [7]; in other words, they are independent, conditional on \(X_4\). This can be expressed as equal probabilities:

$$B \! \perp \!\!\! \perp \! X_5 | X_4 \implies \mathbb{P}(B|X_4,X_5) = \mathbb{P}(B|X_4)$$ $$B \! \perp \!\!\! \perp \! X_5 | X_4 \implies \mathbb{P}(X_5|X_4,B) = \mathbb{P}(X_5|X_4)$$
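A quick simulation (again with arbitrary coefficients) shows the same pattern for a fork: the children are dependent until we condition on their common parent:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Fork: X4 (throttle valve) causes both B (acceleration) and
# X5 (sensor output); B and X5 have no direct causal link.
X4 = rng.normal(size=n)
B  = 1.2 * X4 + rng.normal(size=n)
X5 = 0.8 * X4 + rng.normal(size=n)

print(np.corrcoef(B, X5)[0, 1])   # clearly non-zero: dependent

# Condition on X4 by regressing it out and correlating the residuals.
resid = lambda v: v - np.polyval(np.polyfit(X4, v, 1), X4)
print(np.corrcoef(resid(B), resid(X5))[0, 1])  # ≈ 0: B ⫫ X5 | X4
```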

A third type of junction is called a collider. This is when an event is caused by two or more antecedent events; or in other words, when a node has two or more parents. To stick with the throttle system example, suppose we added a cruise control system \(X_6\), that automatically adjusts the output of the throttle control unit (\(X_2\)):

[Figure: modified throttle schematic, adding a new component X6, a cruise control input, to demonstrate a collider configuration.]

In this system, both the gas pedal and the cruise control system cause changes to the control unit's output (and subsequent events). This relationship can be written as \(X_1 \rightarrow X_2 \leftarrow X_6\). Or, as a graph:

Colliders have a fairly unintuitive property: despite \(X_1\) and \(X_6\) being causally independent, if we fix the value of \(X_2\), they become dependent (correlated) [8]. For example, if we observe that the output of the control unit is 50% throttle, but we are not pressing the gas pedal, we can deduce that the cruise control is active. This is called conditional dependence, and \(X_1\) and \(X_6\) are said to be dependent conditional on \(X_2\).
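This one is worth simulating to convince yourself. In the sketch below (arbitrary coefficients again), conditioning on the collider is done by selecting a narrow slice of its values; within that slice, the two causally independent parents become strongly (negatively) correlated:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Collider: the pedal (X1) and cruise control (X6) both drive the
# control unit's output (X2); X1 and X6 are causally independent.
X1 = rng.normal(size=n)
X6 = rng.normal(size=n)
X2 = X1 + X6 + 0.1 * rng.normal(size=n)

print(np.corrcoef(X1, X6)[0, 1])   # ≈ 0: independent

# Condition on the collider by selecting a narrow slice of X2: within
# the slice, X1 + X6 ≈ constant, so knowing X1 tells you about X6.
sel = np.abs(X2) < 0.1
print(np.corrcoef(X1[sel], X6[sel])[0, 1])  # strongly negative: dependent
```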

Probabilistic reasoning on DAGs

A causal model expressed as a DAG where nodes are events or continuous random variables is also called a causal Bayesian network. Bayesian networks allow us to:

  1. Assign joint probabilities (or probability distributions) to the model, and prior probabilities to its nodes, based on its causal structure and the conditional independence relationships it imposes
  2. Simulate interventions using the do-operator, and adjust the causal graph accordingly to predict the change in its probability distributions

Let's consider these in turn.

Joint probability on a Bayesian network refers to the probability of all variables having specific values; i.e., probabilities of discrete states of the system. This is computed using the chain rule of probability, which states that the joint probability of a (topologically ordered) set of variables is:

$$\mathbb{P}(X_1,...,X_n)=\mathbb{P}(X_1)\mathbb{P}(X_2|X_1)\mathbb{P}(X_3|X_1,X_2) \cdot ... \cdot \mathbb{P}(X_n|X_1,...,X_{n-1})$$

In words, this means that the joint probability is the product of the marginal probability of the first node, with the probabilities of all other nodes, conditional on their ancestors. Determining these conditional probabilities can get quite complicated, but we can use what we learned above about conditional independence relationships to simplify this calculation. Specifically, we know that conditioning on a node's parent renders it independent of all other ancestors. For our chain example:

The full chain of ancestors for each node can be reduced to a single parent, e.g.: \(\mathbb{P}(X_4|A,X_1,X_2,X_3) = \mathbb{P}(X_4|X_3)\). This reduces our joint probability for a causal model to [9]:

$$\mathbb{P}(X_1,...,X_n)=\prod_{k=1}^{n}\mathbb{P}(X_k|Pa(X_k))$$

where \(Pa(X_k)\) refers to the set of parent nodes of \(X_k\). For the chain example this is:

$$\mathbb{P}(A) \cdot \mathbb{P}(X_1|A) \cdot \mathbb{P}(X_2|X_1) \cdot \mathbb{P}(X_3|X_2) \cdot \mathbb{P}(X_4|X_3) \cdot \mathbb{P}(B|X_4)$$
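As a minimal sketch in code, here is this factorization for a three-node binary chain \(A \rightarrow X_1 \rightarrow X_2\), with made-up conditional probabilities:

```python
# Hypothetical CPTs for a binary chain A -> X1 -> X2.
p_A = 0.5                  # P(A=1)
p_X1 = {1: 0.9, 0: 0.2}    # P(X1=1 | A=a)
p_X2 = {1: 0.7, 0: 0.1}    # P(X2=1 | X1=x1)

def bern(p, v):
    """P(V=v) for a Bernoulli variable with P(V=1)=p."""
    return p if v == 1 else 1.0 - p

def joint(a, x1, x2):
    # Product over nodes of P(node | parents): the factorization above.
    return bern(p_A, a) * bern(p_X1[a], x1) * bern(p_X2[x1], x2)

# The joint probabilities over all states sum to 1, as they must.
total = sum(joint(a, x1, x2)
            for a in (0, 1) for x1 in (0, 1) for x2 in (0, 1))
print(total)           # 1.0
print(joint(1, 1, 1))  # P(A=1, X1=1, X2=1) = 0.5 * 0.9 * 0.7 = 0.315
```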

To get more concrete, let's circle back to our smoking and lung cancer example. As a causal graph, we can posit a model such as:

[Figure: causal graph with nodes X (gene conferring vulnerability to lung cancer), Y (gene conferring predisposition to addictive behaviours), R (peer pressure), A (addictive behaviour), S (smoking), T (tar in lungs), and L (lung cancer), and edges X → L, Y → A, A → S, R → S, S → T, and T → L.]

The joint probability for this graph is:

$$\mathbb{P}(X) \cdot \mathbb{P}(Y) \cdot \mathbb{P}(R) \cdot \mathbb{P}(A|Y) \cdot \mathbb{P}(S|A,R) \cdot \mathbb{P}(T|S) \cdot \mathbb{P}(L|X,T)$$

Suppose, for simplicity, that each of these variables is binary (1=true, 0=false), and that we can estimate their base rates (prior probabilities) by sampling the population (i.e., using an observational approach). As a complete contrivance, let's say we find that the probability of a random person carrying gene X (conferring vulnerability to lung cancer) is 20%, that of gene Y (conferring predisposition to addictive behaviours) is 30%, and that of being exposed to peer pressure is 75%. In math, \(\mathbb{P}(X=1)=0.2\), \(\mathbb{P}(Y=1)=0.3\), and \(\mathbb{P}(R=1)=0.75\).

The conditional probabilities can be shown as tables:

\(Y\) \(\mathbb{P}(A|Y)\) \(\mathbb{P}(A{=}1,Y)\)
1 0.60 0.180
0 0.25 0.175
Sum: 0.355

With the marginal probability computed as:

$$\mathbb{P}(A)=\mathbb{P}(Y=1)\mathbb{P}(A|Y=1)+\mathbb{P}(Y=0)\mathbb{P}(A|Y=0)=0.3 \cdot 0.6 + 0.7 \cdot 0.25 = 0.355$$

\(A\) \(R\) \(\mathbb{P}(S|A,R)\) \(\mathbb{P}(S{=}1,A,R)\)
1 1 0.60 0.160
1 0 0.35 0.031
0 1 0.25 0.121
0 0 0.06 0.010
Sum: 0.321

With marginal probability: \(\mathbb{P}(S)=\mathbb{P}(A=1)\mathbb{P}(R=1)\mathbb{P}(S|A=1,R=1)+...=0.321\)

\(S\) \(\mathbb{P}(T|S)\) \(\mathbb{P}(T{=}1,S)\)
1 0.90 0.289
0 0.05 0.034
Sum: 0.323

With marginal probability: \(\mathbb{P}(T)=0.323\)

\(X\) \(T\) \(\mathbb{P}(L|X,T)\) \(\mathbb{P}(L{=}1,X,T)\)
1 1 0.20 0.013
1 0 0.01 0.001
0 1 0.09 0.023
0 0 0.005 0.003
Sum: 0.040

With marginal probability: \(\mathbb{P}(L)=0.040\)

Thus, this causal model predicts a lifetime risk of lung cancer of about 4%, which is fairly close to empirical findings.
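These marginals can be reproduced by brute-force enumeration: sum the factorized joint probability over all \(2^7\) configurations of the model. A sketch:

```python
from itertools import product

# Prior probabilities of the root nodes (from the text).
pX, pY, pR = 0.2, 0.3, 0.75

# Conditional probability tables, keyed by parent states.
pA = {1: 0.60, 0: 0.25}                            # P(A=1 | Y)
pS = {(1, 1): 0.60, (1, 0): 0.35,
      (0, 1): 0.25, (0, 0): 0.06}                  # P(S=1 | A, R)
pT = {1: 0.90, 0: 0.05}                            # P(T=1 | S)
pL = {(1, 1): 0.20, (1, 0): 0.01,
      (0, 1): 0.09, (0, 0): 0.005}                 # P(L=1 | X, T)

def bern(p, v):
    return p if v == 1 else 1.0 - p

def marginal(node):
    """Sum the factorized joint over all configurations where node=1."""
    total = 0.0
    for x, y, r, a, s, t, l in product((0, 1), repeat=7):
        joint = (bern(pX, x) * bern(pY, y) * bern(pR, r)
                 * bern(pA[y], a) * bern(pS[(a, r)], s)
                 * bern(pT[s], t) * bern(pL[(x, t)], l))
        if {"A": a, "S": s, "T": t, "L": l}[node] == 1:
            total += joint
    return total

for node in "ASTL":
    print(node, round(marginal(node), 3))
# A 0.355, S 0.321, T 0.323, L 0.04
```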

We can also use this model to simulate an intervention. Using the do-operator, we can, for instance, create a design that is unethical in practice: force individuals to either smoke over a lifetime (\(do(S=1)\)), or not (\(do(S=0)\)).

Suppose our research question is: "what is the effect of smoking on the lifetime rate of lung cancer"? This can be written as \(\mathbb{P}(L|do(S=1))\). By setting \(S\) to a constant value, we are essentially removing all edges into \(S\). For both possible interventions, the marginal probability of \(T\) adjusts to:

$$\mathbb{P}(T|do(S=0))=0.05$$ $$\mathbb{P}(T|do(S=1))=0.9$$

The adjusted probabilities for \(L\) are:

\(X\) \(T\) \(\mathbb{P}(L|X,T)\) \(\mathbb{P}(L{=}1,X,T|do(S{=}0))\) \(\mathbb{P}(L{=}1,X,T|do(S{=}1))\)
1 1 0.20 0.0020 0.0360
1 0 0.01 0.0019 0.0002
0 1 0.09 0.0036 0.0648
0 0 0.005 0.0038 0.0004
Sum: 0.0113 0.1014

So, our model predicts that smoking results in a lifetime risk of about 10.1% for lung cancer. For non-smokers, the risk reduces to about 1.1%. This yields a risk ratio of 0.101/0.011 ≈ 9! This seems quite high, but it is of the same order as the lifetime numbers in this study.
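The same intervention in code: fixing \(S\) removes its dependence on \(Y\), \(A\), and \(R\), which therefore drop out of the calculation entirely:

```python
pX = 0.2
pT = {1: 0.90, 0: 0.05}             # P(T=1 | S)
pL = {(1, 1): 0.20, (1, 0): 0.01,
      (0, 1): 0.09, (0, 0): 0.005}  # P(L=1 | X, T)

def bern(p, v):
    return p if v == 1 else 1.0 - p

def p_L_do(s):
    # P(L=1 | do(S=s)): only X and T remain relevant once S is fixed.
    return sum(bern(pX, x) * bern(pT[s], t) * pL[(x, t)]
               for x in (0, 1) for t in (0, 1))

print(round(p_L_do(0), 4))  # 0.0113
print(round(p_L_do(1), 4))  # 0.1014
```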

Note that, if we wanted to perform statistical inference on this outcome, it would depend on the size of the sample we used to generate our probabilities. Say that our sample had size \(n=500\), then we have two expected values (indicating how many of this sample are expected to develop lung cancer in their lifetimes, given our intervention) of:

$$\mathbb{E}(L|do(S=0)) = n \cdot \mathbb{P}(L|do(S=0)) = 500 \cdot 0.0113 = 5.65$$ $$\mathbb{E}(L|do(S=1)) = n \cdot \mathbb{P}(L|do(S=1)) = 500 \cdot 0.1014 = 50.7$$

We can now use a chi-square test or Fisher's exact test to test whether our observed difference \(\Delta\mathbb{E}=\mathbb{E}(L|do(S=1))-\mathbb{E}(L|do(S=0))\) is expected under the null hypothesis \(H_0: \Delta\mathbb{E}=0\). This is our contingency table, after rounding these expected values to integers:

\(L=0\) \(L=1\)
\(S=0\) 494 6
\(S=1\) 449 51

For our data, the test is significant (\(p<0.001\)), indicating that this causal effect is likely generalizable.
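If you want to run this test in code, scipy provides both; a quick sketch using the contingency table above:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Rows: do(S=0), do(S=1); columns: L=0, L=1 (from the table above).
table = np.array([[494, 6],
                  [449, 51]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, p = {p:.1e}")   # p is far below 0.001

odds_ratio, p_exact = fisher_exact(table)
print(f"Fisher's exact p = {p_exact:.1e}")
```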

Dealing with unobserved data

In this lung cancer example, we observed all the variables of interest, and assumed that there were no unmeasured confounders (variables that commonly influence two or more of the model variables). If this is true, the model is called causally sufficient [10]. One way to formally represent causal sufficiency is to add a new node for each node in the model, representing the unmeasured influences (or "noise") for its variable. If each of these noise nodes has only one (outgoing) edge, we are explicitly asserting causal independence between each, and excluding the existence of confounders. Our lung cancer model would then look like this:

I used circles instead of images here, with (following Pearl) open circles to represent unobserved variables and dashed arrows to represent their causal influence on target nodes. Adding error nodes tends to make the graph cluttered, so they can often be omitted with the assumption simply stated that they are independent (or that the model is causally sufficient).

If a confounder is known (or suspected) but unmeasured, it should be added to the graph in order to make it causally sufficient. For our smoking model, let's consider the possibility that cultural acceptance, \(C\), of cigarette smoking (which we have not measured) can influence the probabilities of being peer pressured, of deciding to smoke, and also (via second-hand smoke, \(H\), in social venues) the level of tar in one's lungs. Our updated graph would look like this:

[Figure: updated causal graph adding unobserved nodes C and H, with edges C → R, C → S, C → H, and H → T.]

Since we have no data on \(C\) or \(H\), we can't include them in our conditional probability calculations. What now?

Well, our goal here is to demonstrate that these confounders don't matter for determining an answer to our research question. In other words, for our lung cancer example, we want to show that the two unobserved variables, \(C\) and \(H\), are independent of our variable of interest, \(L\), conditional on one (or more) of our observed variables.

The following sections delve deeper into this goal.

D-separation

If we put all the variables in our gas pedal example together in a single graph, we get this [11]:

We can use this causal model to look for a path between pairs of variables, and assess whether other nodes block that path. By "path" we mean any sequence of edges connecting two nodes, regardless of their direction. A path, in other words, is allowed to go against the direction of the arrows. When we say that a node (or set of nodes) \(Z\) "blocks" a path between two nodes \(A\) and \(B\), we mean that these variables become independent, conditional on \(Z\). We can use the properties of chains, forks, and colliders, described above, to determine this.

Say, for instance, that we wanted to test the hypothesis that the throttle valve drive (\(X_3\)) is necessary for the causal link between depressing the gas pedal (\(A\)) and acceleration (\(B\)). In other words, we want to test whether \(A\) is independent of \(B\), conditional on \(X_3\). To do this, we have to show that all paths between \(A\) and \(B\) are blocked by \(X_3\). This is straightforward to assess visually, since the only path between \(A\) and \(B\) (shown in green above) is a causal chain that contains \(X_3\).

A node, or more generally a set of nodes \(Z\), that blocks all paths between two other nodes \(A\) and \(B\) is said to d-separate these nodes [12]. In other words, if \(A\) and \(B\) are d-separated by \(Z\), they are independent, conditional on \(Z\). If two variables are not d-separated by \(Z\), they are said to be d-connected.

Some useful rules about d-connectedness:

  1. Two nodes \(A\) and \(B\) are unconditionally d-connected if there is a path between them that does not contain a collider (two arrows meeting). In the graph below, \(A\) and \(B\) are unconditionally d-connected, whereas \(U\) and \(V\) are not:
  2. Two nodes \(A\) and \(B\) are conditionally d-connected, conditional on a third node (or set of nodes) \(Z\), if there is a path \(L\) between them such that each node \(X_i \in L\): (a) is not a collider, and (b) is not a member of \(Z\); except in the case of rule 3 below. If this is not the case, then \(Z\) is said to d-separate \(A\) and \(B\). In the graphs below, \(A\) and \(B\) are d-connected, conditional on \(Z\), whereas \(U\) and \(V\) are not:
  3. \(A\) and \(B\) are also conditionally d-connected if a node \(X_i \in L\) is a collider, and either it or one of its descendants is a member of \(Z\). This one's tricky, but refers to the counterintuitive property of colliders described above; i.e., that two common causes of a node become dependent (correlated) when conditioning on that node. In the graph below, I've changed the direction of edge \(Z \rightarrow V\) to \(V \rightarrow Z\). This makes \(Z\) a collider in the path \(U \rightarrow Z \leftarrow V\), meaning that \(U\) and \(V\) are now d-connected, conditional on \(Z\):

The "descendant" part of rule 3 needs further elaboration. It implies that in the graph below, \(U\) and \(V\) are still d-connected, because \(Z\) is a descendant of collider \(B\):

Why is this? Essentially, because if we fix the value of \(Z\), we are also fixing the value of \(B\), which causes \(Z\). Fixing \(B\) means that we make \(U\) and \(V\) conditionally dependent.

Using D-separation to deal with confounders

We can apply the concept of d-separation to the problem of determining whether our unobserved variables are confounders. Returning to our lung cancer example:

We want to determine whether our inference about the causal effect of smoking \(S\) on lung cancer \(L\) is confounded by cultural acceptance \(C\). This is the same as asking whether \(C\) and \(L\) are d-separated, conditional on \(S\). Conditioning on \(S\), we can see that the paths \(C \rightarrow S \rightarrow T \rightarrow L\) (shown in green) and \(C \rightarrow R \rightarrow S \rightarrow T \rightarrow L\) (shown in red) are blocked, but the path \(C \rightarrow H \rightarrow T \rightarrow L\) (shown in blue) is not.

In a nutshell, this means that we cannot infer the causal effect of smoking on lung cancer, given the unmeasured confounders of cultural acceptance and second-hand smoke. This is because the probability of \(T\) now depends on both \(S\) and \(H\), and the latter cannot be determined.

Eek.

Notably, if the route via second-hand smoke did not exist (e.g., because smoking in public spaces is banned), then \(C\) would be d-separated from \(L\) (conditional on \(S\)), and we could happily ignore it in our causal inference \(S \rightarrow L\). Sadly, this is not the case.
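For graphs of any real size, checking d-separation by eye becomes error-prone; networkx can do the bookkeeping for us. Here is a sketch of the check above (note that networkx versions before 3.3 call this function d_separated rather than is_d_separator):

```python
import networkx as nx

# The expanded lung cancer model, including the unmeasured variables
# C (cultural acceptance) and H (second-hand smoke).
G = nx.DiGraph([
    ("X", "L"), ("Y", "A"), ("A", "S"), ("R", "S"),
    ("S", "T"), ("T", "L"),
    ("C", "R"), ("C", "S"), ("C", "H"), ("H", "T"),
])

# Is C d-separated from L, conditional on S? No: the path through
# second-hand smoke (C -> H -> T -> L) remains open.
print(nx.is_d_separator(G, {"C"}, {"L"}, {"S"}))   # False

# If the second-hand smoke route did not exist, conditioning on S
# would be sufficient to block every path from C to L.
G2 = G.copy()
G2.remove_node("H")
print(nx.is_d_separator(G2, {"C"}, {"L"}, {"S"}))  # True
```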

So what are our options? Here are two:

  • If we are still in the planning phase of the research design, this model could inform us of the importance of including second-hand smoke as a measured variable, while also informing us that, at least for the \(do(S=s)\) intervention, cultural acceptance — and all other ancestors of \(S\) — do not need to be measured.

  • Depending on the size of our dataset, we could decide to analyze a subset of it, such that \(C\) is constant (in other words, only choose localities where smoking is or is not accepted). This allows us to factor out the influence of \(H\), without determining it directly [13].

Wrapping up

Okay, deep breath. Hopefully some of you have not only made it to this point, but have found some inspiration about how causal inference might be applicable to your own research interests. It is not a trivial topic, but neither should it be considered voodoo, especially after practicing a bit with graph representations of your data/model!

This is only the tip of a large iceberg, however. There is much more to learn, including:

  • The examples above take a predetermined causal graph as a starting point, but what if we do not have a causal structure in mind, and/or would like to determine the most likely causal graph given our observational dataset? This is the goal of causal discovery approaches (now the subject of another blog post).

  • Many datasets consist of time series (i.e., dynamical data). These can be represented as graphs whose causal structure is fixed, but whose variables are random processes that fluctuate over time. The influence of variable \(A\) on \(B\) may have an inherent delay (or lag). Here is a taster of methods that exist to infer causality in such systems.

  • Many systems, such as the throttle example above, actually include cycles (e.g., feedback loops where sensors cause changes in their own ancestors). How do we model these systems? The answer lies partially in the time series approaches introduced in the preceding bullet point, but it's an important question. Methods have been developed to deal with cyclical causal models (see this post).

These are topics I hope to tackle in future blog posts.

Some further useful reading, and resources that have helped me with this post (see also the footnotes below):

  • Arthur Mello, Towards Data Science Medium post (starting point for numerous other posts on TDS)
  • Judea Pearl's original paper
  • Pearl and MacKenzie's The Book of Why, a more generally accessible book explaining the concepts of causal inference and its importance
  • Free online course (with R code) from Leslie Myint at Macalester College

  [1] That being said, I implore anyone reading this to get in touch with any criticism or doubts about the veracity of the contents of this blog post. I will iteratively update the document and make the necessary apologies. My hope is that this really becomes a useful inroad to an important and complex field, and my worry is that I introduce erroneous or misleading information to anyone attempting to follow this inroad.

  [2] The symbol "\(¬\)" signifies negation.

  [3] Note that the expectation is a generalization of the discrete event formulation. For example, considering that the variable HasCancer has a binary distribution (can be either "yes" or "no"), its expected value is \(\mathbb{E}(X)=np\), where \(n\) is the sample size, and \(p\) is the probability. See https://online.stat.psu.edu/stat500/lesson/3/3.2/3.2.2.

  [4] To my knowledge, this important experiment has never been conducted...

  [5] This refers to correlation in its broad sense, as any statistical relationship between two variables. In its common use, correlation is a form of dependence between two variables where one is a monotonic function of the other. There are many forms of dependence where the function mapping one variable to the other is not monotonic, such as the function of a circle (\(x^2+y^2=r^2\)). The probabilistic reasoning in this post still holds for such forms of dependency. Thanks to Benedikt Ehinger for pointing this out!

  [6] In his original text, Judea Pearl introduces the do operator using the notation \(\check{A}\). I find "do" used as a function to be more intuitive, and it has become common notation.

  [7] We haven't yet modelled "noise", but assume that the unmeasured noise factors for \(B\) and \(X_5\) are independent.

  [8] The logic here is identical to that of the Monty Hall problem. See this article for elaboration.

  [9] The pi symbol, \(\prod\), is the product operator, indicating a multiplication over all members of a sequence. As a second point here, the first term in this product, \(\mathbb{P}(A)\), is unconditional, because \(A\) has no parents; i.e., \(Pa(A)=\emptyset\), where \(\emptyset\) denotes the empty set. Conditioning on an empty set is technically undefined, but the lack of conditioning variables is typically understood to imply unconditionality. See this thread for nerdy discussion.

  [10] For a deeper (philosophical) dive into the concept of causal sufficiency, see this article.

  [11] To avoid introducing a cycle (and violating the DAG assumption) I haven't closed the feedback loop between the throttle sensor \(X_5\) and the control unit \(X_2\).

  [12] The "d" here stands for "directional". See this explanation by Judea Pearl.

  [13] This is because our marginal probability \(\mathbb{P}(H)\) becomes a constant, which can be conveniently dropped when testing the hypothesis \(H_0: \Delta\mathbb{E}=0\) (i.e., it appears on both sides of the inequality).
