Causal inference allows us to generalize about what causes what in our universe. The "what" refers to physical entities and the "cause" refers to a perturbation of the behaviour of one such entity by signals/forces originating in another. The complexity and scale of both the entities and their interactions can vary, but some general principles apply 1:
- Conditional dependence: An event \(B\) (occurring on entity \(b\)) is more probable when an event \(A\) occurs on entity \(a\) than when it does not occur: \(\mathbb{P}(B|A)>\mathbb{P}(B|\neg A)\)
- Temporal precedence: If event \(A\) causes event \(B\) (written as \(A \rightarrow B\)), there must be a non-zero delay between \(A\) and \(B\)
The above conditions can be true, however, even when \(A\) does not cause \(B\), typically because both are caused by a third entity \(C\), often called a confounder. For example, the level of a thermometer will rise before the water in which it is sitting boils, but the thermometer is not causing it to boil. An increase in the water's temperature is causing both.
The \(do\) operator is used to denote an active manipulation of \(a\) to produce or prevent event \(A\), with \(\mathbb{P}(B|do(A))\) denoting the probability of \(B\) under that intervention. Without such manipulation (i.e., via passive observation), \(\mathbb{P}(B|A)>\mathbb{P}(B|\neg A)\) can still be true if both events are caused by a third entity. The logic relating such interventional and observational quantities is referred to as the "do calculus".
In a previous blog post, I introduced some of the key concepts underlying causal inference, representing causal relationships between entities as a directed acyclic graph (DAG). There, we explored the idea of using graph-theoretical methods to support causal inferences based on observational data. The catch was that we first needed to know the underlying causal structure of our graph.
But what if we don't?
In scientific research, this is the more typical case: having observed variables from a system composed of discrete events or entities generating continuous signals, how can we infer which entities cause changes in which others? This problem is generally referred to as causal discovery or exploratory causal analysis.
In this post, I will explore the main concepts underlying causal discovery and some algorithms for performing it. The same disclaimer applies as for the previous post: I am not an expert in this field, but more of an enthusiast using teaching as a way to learn more deeply.
Assumptions
Causal discovery methods typically depend on a number of strict assumptions to work, which can differ between approaches 2. Many such methods are based on the so-called common cause principle (proposed by Reichenbach): if two random variables \(A\) and \(B\) are statistically dependent (e.g., correlated), then at least one of the following is true:
- \(A \rightarrow B\)
- \(B \rightarrow A\)
- \(A \leftarrow C \rightarrow B\)
(where \(C\) is some third variable). Following from this principle are two important conditions:
The Markov condition is the assumption that a given graph node \(A\) is independent of all nodes that are not its descendants, conditional on its parent nodes. In other words, the parent nodes "shield" \(A\) from (or fully account for) the influence of all of its other ancestors. The causal version of this assumption is that \(A\) has no causal relationship with any nodes that are not descended from it, conditional on its parents. In the graph below, nodes \(X\) and \(Y\) (coloured red) are independent of \(A\), conditional on its parent \(Z\), if the Markov condition holds. The green nodes can be causally influenced by \(A\):
The faithfulness condition states that, for a given causal DAG \(\mathcal{G}\), any pair of nodes \(A, B \in \mathcal{G}\) that are causally associated must also be statistically dependent (i.e., have a nonzero correlation). Equivalently, if \(A\) and \(B\) are conditionally independent, they have no direct causal relationship. Generally, this assumption is reasonable, but it can be violated if, for example, a direct causal relationship \(A \rightarrow B\) is cancelled out by a mediator variable \(C\) (see below) 3.
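To make the cancellation case concrete, here is a minimal toy simulation of my own (not from the original post): a linear model in which the direct path \(A \rightarrow B\) is exactly cancelled by the mediated path \(A \rightarrow C \rightarrow B\), so \(A\) and \(B\) end up uncorrelated despite a direct causal link:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Linear SCM: the direct path A -> B (+1.0) is exactly cancelled by
# the mediated path A -> C -> B (0.5 * -2.0 = -1.0)
A = rng.normal(size=n)
C = 0.5 * A + rng.normal(size=n)
B = 1.0 * A - 2.0 * C + rng.normal(size=n)

print(np.corrcoef(A, B)[0, 1])  # ~0: A "looks" independent of B

# Conditioning on the mediator C reveals the hidden dependence
slope = np.polyfit(C, B, 1)[0]
print(np.corrcoef(A, B - slope * C)[0, 1])  # clearly nonzero (~0.6)
```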
The causal sufficiency assumption asserts that a causal graph \(\mathcal{G}\) is not missing any confounders; in other words, there are no additional variables that, added to \(\mathcal{G}\), are common causes of two or more existing nodes in \(\mathcal{G}\). In yet other words, we are assuming that our graph represents observations of all possible confounders. The graph below (taken from my previous blog post) illustrates this property:
Here, the "noise" nodes represent all unobserved influences for specific variables. As long as each such node only points to a single observed variable, due to the Markov and faithfulness assumptions they cannot be considered confounders. Notably, in natural systems, this is a rather strong assumption, and probably the most difficult to satisfy. Thankfully, approaches have been proposed to deal with this limitation (we'll come back to this later).
The PC algorithm
One of the first (and most well-known) algorithms for causal discovery is called the PC algorithm 4. It relies on all three major assumptions outlined above. The algorithm consists of the following steps:
1. Start with a fully connected, undirected graph \(\mathcal{G}\), representing all variables of interest, plus known or potential confounders
2. Define a conditioning set \(Z\) that is initially empty (\(Z=\{\}\))
3. Remove every edge whose two endpoints are independent, conditional on \(Z\)
4. Repeat steps 2 and 3 across all possible conditioning sets
5. Assign directions to each remaining edge, based on what we know about how edge configurations (chains, forks, and colliders) relate to conditional independence, and on other information if available (e.g., temporal precedence); see my previous post for a review of these configurations
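To make steps 1-4 concrete, here is a minimal sketch in Python. It is my own simplification (the real algorithm, among other refinements, only draws conditioning sets from the neighbours of the two nodes under test), and `cond_indep` is a placeholder for an independence test such as the regression-based one described below:

```python
from itertools import combinations

import networkx as nx

def pc_skeleton(data, variables, cond_indep, max_k=3):
    """Steps 1-4: prune a complete undirected graph using CI tests.

    `cond_indep(data, i, j, Z)` is a placeholder for an independence
    test, e.g. the regression-based test described below.
    """
    g = nx.complete_graph(variables)       # step 1: fully connected
    sep_sets = {}
    for k in range(max_k + 1):             # step 4: grow |Z| each round
        for i, j in list(g.edges()):
            others = [v for v in variables if v not in (i, j)]
            for Z in combinations(others, k):        # step 2: conditioning sets
                if cond_indep(data, i, j, list(Z)):  # step 3: prune edge
                    g.remove_edge(i, j)
                    sep_sets[frozenset((i, j))] = set(Z)
                    break
    return g, sep_sets
```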
The best way to get an intuitive understanding of this approach is to consider an example. We could start with a simple three-node example, but why not just jump into the smoking / lung cancer example we developed in the first blog post? Here it is:
To summarize, each variable was considered binary (1=true, 0=false) and I fabricated (conditional) probabilities for each factor in the factorization of the graph's joint probability:
$$\mathbb{P}(X) \cdot \mathbb{P}(Y) \cdot \mathbb{P}(R) \cdot \mathbb{P}(A|Y) \cdot \mathbb{P}(S|A,R) \cdot \mathbb{P}(T|S) \cdot \mathbb{P}(L|X,T)$$
We want to use these probabilities to simulate some data. To do this we will use a simple Monte-Carlo approach: for each member of a random sample of \(N=2000\) individuals, we'll assign 1 to each variable when \(p_i \gt p_{rand}\) (and 0 otherwise), where \(p_i\) is the (conditional) probability for variable \(i\) and \(p_{rand}\) is pseudorandomly sampled from \([0,1]\). Note that we need to start with the root nodes (\(X\), \(Y\), and \(R\): the ones without parents), since these are needed to compute the conditional probabilities as we traverse the DAG.
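In code, the forward-sampling scheme might look like this (the root probabilities match the population values in the table below, but the conditional probability values are placeholders rather than the exact tables used for this post):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 2000

def bernoulli(p):
    """Assign 1 wherever a pseudorandom draw from [0,1] falls below p."""
    return (rng.random(N) < p).astype(int)

# Root nodes first (population probabilities from the table below)
X = bernoulli(0.20)
Y = bernoulli(0.30)
R = bernoulli(0.75)

# Then traverse the DAG, looking up each node's conditional probability
# from its parents (values here are placeholders, not the post's tables)
A = bernoulli(np.where(Y == 1, 0.80, 0.16))              # P(A|Y)
S = bernoulli(np.where(A == 1,
                       np.where(R == 1, 0.65, 0.35),
                       np.where(R == 1, 0.25, 0.10)))    # P(S|A,R)
T = bernoulli(np.where(S == 1, 0.90, 0.05))              # P(T|S)
L = bernoulli(np.where(X == 1,
                       np.where(T == 1, 0.30, 0.05),
                       np.where(T == 1, 0.10, 0.01)))    # P(L|X,T)
```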
Using these rules, I generated a sample with the following marginal probabilities 5:
Variable | \(\mathbb{P} ~ \)(population) | \(\mathbb{P} ~ \)(sample) |
---|---|---|
\(X\) | 0.200 | 0.194 |
\(Y\) | 0.300 | 0.328 |
\(R\) | 0.750 | 0.754 |
\(A\) | 0.355 | 0.352 |
\(S\) | 0.322 | 0.308 |
\(T\) | 0.323 | 0.306 |
\(L\) | 0.048 | 0.040 |
Our sample is thus a decent representation of the population. So let's pretend we know nothing about the causal graph, and run through the steps of the PC algorithm.
Steps 1-4: Pruning the graph
Here is our fully-connected, undirected graph:
Given our simulated sample, we can now start establishing (unconditional) independence relationships. For each pair of variables \(\{X_i, X_j\}\), we'll define "independence" using a logistic regression of the form 6:
$$ \text{logit} \, \mathbb{P}(X_i = 1) = \beta_0 + \beta_1 X_j + \sum_{k=1}^{K} \beta_{(k+1)} Z_k $$
where \(K\) is the size (cardinality) of our conditioning set \(Z\), and \(Z_k\) is a member of \(Z\). When \(K=0\) (empty set), this term drops out and we have a simple regression.
Our job is to test the null hypothesis \(H_0: \beta_1 = 0\). If this test fails to reject \(H_0\), we conclude that 7:
$$X_i \! \perp \!\!\! \perp \! X_j | Z$$
Which means: \(X_i\) is independent of \(X_j\), conditional on \(Z\). If this holds for some conditioning set \(Z\), we can further conclude that \(X_i\) and \(X_j\) are d-separated, and prune the edge between them.
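In code, such a test could look like this (a sketch assuming `statsmodels`; the function name, significance threshold, and interface are my own choices, not necessarily those of the post's implementation):

```python
import statsmodels.api as sm

def ci_test(df, i, j, Z, alpha=0.05):
    """Return True when we fail to reject H0: beta_1 = 0, i.e. when we
    treat X_i as independent of X_j conditional on the variables in Z."""
    predictors = sm.add_constant(df[[j, *Z]])
    fit = sm.Logit(df[i], predictors).fit(disp=0)
    return fit.pvalues[j] > alpha
```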
When we run this procedure for our smoking graph, after one iteration (empty conditioning set), our graph looks like this:
After subsequent iterations, it starts to look more familiar:
Step 5: Adding the arrows
The first four steps used conditional independence relationships to prune down the full graph to one containing only the edges we know to exist. So far so good.
The next step is to figure out the direction in which causality flows. For example, which of the following is true: \(Y \rightarrow A\) or \(A \rightarrow Y\)?
We can use what we know about three-way configurations to help narrow this down. Since \(A\) is connected to both \(Y\) and \(S\) in our pruned graph, we have a triplet \(Y — A — S\) (where there is no edge between \(Y\) and \(S\)) with four possible configurations 8. These are shown below next to their dependency relationships:
Configuration | \(Y \! \perp \!\!\! \perp \! S\) | \(Y \! \perp \!\!\! \perp \! S | A\) |
---|---|---|
$$S \rightarrow A \rightarrow Y$$ | No | Yes |
$$S \leftarrow A \leftarrow Y$$ | No | Yes |
$$S \rightarrow A \leftarrow Y$$ | Yes | No |
$$S \leftarrow A \rightarrow Y$$ | No | Yes |
Let's unpack this a bit. The first two rows are chain configurations, where causality flows from \(S\) to \(Y\) or from \(Y\) to \(S\). In terms of dependency relationships, these are indistinguishable. Row three is a collider configuration. Recall that, for a collider, the other two nodes are unconditionally independent but become dependent when conditioning on it. Row four shows a fork. Here the opposite is true: the other two nodes are unconditionally dependent, but become independent when conditioning on their common source. This configuration is indistinguishable from the two chain configurations above.
We can eliminate some of these possibilities by looking at the unconditional dependencies (since the conditional ones are simply their inverse). In our case, we find that \(S\) and \(Y\) are unconditionally dependent. Thus we can eliminate configuration 3 (collider), but are still left with three other possibilities.
Our algorithmic strategy is therefore to first iterate through all triplets in our graph and assign arrows when we discover colliders. We can then come back to the ambiguous cases such as \(Y — A — S\) and see whether any of the edges has been resolved. If we know for instance that \(A \rightarrow S\), we can eliminate configuration 1 from the table. The procedure, in outline (a code sketch follows the list):
- For each triplet \(A — B — C\), compute the dependency between \(A\) and \(C\)
- If \(A \! \perp \!\!\! \perp \! C\), conclude that the triplet is a collider and assign arrows to the edges: \(A \rightarrow B \leftarrow C\)
- If \(A \! \centernot{\perp \!\!\! \perp} \! C\), eliminate the collider configuration for this triplet
- Eliminate configurations that are incompatible with any newly assigned arrows
- Reiterate until (1) all edges have been assigned, or (2) no more ambiguities can be resolved
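Here is the collider-orientation phase in the same sketch style (again my own simplification, reusing the `ci_test`-like helper from above; full PC additionally propagates orientations with Meek's rules so as not to create new colliders or cycles):

```python
from itertools import combinations

def orient_colliders(skeleton, data, cond_indep):
    """Orient a -> b <- c for every unshielded triple a - b - c in which
    a and c are unconditionally independent (the collider signature)."""
    arrows = set()
    for b in skeleton.nodes():
        for a, c in combinations(list(skeleton.neighbors(b)), 2):
            if skeleton.has_edge(a, c):
                continue                    # shielded triple: skip
            if cond_indep(data, a, c, []):  # a is independent of c: collider at b
                arrows.add((a, b))
                arrows.add((c, b))
    return arrows
```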
This implies that we can run through our algorithm and still have uncertainty about some edges. Indeed, when we run through these steps (using the Python code referenced in the footnotes 5), we get this result:
We have recovered directions for all the edges except \(Y — A\)! While we know \(A \rightarrow S\), we cannot differentiate between the fork \(Y \leftarrow A \rightarrow S\) and the chain \(Y \rightarrow A \rightarrow S\). These two graphs are called Markov equivalent, and are said to belong to the same equivalence class: a set of graphs that equally explain the conditional independence relationships in our data. The PC algorithm therefore typically outputs multiple Markov-equivalent graphs (more about this below).
In such cases, we can try to apply our knowledge of the system. For example, where temporal precedence is known, it can be used to disambiguate. Here, we know that genes cause phenotypes, not vice versa; so the only valid configuration is \(Y \rightarrow A \rightarrow S\) 9.
And with that minor intervention, we've fully "discovered" our causal network. Given our sample probabilities as shown in the table above, we can proceed to generate causal inferences as described in the first blog post.
Limitations of the PC algorithm
There are several criticisms that can be leveled at the PC algorithm (and causal discovery methods more generally), which typically fall into two categories: (1) the plausibility of the assumptions; and (2) its ability to discover causal relationships for empirical, rather than simulated systems. We'll briefly explore these below.
Plausibility of assumptions
Faithfulness
As explained above, the faithfulness assumption can be violated by the presence of opposing effects within a causal DAG.
Richard Scheines has argued that this is not a major shortcoming, on the basis that such configurations (where one causal route has exactly the opposite effect of another) would arise in nature only due to "incredible coincidence", and do not constitute systematic "regularities". Other arguments include the observation that faithfulness is really, really useful for reducing the number of models that fit the data.
Others disagree. Kennaway, for example, argues that faithfulness is regularly and robustly violated in control systems, in which a system self-regulates its state around some set point (see, e.g., the cruise control system introduced in the previous blog post) using feedback. However, such systems are typically cyclic, meaning that they cannot be represented as (static) DAGs. Thus, these systems do not fall under the class of systems addressed by the PC algorithm.
However, another common form of control, feedforward inhibition, can be represented with DAGs: positive effects are counterbalanced by negative ones via a parallel route. Evolving over time, parallel feedforward excitation and inhibition pathways (e.g., in a neural system) can result in an initial increase and subsequent decrease in downstream activity, the average effect of which is zero. Such a configuration is shown below (note that arrows indicate excitatory and circles inhibitory connections):
A related assumption of the PC approach is that conditional independence can be reliably inferred from a sample, which is an inherently uncertain process. The validity of the faithfulness assumption is thus contingent on the statistical power of the sample to support correct inferences about dependence when it is present in the population. Smaller samples will produce more frequent false negatives, and even one false negative (over a large number of tests) is sufficient to produce erroneous causal graphs. Claassen and Heskes argue this point here:
> In real-world data, the large sample limit does not apply, and if one or more incorrect independence decisions are made, then this can lead to a series of erroneous orientations. But this ambiguity is not apparent in the output, severely limiting the interpretability of the entire model, also because there is little or no accountability prospect for any of the causal relations found.
I'll address the "real-world accountability" issue below, but let's explore the sample size issue a bit further with our simulated system, where we have a ground truth to account for. Above, I generated a sample with a fairly large \(N\) of 2000 to avoid this problem. But what would happen if I used smaller samples? In the plot below, I'm showing the results of running the PC algorithm on 250 random samples for each sample size ranging from \(N=200\) to \(N=2000\) in steps of 200.
The Y-axis shows the average accuracy of the discovered graph (versus the ground truth) and the X-axis shows sample size. Accuracy increases from 200 to 1200, at which point it levels out close to 1.0 (perfect agreement). This is a pretty large minimal sample size!
It is notable, however, that this will depend on the specific system under evaluation, and how the variables are distributed. For my smoking example, the inclusion of a low-probability binary variable (lung cancer) meant that samples smaller than 200 could not be reliably fitted using a logistic regression model (due to rank deficiency in the generated samples), and sample sizes up to 600 still produced some samples that had this problem.
Different systems and variables are thus likely to require different absolute sample sizes. On the other hand, I do expect the relationship between accuracy and sample size to be similar for most systems, especially real-life ones with more unexplained variability. This could conceivably be used to estimate the sample size required to have sufficient power to discover a causal graph (at some tolerance level such as 90%).
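For concreteness, here is one way the per-sample accuracy in such an experiment could be computed, as simple edgewise agreement between the estimated and true directed graphs (a sketch of my own; the metric behind the plot above may differ):

```python
import networkx as nx

def graph_accuracy(g_est, g_true):
    """Fraction of ordered node pairs whose edge status (present/absent,
    with direction) agrees between the estimated and true directed graphs."""
    nodes = list(g_true.nodes())
    agree = total = 0
    for i in nodes:
        for j in nodes:
            if i == j:
                continue
            agree += g_est.has_edge(i, j) == g_true.has_edge(i, j)
            total += 1
    return agree / total
```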
Causal sufficiency
Causal sufficiency means that all common causes of the variables of interest are both included in the model and observed. For real-world systems, without laboratory controls, this assumption is quite bold. There will virtually always be some set of variables that influence your system but have not been measured. And without observing them, the existence (or absence) of such confounders is impossible to test.
The problem with unobserved confounders (also called latent variables) is that they introduce a dependency between two variables that in actuality have no causal relationship. Without being able to condition on the confounder, the PC algorithm will simply report that at least one of these causes the other. To illustrate this, let's change up our causal model such that we no longer have any information about genetics:
As you can see, we've also changed the ground truth by adding a causal link between Gene Y and Peer Pressure (i.e., a carrier of Y is, presumably through some personality phenotype, more likely to experience peer pressure).
If we use this modified model to generate samples, and run the PC algorithm with only the observed variables (solid circles), we have a lot more difficulty assigning directions to the adjacent edges and eliminating colliders. Indeed, the result produces no clear causal relationships (although it does allow us to eliminate some edges):
To address this rather critical issue, Richardson and Spirtes (2002) introduced the concept of a maximal ancestral graph (MAG). A MAG is a type of mixed graph, in which edges can be undirected, (uni-)directed, or bidirected. As in the graph shown above, a bidirected edge in a mixed graph represents ambiguity about the directionality of a discovered causal relationship.
An ancestral graph is a mixed-graph representation of ambiguity, where ambiguous edges are bidirected. It is considered maximal if no additional edges can be added to it without changing the conditional independence relationships. Such a graph allows us to represent an equivalence class as a single diagram. Exploring this in further depth would require a new post, but in a nutshell, this MAG representation can also be analyzed with respect to properties such as d-separation (called m-separation in a MAG), and the space of possible DAGs can be reduced using this formulation, allowing a more limited set of causal inferences to be supported 10.
Utility for discovering causality in real-world systems
Our smoking example above was completely simulated. Such simulated datasets are very useful in that they define the ground truth a priori, allowing us to quantify the performance of the causal discovery approach. What we really want, of course, is for these algorithms to be applicable to real-world systems, where the (full) ground truth is typically unknown.
The PC algorithm has been criticized because, while it performs well on particular (hand-picked) simulated systems, its efficacy for discovering causality in real-world systems is questionable. To use a rather sardonic quote from Humphreys and Freedman 11:
> On pp. 132-52 and 243-50, SGS analyse a number of real examples, mainly drawn from the social-science literature. What are the scoring rules? Apparently, SGS count a win if their algorithms more or less reproduce the original findings (rule no. 1); but they also count a win if their algorithms yield different findings (rule no. 2). This sort of empirical test is not particularly harsh. Even so, the SGS algorithms reproduce original findings only if one is very selective in reading the computer output, as will be seen below.
Childish innuendo pervades the language of these authors, but putting this aside 12, they make a valid point: many of the original empirical examples offered by the PC algorithm authors (abbreviated SGS in the quote) provided weak evidence, or were even modified in presentation to remove absurd causal assertions such as "race and religion cause region of residence". In general, I was not able to easily find good independent examples of the original PC algorithm applied successfully to an empirical problem 13.
Alternative approaches to causal discovery
The PC algorithm was a first stab at the problem of causal discovery. Since its publication in 2000, enhancements and alternative approaches have been proposed by the same authors and others. Below I will briefly describe a few classes of these 14.
Constraint-based approaches
In general, this class of algorithms utilizes constraints such as conditional independence and temporal precedence to determine a Markov equivalence class of causal graphs.
This includes the PC algorithm and its direct successor, the fast causal inference (FCI) algorithm 15. FCI is a modification of the PC algorithm (and its underlying formalisms) that accounts for selection bias and unmeasured (latent) variables representing potential confounders. Like PC, FCI starts with a fully connected, undirected graph and iteratively generates partial ancestral graphs (PAGs) using an extended set of rules; because intermediate results are valid, the algorithm can be stopped at any point and still yield a usable (if less informative) output. The PAG is a correct representation of the causal structure, but less informative than the maximal ancestral graph (MAG, see above). As described above, PAGs and MAGs are mixed graphs that can include directed but also bidirected edges, where the direction of causality is ambiguous.
Another constraint-based approach is logical causal inference (LoCI) 16, which builds upon the FCI framework to make distinct logical statements about causal relationships. LoCI defines a minimal independence set \(Z\), such that all members of \(Z\) are necessary to d- or m-separate two other nodes \(X\) and \(Y\) (no subset of \(Z\) is sufficient). Identifying such a set leads to logical rules about causal relationships between \(Z\), \(X\), and \(Y\), along with potentially other nodes along a path between \(X\) and \(Y\). A list of such relationships can be compiled by iterating over nodes, and simplified by substitution to obtain a logical characterization of the system, which can be represented graphically as a PAG. Like FCI, LoCI can deal with latent confounders and selection bias, and the two have comparable time complexity.
Score-based approaches
Score-based algorithms make use of a scoring criterion that assigns a value representing the goodness of fit between a causal model and observed (or simulated) data. This criterion is used to search the space of models and find or approximate a global maximum. The main advantage of this approach is that it can leverage the abundance of existing optimization algorithms to arrive at the most parsimonious solution(s).
Greedy equivalence search (GES) 17 is one of the first and most widely used score-based approaches. As the name implies, GES uses a greedy algorithm to search for increasingly better-fitting equivalence classes of DAGs (as opposed to individual DAG configurations). This is accomplished by defining a completed partially directed acyclic graph (CPDAG) \(\mathcal{P^c}\), which is a unique mixed-graph representation of an equivalence class. A single edge transformation can then be applied (i.e., addition or removal of a directed/undirected edge, or reversal of a directed edge), yielding a new CPDAG \(\mathcal{P^c}^\prime\). Of all legal transformations, the one with the maximal difference \(Score(\mathcal{P^c}^\prime) - Score(\mathcal{P^c})\) is accepted as locally optimal, and \(\mathcal{P^c}^\prime\) is set as the current state. The algorithm iterates over transformations at each equivalence class until no further transformation improves the fit. Notably, the \(Score\) function is an estimate of how well the independence constraints of a DAG matching \(\mathcal{P^c}\) fit the observed/simulated data.
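To convey the flavour of score-based search, here is a toy, forward-only hill-climb over individual DAGs using a BIC-style linear-Gaussian score. Note that this is emphatically not Chickering's algorithm, which searches over equivalence classes with insert/delete operators and includes a backward (edge-removal) phase; it is a minimal sketch of the general idea:

```python
from itertools import permutations

import networkx as nx
import numpy as np

def bic_score(data, graph):
    """BIC-style score: linear-Gaussian fit of each node on its parents."""
    n = len(data)
    score = 0.0
    for node in graph.nodes():
        parents = list(graph.predecessors(node))
        y = data[node].to_numpy(dtype=float)
        if parents:
            X = np.column_stack([data[p].to_numpy(dtype=float) for p in parents])
            X = np.column_stack([X, np.ones(n)])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
        else:
            resid = y - y.mean()
        # Gaussian log-likelihood term minus a BIC penalty per parameter
        score += -0.5 * n * np.log(resid.var() + 1e-12)
        score += -0.5 * (len(parents) + 1) * np.log(n)
    return score

def greedy_dag_search(data):
    """Forward-only greedy search: repeatedly add the edge that most improves BIC."""
    nodes = list(data.columns)
    g = nx.DiGraph()
    g.add_nodes_from(nodes)
    while True:
        base = bic_score(data, g)
        best_delta, best_edge = 1e-9, None
        for u, v in permutations(nodes, 2):
            if g.has_edge(u, v):
                continue
            g.add_edge(u, v)
            if nx.is_directed_acyclic_graph(g):
                delta = bic_score(data, g) - base
                if delta > best_delta:
                    best_delta, best_edge = delta, (u, v)
            g.remove_edge(u, v)
        if best_edge is None:
            return g
        g.add_edge(*best_edge)
```

Given a pandas DataFrame with one column per variable, `greedy_dag_search(data)` returns a DAG that locally maximizes the score.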
The main limitations of GES are that (1) like many constraint-based approaches, it relies on the same assumptions that are subject to violation in real-world systems, and (2) it performs best in the large/infinite sample limit rather than at realistic sample sizes.
Functional approaches
The causal models we've considered so far have used binary variables to generate samples. For the simple model \(A \rightarrow B\), where \(\mathbb{P}(A)=0.7\), \(\mathbb{P}(B|A=0)=0.4\) and \(\mathbb{P}(B|A=1)=0.1\), the relationship between \(A\) and \(B\) looks like:
We can formalize the causal relationship between \(A\) and \(B\) as a function:
$$B := f(A)$$
where the function \(f(A)\) assigns 0 or 1 to \(B\) based on the conditional probabilities above. The function \(f\) could actually be any (possibly non-linear) function, and in general, functional approaches define the causal model as follows 18. Given an observed variable \(x_i\), where \(i\) is a node in graph \(\mathcal{G}\), its value is determined as:
$$ x_i := f(\textbf{x}_{\text{Pa}(i)}) + n_i $$
where \(\textbf{x}_{\text{Pa}(i)}\) are the variables corresponding to the parent nodes of \(i\), and \(n_i\) is an unobserved noise variable that is jointly independent of the noise for other observed variables (see the "causal sufficiency" graph above).
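For instance, data from a two-variable additive-noise model with a nonlinear \(f\) and non-Gaussian noise can be simulated in a couple of lines (a toy example of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

x = rng.uniform(-2, 2, size=n)           # cause
noise = rng.uniform(-0.3, 0.3, size=n)   # independent, non-Gaussian noise
y = np.tanh(2 * x) + noise               # effect: y := f(x) + n, nonlinear f
```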
Hoyer and colleagues 18 propose an algorithm with the following steps:
- Start with a candidate DAG \(\mathcal{G_c}\)
- For each node \(i \in \mathcal{G_c}\):
- For each parent node \(j \in \text{Pa}(i)\):
- Determine whether \(x_i \! \centernot{\perp \!\!\! \perp} \! x_j\)
- If not, reject \(\mathcal{G_c}\)
- Perform a nonlinear regression on the model \(x_i := f(\textbf{x}_{\text{Pa}(i)}) + n_i\)
- Obtain the estimated residual error \(\hat{n_i} = x_i - \hat{f}(\textbf{x}_{\text{Pa}(i)})\)
- For each variable \(x_j \in \textbf{x}_{\text{Pa}(i)}\):
- Determine whether \(\hat{n}_i \! \perp \!\!\! \perp \! x_j\)
- If not, reject \(\mathcal{G_c}\)
- If \(\mathcal{G_c}\) is not rejected by the above steps, conclude that it is consistent with the data
This requires an iteration over the space of all possible DAGs \(\mathcal{G_c}\). The obvious limitation of this brute-force approach is that the space of possible DAGs grows super-exponentially with the number of nodes, and thus the size of models testable in this way is very limited. Additionally, many DAGs may be accepted by this approach, resulting in a large equivalence class of graphs 19. A large number of independence inferences are also being made here, running into the same family-wise error issue discussed above.
The main advantage of the functional approach is that, for certain types of functional relationships and noise distributions, the model is non-invertible: it fits in the causal direction but not the reverse, which can be used to determine directionality in causal graphs. Non-Gaussian noise distributions are handy here, since the linear-Gaussian case is symmetric under reversal of cause and effect, whereas non-Gaussian cases generally are not.
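Continuing with the toy \(x, y\) sample generated above, this asymmetry can be demonstrated by fitting a nonlinear regression in both directions and checking whether the residuals remain dependent on the putative cause (here I use a crude squared-correlation check as a stand-in for a proper independence test such as HSIC):

```python
import numpy as np

def residual_dependence(cause, effect, degree=5):
    """Fit a polynomial regression of effect on cause and measure how much
    the squared residuals still co-vary with the squared input (a crude
    stand-in for a proper independence test such as HSIC)."""
    coeffs = np.polyfit(cause, effect, degree)
    resid = effect - np.polyval(coeffs, cause)
    return abs(np.corrcoef(resid ** 2, cause ** 2)[0, 1])

# In the true causal direction the residuals are just the independent noise;
# in the reverse direction they retain structure from the (putative) cause.
print("x -> y:", residual_dependence(x, y))  # should be close to 0
print("y -> x:", residual_dependence(y, x))  # should be noticeably larger
```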
Another interesting functional approach is called Linear Non-Gaussian Acyclic Model (LiNGAM), proposed by Shimizu and colleagues 20. The method requires that each variable \(x_i\) is determined as a linear function of its ancestors:
$$ x_i = \sum_{k(j) \lt k(i)} b_{ij}x_j + n_i $$
where \(k(i)\) denotes a causal ordering in a DAG. The error term \(n_i\) must be continuous and non-Gaussian.
Given these conditions, the authors use independent components analysis (ICA), and some ingenious permutation steps, to estimate directed relationships between each pair of nodes in a graph. Since ICA is a well-studied procedure with many available solvers, this approach is also applicable to large-scale systems. It does, however, depend on the assumption of causal sufficiency.
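If you want to try this yourself, there is a dedicated `lingam` Python package; the sketch below assumes its `DirectLiNGAM` estimator (a later refinement of the original ICA-based method) and its documented `causal_order_` / `adjacency_matrix_` attributes (check the package docs for the current API):

```python
import numpy as np
import pandas as pd
import lingam  # pip install lingam

rng = np.random.default_rng(2)
n = 2000

# Ground truth: x1 -> x2 -> x3, linear with non-Gaussian (uniform) noise
x1 = rng.uniform(size=n)
x2 = 2.0 * x1 + rng.uniform(size=n)
x3 = -1.5 * x2 + rng.uniform(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

model = lingam.DirectLiNGAM()
model.fit(X)
print(model.causal_order_)      # expected causal ordering: [0, 1, 2]
print(model.adjacency_matrix_)  # estimated coefficients b_ij
```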
Continuous optimization approaches
Continuous optimization approaches are essentially an extension of score-based approaches, as all forms of mathematical optimization require some sort of objective function to optimize. These methods reformulate the score-based approaches described above by replacing the combinatorial search over DAGs with a smooth (differentiable) real-valued function, which can be optimized using more powerful numerical methods, such as gradient descent, that are not applicable to the former.
Zheng and colleagues 21 first proposed such a function \(h(W)\), where \(W\) is a weighted adjacency matrix representing a graph \(\mathcal{G}\). This function is smooth and takes the value zero if and only if \(\mathcal{G}\) is a DAG. Although \(h\) is differentiable, the constraint \(h(W)=0\) defines a non-convex feasible set, so the optimization (performed with an augmented Lagrangian method) finds stationary points rather than guaranteed global minima. The constrained optimization problem looks like this:
$$ \min_{W \in \mathbb{R}^{d \times d}} ~~~ \frac{1}{2n} ||\textbf{X} - \textbf{X} W||^2_F ~~~ + ~~~ \lambda ||W||_1$$ $$ \text{subject to} ~~~ h(W) = 0 $$
where \(\textbf{X}\) is an \(n \times d\) data matrix and \(W\) has zero diagonal. There's really too much going on in this approach to unpack here, but in a nutshell: (1) the first term specifies a loss function, which quantifies the goodness of fit of a linear regression relating the observed data of node \(i\) to its parents in some candidate graph \(\mathcal{G_c}\), represented as \(W\); and (2) the \(\lambda\) term penalizes denser graphs, thus encouraging sparse solutions.
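The acyclicity function itself is compact enough to show here. Zheng and colleagues define \(h(W) = \mathrm{tr}\left(e^{W \circ W}\right) - d\), where \(\circ\) is the elementwise product:

```python
import numpy as np
from scipy.linalg import expm

def h(W):
    """NOTEARS acyclicity measure: zero iff the graph of W's nonzero
    entries is acyclic. tr(e^{W*W}) sums weighted closed walks of all
    lengths, so any directed cycle pushes the trace above d."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

W_cyclic = np.array([[0.0, 1.0],
                     [1.0, 0.0]])   # 2-cycle
W_dag = np.array([[0.0, 1.0],
                  [0.0, 0.0]])      # single edge, acyclic
print(h(W_cyclic))  # > 0 (here 2*cosh(1) - 2, approx. 1.09)
print(h(W_dag))     # 0.0
```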
Efficient neural causal discovery without acyclicity constraints (ENCO) 22 is an extension of the above formulation that drops the acyclicity constraint entirely. The authors accomplish this by (1) modelling each edge with two parameters, \(\gamma \in \mathbb{R}^{d \times d}\) for the existence of an edge and \(\theta \in \mathbb{R}^{d \times d}\) for its direction, where \(\theta_{ij} = -\theta_{ji}\); and (2) requiring interventional data, i.e., that interventions have been carried out on the sample and the resulting distribution observed.
As shown above 23, ENCO entails two training stages: distribution fitting and graph fitting. The first stage uses a neural network per variable, modelling it as a distribution conditional on all the other variables and determining which subset best fits the observed data. The second stage fits the graph parameters \(\gamma\) and \(\theta\) after applying interventions to each variable, thus estimating directionality.
Notably, while ENCO does not guarantee acyclicity, it generally produces acyclic graphs for datasets where the generating process is acyclic. It also performs well in the presence of deterministic variables, and can also be used to discover the presence of latent confounders. Its main drawback is the necessity of interventional data, which is not typically available in observational datasets.
Summing up
Given its enormous potential, causal discovery is a very active field of research, and many promising extensions to the initial PC algorithm have been proposed and tested. These approaches are improving both in scale and efficiency, but they typically perform best on fully-contained simulated examples, rather than real-world systems where the presence of confounders is ubiquitous. Moreover, the more promising approaches, such as ENCO, require interventional data, which is not always available, feasible, or even practically possible to collect (e.g., genetics or molecular/neuronal data).
On the other hand, the existence of these methods does suggest that it is possible to infer much more about causal relationships on the basis of observational data than is generally understood in scientific research, and these methods have likely not been fully embraced in many fields where they could be informative (or even transformative). Similarly, the more powerful methods that require (at least partial) interventional data can help guide research designs in many fields, including neuroscience (my own field).
My next steps are baby steps: dig into my own research ideas for ways to incorporate causal thinking into existing data, and design pilot experiments with causal inference in mind. Any insights gained in this process will be the subject of future blog posts.
Hopefully this post has inspired similar plans for anyone with the fortitude to have read this far 🤓😅.
1. This is a simplified version: the principles of relativity are also relevant here. Event \(A\) can only be caused by event \(B\) if \(A\) is situated in \(B\)'s future light cone, since forces (matter/energy) can only act at speeds limited by the speed of light. Similarly, temporal precedence has to be defined with reference to an inertial frame of reference. Our discussion here presumes these conditions to be satisfied. ↩︎
2. These are more thoroughly discussed in this Medium post by Styppa, Wahl, and Runge. ↩︎
3. This article explores the concept of cancellation more formally, especially where it is systematic rather than "coincidental". The author argues that many examples of failures of the faithfulness condition actually equate to models where a common-cause variable is omitted (thus violating the causal sufficiency assumption). Including these variables can in many cases mitigate the failure. The logic is weak regarding biological homeostasis, however... ↩︎
4. PC stands for the first names of the authors who originally proposed this approach, Peter Spirtes and Clark Glymour (original article here). ↩︎
5. The Python code for the sample generation and PC algorithm can be found under the relevant entry in the Code tab. The sample used to produce these numbers is here: CSV file ↩︎ ↩︎
6. We are using logistic regression here because our variables are binary. With continuous variables this can be substituted for an ordinary linear regression model. ↩︎
7. Of course, failure to reject the null hypothesis is not equivalent to accepting the null hypothesis. This is one point of criticism of this method, in addition to the many independent tests it is required to perform (accumulating family-wise error). Establishing the power of our sample to capture an effect is one way to mitigate the former; limiting the false discovery rate is a way to deal with the latter. ↩︎
8. This is called an unshielded triple in the literature. ↩︎
9. We could actually apply this knowledge first, more efficiently, since the arrows we know to be true can help with the process of narrowing down configurations for those we don't know. ↩︎
10. See this paper for an introduction to MAGs. ↩︎
11. Humphreys P & Freedman D (1996). The grand leap. British Journal for the Philosophy of Science 47(1), 113-123. doi: 10.1093/bjps/47.1.113 ↩︎
12. Perhaps this is simply how philosophers like to converse? ¯\_(ツ)_/¯ ↩︎
13. On the other hand, later variants of this algorithm have been applied with success in scientific studies; see an overview here. ↩︎
14. This is based on a recent review by Wang and colleagues. ↩︎
15. Spirtes PL (2001). An Anytime Algorithm for Causal Inference. Proceedings of Machine Learning Research R3, 278-285. url: https://proceedings.mlr.press/r3/spirtes01a.html ↩︎
16. Claassen T & Heskes T (2012). A Logical Characterization of Constraint-Based Causal Discovery. arXiv:1202.3711. doi: 10.48550/arXiv.1202.3711 ↩︎
17. Chickering DM (2002). Learning Equivalence Classes of Bayesian-Network Structures. J Machine Learn Res 2, 445-498. url: https://www.jmlr.org/papers/volume2/chickering02a/chickering02a.pdf ↩︎
18. See: Hoyer PO et al. (2008). Nonlinear causal discovery with additive noise models. NeurIPS Proceedings. url. ↩︎ ↩︎
19. It seems, however, fairly straightforward to combine the score-based and functional approaches, to, e.g., narrow the search space to equivalence classes rather than DAGs... ↩︎
20. Shimizu S et al. (2006). A Linear Non-Gaussian Acyclic Model for Causal Discovery. J Machine Learn Res 7, 2003-2030. url: https://jmlr.org/papers/v7/shimizu06a.html ↩︎
21. Zheng X et al. (2018). DAGs with NO TEARS: Continuous Optimization for Structure Learning. arXiv:1803.01422. doi: 10.48550/arXiv.1803.01422 ↩︎
22. Lippe P et al. (2021). Efficient Neural Causal Discovery without Acyclicity Constraints. arXiv:2107.10483. doi: 10.48550/arXiv.2107.10483 ↩︎
23. Image taken from Lippe P et al. (2021). Shared under license CC BY 4.0. ↩︎