Causal Discovery

Causal Discovery#

Most existing methodologies for average / heterogeneous treatment effects and personalized decision making rely on a known causal structure. This enables us to locate the right variables to control (e.g., confounders), to intervene (e.g., treatments), and to optimize (e.g., rewards). However, such a convenience is violated in many emerging real applications with unknown causal reasoning. Causal discovery thus attracts more and more attention recently to infer causal structure from data and disentangle the complex relationship among variables. In the following, we review the basic terminologies in causal discovery, discuss how these related to average / heterogeneous treatment effects and personalized decision making, and detail three classical ways of causal discovery.

General Causal Graph Terminology#

Consider a graph \(\mathcal{G} =(\mathbf{Z},\mathbf{E})\) with a node set \(\mathbf{Z}\) and an edge set \(\mathbf{E}\). A node \(Z_i\) is said to be a parent of \(Z_j\) if there is a directed edge from \(Z_i\) to \(Z_j\). Let the set of all parents of node \(Z_j\) in \(\mathcal{G}\) as \(PA_{Z_j} (\mathcal{G})\). A directed graph that does not contain directed cycles is called a directed acyclic graph (DAG). Suppose a DAG \(\mathcal{G}=(\mathbf{Z},\mathbf{E})\) that characterizes the causal relationship among \(|\mathbf{Z}|=d\) nodes, where \(\mathbf{Z}=[Z_1,Z_2,\cdots,Z_d]^\top \) represents a random vector and an edge \(Z_i\rightarrow Z_j\) means that \(Z_i\) is a direct cause of \(Z_j\).

Overview of Popular Causal Discovery Learners#

Causal discovery learners propose to learn a plusible causal graph from the observational data (up to Markovian Equivalent Class unless certain assumptions are satisfied for identifiability). Wide literature on causal discovery can be summarized in three classes (for models specified above).

The first type focuses on local conditional independence tests to find a causal skeleton and then determine the orientation of edges, such as the well-known PC algorithm (Spirtes et al., 2000; Kalisch & Bühlmann, 2007). However, testing the conditional independence of continuous variables is not easy (Shah & Peters, 2018).
The second class specifies properly functional causal models with additional assumptions on data distribution, including the ICA-LiNGAM (Shimizu et al., 2006) and the causal additive model (CAM) (Bühlmann et al., 2014).
The last class, the score-based method, includes the greedy equivalence search (GES) (Chickering, 2002) and the fast GES (fGES) (Ramsey et al., 2017) that use for example Bayesian scores in searching a space of causal models. Recently, Zheng et al. (2018) opened up another track of score-based methods by constructing an optimization with an acyclicity constraint under the linear structural equation model (LSEM), i.e. the NOTEARS. A follow-up work using a VAE parameterized by a graph neural network that generalizes LSEM was proposed in Yu et al. (2019) with a more computational friendly constraint, namely DAG-GNN. Also see Zhu & Chen (2019) and Cai et al. (2021) for other cutting-edge structural learning methods.

Learners Type	Supported Model	Noise Required for Training	Complexity	Scale-Free?
Testing based	Models 1	Gaussian	\(O(p^q)\)	Yes
Functional based	Models 1 & 2	non-Gaussian	\(O(p^3)\)	Yes
Score based	Models 1 & 3	Gaussian/non-Gaussian	\(O(p^3)\)	No

\(p\) is the number of nodes in \(\mathcal{G}\), and \(q\) is the max number of nodes adjacent to any nodes in \(\mathcal{G}\).

Causal Graphical Model 1: Linear Structural Equation Model#

Let \(B=\{b_{i,j}\}_{1\leq i\leq d,1\leq j\leq d}\) be a \(d\times d\) matrix, where \(b_{i,j}\) is the weight of the edge \(Z_i\rightarrow Z_j \in \mathbf{E}\), and \(b_{i,j}=0\) otherwise. Then, we say that \(\mathcal{G} =(\mathbf{Z},B)\) is a weighted DAG with the node set \(\mathbf{Z}\) and the weighted adjacency matrix \(B\) (the edge set \(\mathbf{E}\) is nested in \(B\)). Under no unmeasured confounders, the Markov condition, the faithfulness condition, causal sufficiency assumption, and the linear structural equation model (LSEM) such that \(\mathbf{Z}\) characterized by the pair (\(\mathcal{G}\), \(\epsilon\)) is generated by

(4)#\[\begin{equation}\label{lsem_x} \mathbf{Z}=B^\top \mathbf{Z} +\epsilon, \end{equation}\]

where \(\epsilon \) is a random vector of jointly independent error variables.

Causal Graphical Model 2: Additive Noise Model#

Suppose there exists a weighted DAG \(\mathcal{G}=(\mathbf{Z},\mathbf{E})\) that characterizes the causal relationship among \(|\mathbf{Z}|=d\) nodes. Each variable \(Z_i\) is associated with a node \(i\) in the DAG \(\mathcal{G}\), and the observed value of \(Z_i\) is obtained as a function of its parents in the graph plus an independent additive noise \(n_i\), i.e.,

(5)#\[\begin{equation}\label{anm} Z_i := f_i\{PA_{Z_i} (\mathcal{G})\} + n_i,i = 1,2,...,d, \end{equation}\]

where \(PA_{Z_i} (\mathcal{G})\) denotes the set of parent variables of \(Z_i\) so that there is an edge from \(Z_j\in PA_{Z_i} (\mathcal{G})\) to \(Z_i\) in the graph, and the noises \(n_i\) are assumed to be jointly independent. Here, Model 1 is a special case of Model 2.

Causal Graphical Model 3: Generalized LSEM#

To handle complex relationship, a generalized version of LSEM has been studied by Yu et al. (2019) as

(6)#\[\begin{equation} f_2(\mathbf{Z})=B^\top f_2(\mathbf{Z}) +f_1(\epsilon), \end{equation}\]

where the parameterized functions \(f_1\) and \(f_2\) effectively perform (possibly nonlinear) transforms on \(\epsilon\) and \(\mathbf{Z}\), respectively. Here, Model 1 is also a special case of Model 3.

Causal Discovery Methods To Be Detailed#

For Paradigm 1#

PC algorithm (Spirtes et al., 2000): set the Fisher-z test for conditional independence testing. The implementation is available through the py-causal package at https://github.com/bd2kccd/py-causal, written in highly optimized Java codes. Also see examples here https://github.com/bd2kccd/py-causal/blob/development/example/py-causal%20-%20PC-ALL%20in%20Action.ipynb.
ICA-LiNGAM (Shimizu et al., 2006): The ICA-LiNGAM assumes linear non-Gaussian additive model to recover the weighted adjacency matrix. The ICA-LiNGAM is implemented with default hyper-parameters through the lingam package for all settings. See their repository at https://github.com/cdt15/lingam.
NOTEARS (Zheng et al., 2018): The NOTEARS estimates the weighted adjacency matrix by formulating the optimization with an acyclicity constraint. The implementation is available at their repository at https://github.com/xunzheng/notears.
DAG-GNN (Yu et al., 2019): The DAG-GNN incorporates the variational auto-encoder into causal discovery with a modified smooth characterization on acyclicity in the evidence lower bound as the loss function. Codes are available at their repository at https://github.com/ fishmoon1234/DAG-GNN based on PyTorch (Paszke et al., 2017).
ANOCE-CVAE: The ANOCE-CVAE is constrained causal structure learning method by incorporating a novel identification constraint that specifies the temporal causal relationship of variables. The code is publicly available at an anonymous repository at https://github.com/anoce-cvae/ANOCE-CVAE.

For Paradigms 2&3#

Granger Causality - Granger (1969)
Time series fast PC - Entner & Hoyer (2010)
Momentary conditional independence (MCI) - Runge et al. (2019)
Vector Autoregressive (VAR) -LiNGAM - Hyvärinen et al. (2010)
Time-series Models with Independent Noise (TiMINo) - Peters et al. (2013)
Dynamic NOTEARS (DYNOTEARS) - Pamfil et al. (2020): Estimate contemporaneous (intra-slice) and time-lagged (interslice) relationships between variables in a time-series.
NTS-NOTEARS - Sun et al. (2021): Use neural networks to capture nonparametric time series data along with ensuring the acyclicity property of a DAG.

References#

[1] Judea Pearl et al. Causal inference in statistics: An overview. Statistics surveys, 3:96–146, 2009.

[2] Pater Spirtes, Clark Glymour, Richard Scheines, Stuart Kauffman, Valerio Aimale, and Frank Wimberly. Constructing bayesian network models of gene expression networks from microarray data. 2000.

[3] Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.

[4] Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203, 2018.

[5] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.

[6] Peter Bühlmann, Jonas Peters, Jan Ernest, et al. Cam: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.

[7] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.

[8] Joseph Ramsey, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International journal of data science and analytics, 3(2):121–129, 2017.

[9] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, pp. 9472–9483, 2018.

[10] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. Dag-gnn: Dag structure learning with graph neural networks. arXiv preprint arXiv:1904.10098, 2019.

[11] Shengyu Zhu and Zhitang Chen. Causal discovery with reinforcement learning. arXiv preprint arXiv:1906.04477, 2019.

[12] Cai, Hengrui, Rui Song, and Wenbin Lu. “ANOCE: Analysis of Causal Effects with Multiple Mediators via Constrained Structural Learning.” International Conference on Learning Representations. 2020.