ATE Estimation#

Average treatment effect (ATE) is a measure used to compare treatments (or interventions) in randomized experiments, policy evaluation, and medical trials. As introduced in the preliminary section, it quantifies the difference in some reward (outcome) between treatment and control. Under the potential outcomes framework, or equivalently the do-operator notation used in the RL-based literature, our main goal is to estimate and make inference on

(14)#\[\begin{equation} \text{ATE} = \mathbb{E}[R^*(1) - R^*(0)] = \mathbb{E}[ R|do(A=1)] - \mathbb{E}[ R|do(A=0)]. \end{equation}\]
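For intuition, the following minimal sketch simulates a toy population in which both potential outcomes are known for every unit, so the true ATE can be computed directly. The data-generating process and variable names are purely illustrative.

import numpy as np

# toy population (illustrative): both potential outcomes are known,
# so the true ATE = E[R*(1) - R*(0)] is directly computable
rng = np.random.default_rng(42)
n_toy = 100000
S_toy = rng.normal(size=n_toy)                   # baseline covariate
R0_toy = S_toy + rng.normal(size=n_toy)          # potential outcome R*(0)
R1_toy = S_toy + 0.5 + rng.normal(size=n_toy)    # potential outcome R*(1)
np.mean(R1_toy - R0_toy)                         # true ATE, approximately 0.5

In practice, of course, only one of the two potential outcomes is observed for each unit, which is why the assumptions below are needed.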

Assumptions#

Under three common assumptions in causal inference, we can identify, i.e., consistently estimate, the potential outcomes \(\{R^*(0),R^*(1)\}\) from the observed data. These assumptions are 1) consistency, 2) no unmeasured confounders (NUC), and 3) positivity.

  • SUTVA (Stable Unit Treatment Value Assumption, or Consistency Assumption)

(15)#\[\begin{align} R_i = R_i^*(1) A_i + R_i^*(0) (1-A_i), i = 1, \cdots, n. \end{align}\]

That is, the actual response observed for the \(i\)-th individual in our sample, \(R_i\), who received treatment \(A_i\), is the same as the potential outcome for that assigned treatment, regardless of the experimental conditions used to assign treatment. This assumption also implies that there is no interference between subjects in the population: the observed response of any individual is not affected by the responses of other individuals.

  • NUC (No Unmeasured Confounders Assumption, or Strong Ignorability)

(16)#\[\begin{align} \{R^*(0), R^*(1)\} \perp\!\!\!\perp A|S \end{align}\]

We cannot use the data in an observational study to either confirm or refute this assumption. However, if we believe that \(S\) contains all the relevant information about the treatment assignment, then it is reasonable to make the above assumption.

  • Positivity Assumption

(17)#\[\begin{align} 0 < P(A=1|S=s) < 1 \end{align}\]

This assumption ensures that for any given individual in treatment group \(1\), we are able to find a similar individual in treatment group \(0\), and vice versa.

We remark that these three assumptions are commonly imposed in the causal inference and individualized treatment regimes literature [1]. Moreover, the NUC and positivity assumptions are automatically satisfied in randomized studies, where the behavior policy is usually a strictly positive function independent of \(s\).
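In observational studies, positivity is often assessed informally by checking the overlap of estimated propensity scores between the two treatment groups. Below is a minimal sketch of such an overlap check on synthetic data (the data-generating process and variable names are illustrative).

import numpy as np
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt

# overlap check (illustrative): fit a propensity model and compare the
# distributions of the estimated scores across the two treatment groups
rng = np.random.default_rng(0)
S_sim = rng.normal(size=(5000, 3))
A_sim = rng.binomial(1, 1/(1 + np.exp(-S_sim[:, 0])))   # treatment depends on S
ps = LogisticRegression().fit(S_sim, A_sim).predict_proba(S_sim)[:, 1]
plt.hist(ps[A_sim == 1], bins=30, alpha=0.5, density=True, label='A=1')
plt.hist(ps[A_sim == 0], bins=30, alpha=0.5, density=True, label='A=0')
plt.xlabel('estimated propensity score'); plt.legend(); plt.show()

Severe non-overlap, with estimated scores piling up near 0 or 1 in only one group, signals a practical violation of positivity.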

Identification#

Based on the three assumptions above, we can rewrite the ATE as a function of the observed data. The details are shown below:

(18)#\[\begin{equation} \begin{aligned} \mathbb{E}_{R^*(1)}[R^*(1)] &= \mathbb{E}_S[\mathbb{E}_{R^*(1)}[R^*(1)|S]] \quad \text{(Law of Total Expectation)} \\ &= \mathbb{E}_S[\mathbb{E}_{R^*(1)}[R^*(1)| S,A=1]] \quad \text{(NUC: $\{R^*(0), R^*(1)\} \perp\!\!\!\perp A|S$)}\\ &= \mathbb{E}_S[\mathbb{E}_{R}[R|S,A=1]] \quad\quad \text{(SUTVA: $R_i = R_i^*(1) A_i + R_i^*(0) (1-A_i)$)} \end{aligned} \end{equation}\]

Similarly, we can show that \(\mathbb{E}_{R^*(0)}[R^*(0)] = \mathbb{E}_S[\mathbb{E}_{R}[R|S,A=0]]\). Hence, the average treatment effect can be written as

(19)#\[\begin{equation} \begin{aligned} \text{ATE} = \mathbb{E}_S[\mathbb{E}_{R}[R|S,A=1]] - \mathbb{E}_S[\mathbb{E}_{R}[R| S,A=0]] \end{aligned} \end{equation}\]

where the right-hand side no longer involves the potential-outcome notation and thus can be identified purely from the observed data.
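The identification result can be checked numerically: the sketch below simulates confounded data with a known ATE of 0.5 and recovers it from the observed \((S, A, R)\) alone by plugging a fitted outcome regression into the formula above (the data-generating process is illustrative).

import numpy as np
from sklearn.linear_model import LinearRegression

# confounded synthetic data with true ATE = 0.5 (illustrative)
rng = np.random.default_rng(1)
n_sim = 200000
S_sim = rng.normal(size=n_sim)
A_sim = rng.binomial(1, 1/(1 + np.exp(-S_sim)))       # treatment depends on S
R_sim = S_sim + 0.5*A_sim + rng.normal(size=n_sim)    # observed outcome

# plug-in estimate: fit E[R|S,A], then average mu(S,1) - mu(S,0) over S
mu = LinearRegression().fit(np.column_stack([S_sim, A_sim]), R_sim)
np.mean(mu.predict(np.column_stack([S_sim, np.ones(n_sim)]))
        - mu.predict(np.column_stack([S_sim, np.zeros(n_sim)])))   # close to 0.5

In contrast, the naive difference in group means \(E[R|A=1]-E[R|A=0]\) would be biased here, because \(S\) confounds both the treatment and the outcome.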

Estimation#

In this section, we introduce three categories of estimators that are widely used for ATE estimation: the direct method (DM), the importance sampling (IS) estimator, and the doubly robust (DR) estimator.

1. Direct Method#

The first approach to estimating the average treatment effect is direct model fitting based on the identification result, also known as the outcome regression approach.

Specifically, we can posit a model \(E[R|S,A] = \mu(S, A;\gamma)\) and estimate the parameter \(\gamma\) via least squares or any other machine-learning-based model-fitting tool. The DM estimator of the ATE is then given by

(20)#\[\begin{equation} \begin{aligned} \widehat{\text{ATE}}_{\text{DM}} = \frac{1}{n}\sum_{i=1}^n \{\mu(S_i, 1;\hat{\gamma}) - \mu(S_i, 0; \hat{\gamma})\} \end{aligned} \end{equation}\]

This estimation procedure, as will be detailed in later sections, is the same as implementing the S-learner and averaging over all heterogeneous treatment effects to obtain the ATE. If we instead fit the regression models separately for \(A=1\) and \(A=0\), the resulting HTE estimation method becomes the T-learner (a minimal sketch is given below). We will elaborate on both in the next section.
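As a preview, here is a minimal sketch of the T-learner variant, assuming `S_mat`, `A_vec`, and `R_vec` are NumPy arrays of covariates, binary treatments, and outcomes (hypothetical names):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ate_t_learner(S_mat, A_vec, R_vec):
    # T-learner sketch: fit separate outcome models for each treatment arm,
    # then average the difference of their predictions over all units
    mu1 = GradientBoostingRegressor().fit(S_mat[A_vec == 1], R_vec[A_vec == 1])
    mu0 = GradientBoostingRegressor().fit(S_mat[A_vec == 0], R_vec[A_vec == 0])
    return np.mean(mu1.predict(S_mat) - mu0.predict(S_mat))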

Next, we implement direct method on the MovieLens dataset to provide users with a detailed ATE estimation procedure.

# import the packages we need
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from causaldm.learners.CEL.Single_Stage import _env_getdata_CEL

# Get the MovieLens data
MovieLens_CEL = _env_getdata_CEL.get_movielens_CEL()
MovieLens_CEL.pop(MovieLens_CEL.columns[0])  # drop the first (unused) column

# Remove irrelevant columns
MovieLens_CEL = MovieLens_CEL[MovieLens_CEL.columns.drop(['Comedy','Action', 'Thriller', 'Sci-Fi'])]
MovieLens_CEL
| | user_id | movie_id | rating | age | Drama | gender_M | occupation_academic/educator | occupation_college/grad student | occupation_executive/managerial | occupation_other | occupation_technician/engineer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 1193.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 48.0 | 919.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 48.0 | 527.0 | 5.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 48.0 | 1721.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 48.0 | 150.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65637 | 5878.0 | 3300.0 | 2.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 65638 | 5878.0 | 1391.0 | 1.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 65639 | 5878.0 | 185.0 | 4.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 65640 | 5878.0 | 2232.0 | 1.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 65641 | 5878.0 | 426.0 | 3.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |

65642 rows × 11 columns

n = len(MovieLens_CEL)
userinfo_index = np.array([3,5,6,7,8,9,10])  # covariate (state) columns: age, gender, occupation dummies
SandA = MovieLens_CEL.iloc[:, np.array([3,4,5,6,7,8,9,10])]  # state S plus the treatment A (Drama)
# S-learner: fit a single outcome model mu(S, A; gamma) on states and treatment
np.random.seed(0)
S_learner = GradientBoostingRegressor(max_depth=5)
S_learner.fit(SandA, MovieLens_CEL['rating'])

# counterfactual design matrices with the treatment set to 1 and to 0 for everyone
SandA_all1 = SandA.copy()
SandA_all0 = SandA.copy()
SandA_all1.iloc[:,1] = np.ones(n)   # column 1 of SandA is the treatment indicator (Drama)
SandA_all0.iloc[:,1] = np.zeros(n)

ATE_DM = np.sum(S_learner.predict(SandA_all1) - S_learner.predict(SandA_all0))/n
ATE_DM
0.35633351350901915

That is, people are more inclined to give higher ratings to dramas than to science fiction movies. The expected rating difference estimated by the direct method is 0.356.

2. Importance Sampling Estimator#

The second category is the importance sampling (IS) estimator, known as the inverse probability weighting (IPW) estimator in the causal inference literature (its augmented version, AIPW, corresponds to the doubly robust estimator introduced later).

Before we proceed, let’s define the propensity score as below:

(21)#\[\begin{equation} \begin{aligned} \pi(S) = P(A=1|S). \end{aligned} \end{equation}\]

In the case where there are only two treatments, it refers to the propensity of getting one of the treatments as a function of the covariates. One of the attractive features of the propensity score is that given the SUTVA and strong ignorability assumptions, we have

(22)#\[\begin{align} \{R^*(1), R^*(0)\} \perp\!\!\!\perp A|\pi(S) \end{align}\]

In cases where \(S\) is high dimensional and it is difficult to fit an outcome regression model for \(E[R|S,A]\), we can alternatively fit models for \(E[R|\pi(S),A]\) based on the result above.
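A minimal sketch of this dimension-reduction idea, with the same hypothetical array names as above, could look as follows:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

def ate_on_propensity(S_mat, A_vec, R_vec):
    # replace the (possibly high-dimensional) S in the outcome model
    # with the scalar estimated propensity score pi(S)
    pi_hat = LogisticRegression().fit(S_mat, A_vec).predict_proba(S_mat)[:, 1]
    mu = GradientBoostingRegressor().fit(np.column_stack([pi_hat, A_vec]), R_vec)
    pred1 = mu.predict(np.column_stack([pi_hat, np.ones(len(A_vec))]))
    pred0 = mu.predict(np.column_stack([pi_hat, np.zeros(len(A_vec))]))
    return np.mean(pred1 - pred0)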

The motivation for IPW is as follows. Since the propensity (probability) of receiving treatment \(1\) is \(\pi(S)\) for an individual with baseline covariates \(S\), every individual with covariates \(S_i\) who was observed to receive treatment 1 in our sample is weighted by \(1/\pi(S_i)\), so that their response represents not only themselves but also other similar individuals who did not receive treatment 1. More formally, we will show that \(\frac{AR}{\pi(S)}\) is unbiased for \(E[R^*(1)]\). Indeed, given the positivity assumption,

(23)#\[\begin{equation} \begin{aligned} \mathbb{E}_{R^*(1)}[R^*(1)] &= \mathbb{E}_{R^*(1)}\left[\mathbb{E}_{S,A}\bigg\{\frac{A}{\pi(S)}\Big| S\bigg\}R^*(1)\right] \quad \text{($\mathbb{E}_{S,A}\bigg\{\frac{A}{\pi(S)}\Big| S\bigg\}=1$)}\\ &= \mathbb{E}_{S}\left[\mathbb{E}_{A,R^*(1)}\bigg\{\frac{A*R^*(1)}{\pi(S)}\Big|S\bigg\}\right] \quad \text{(NUC: $\{R^*(1), R^*(0)\} \perp\!\!\!\perp A|\pi(S)$)}\\ &= \mathbb{E}_{A,R^*(1),S}\left[\frac{A*R^*(1)}{\pi(S)}\right] \quad \text{(Law of Total Expectation)}\\ &= \mathbb{E}_{A,R,S}\left[\frac{A*R}{\pi(S)}\right] \quad \text{(SUTVA: $R_i = R_i^*(1) A_i + R_i^*(0) (1-A_i)$)} \end{aligned} \end{equation}\]

Similarly, by flipping the roles of treatment (\(A=1\)) and control (\(A=0\)), we have

(24)#\[\begin{equation} \begin{aligned} \mathbb{E}_{R^*(0)}[R^*(0)] = \mathbb{E}_{A,R,S}\left[\frac{(1-A)R}{1-\pi(S)} \right]. \end{aligned} \end{equation}\]

Consequently, the IS (or IPW) estimator for the estimation of ATE is given by

(25)#\[\begin{equation} \begin{aligned} \widehat{\text{ATE}}_{\text{IS}} =\frac{1}{n}\sum_{i=1}^n \left\{\frac{A_iR_i}{\hat\pi(S_i)} - \frac{(1-A_i)R_i}{1-\hat\pi(S_i)} \right\}. \end{aligned} \end{equation}\]

# propensity score model fitting
from sklearn.linear_model import LogisticRegression

ps_model = LogisticRegression()
ps_model.fit(MovieLens_CEL.iloc[:, userinfo_index], MovieLens_CEL['Drama'])

# predicted propensities: pi_S[:,1] = P(A=1|S) and pi_S[:,0] = 1 - P(A=1|S)
pi_S = ps_model.predict_proba(MovieLens_CEL.iloc[:, userinfo_index])
ATE_IS = np.sum((MovieLens_CEL['Drama']/pi_S[:,1] - (1-MovieLens_CEL['Drama'])/pi_S[:,0])*MovieLens_CEL['rating'])/n
ATE_IS
0.35564234271940287

People watching dramas are expected to give ratings about 0.356 points higher than those watching Sci-Fi.
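As a side note, the IS estimator can be sensitive to extreme weights when some propensity scores are close to 0 or 1. A commonly used stabilized variant, the self-normalized (Hájek) IPW estimator, normalizes the weights within each arm; a sketch using the quantities already computed above (the variable names here are new):

# self-normalized (Hajek) IPW: normalize the inverse-propensity weights
# within each treatment arm; often more stable under extreme weights
A_obs = MovieLens_CEL['Drama']
w1 = A_obs/pi_S[:,1]
w0 = (1 - A_obs)/pi_S[:,0]
ATE_IS_norm = np.sum(w1*MovieLens_CEL['rating'])/np.sum(w1) - np.sum(w0*MovieLens_CEL['rating'])/np.sum(w0)
ATE_IS_norm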

3. Doubly Robust Estimator#

The third type of estimator is the doubly robust (DR) estimator. The DR estimator combines the DM and IS estimators and is more robust to model misspecification.

(26)#\[\begin{equation} \begin{aligned} \widehat{\text{ATE}}_{\text{DR}} = \frac{1}{n}\sum_{i=1}^n \left\{\mu(S_i,1;\hat{\gamma})- \mu(S_i,0;\hat{\gamma})+\frac{A_i(R_i - \mu(S_i,1;\hat{\gamma}))}{\hat{\pi}(S_i)} - \frac{(1-A_i)(R_i-\mu(S_i,0;\hat{\gamma}))}{1-\hat{\pi}(S_i)} \right\}. \end{aligned} \end{equation}\]

To be more specific, the first two terms on the RHS form the direct estimator, and the last two terms serve as an augmentation that corrects the bias arising from the outcome regression models. When either the outcome regression models or the propensity scores are correctly specified, \(\widehat{\text{ATE}}_{\text{DR}}\) can be shown to be consistent.
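To see the double robustness in action, the sketch below uses synthetic data with a true ATE of 0.5 and a deliberately misspecified propensity model (the true propensity depends on \(S^2\)); the IS estimate is generally biased, while the DR estimate stays close to the truth because the outcome model is correctly specified. The data-generating process is illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# synthetic data with true ATE = 0.5; the true propensity depends on S**2
rng = np.random.default_rng(3)
n_sim = 200000
S_sim = rng.normal(size=n_sim)
A_sim = rng.binomial(1, 1/(1 + np.exp(-(S_sim + S_sim**2 - 1))))
R_sim = S_sim + 0.5*A_sim + rng.normal(size=n_sim)

# misspecified propensity model (linear in S only), correct outcome model
pi_hat = LogisticRegression().fit(S_sim.reshape(-1, 1), A_sim).predict_proba(S_sim.reshape(-1, 1))[:, 1]
mu = LinearRegression().fit(np.column_stack([S_sim, A_sim]), R_sim)
m1 = mu.predict(np.column_stack([S_sim, np.ones(n_sim)]))
m0 = mu.predict(np.column_stack([S_sim, np.zeros(n_sim)]))
print(np.mean(A_sim*R_sim/pi_hat - (1 - A_sim)*R_sim/(1 - pi_hat)))   # IS: generally biased
print(np.mean(m1 - m0 + A_sim*(R_sim - m1)/pi_hat
              - (1 - A_sim)*(R_sim - m0)/(1 - pi_hat)))               # DR: close to 0.5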

Under some mild entropy conditions, or with sample splitting, the DR estimator is also semiparametrically efficient when both \(\hat{\mu}\) and \(\hat{\pi}\) converge at rates faster than \(n^{-1/4}\). Details can be found in Chernozhukov et al. (2018) [2].

# augmentation term of the DR estimator (bias correction for the DM estimate)
augmentation = np.sum(MovieLens_CEL['Drama']*(MovieLens_CEL['rating']-S_learner.predict(SandA_all1))/pi_S[:,1] - (1-MovieLens_CEL['Drama'])*(MovieLens_CEL['rating']-S_learner.predict(SandA_all0))/pi_S[:,0])/n
augmentation
0.000343355404777341
# combine the DM estimator and the augmentation term
ATE_DR = ATE_DM + augmentation
ATE_DR
0.3566768689137965

After the doubly robust bias correction, the estimated rating advantage of Drama over Sci-Fi is slightly larger, at about 0.357.
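Finally, as mentioned above, sample splitting (cross-fitting) is commonly paired with the DR estimator when flexible machine learning models are used for the nuisance functions, in the spirit of [2]. Below is a minimal two-fold cross-fitting sketch on the same MovieLens data; the fold structure and model choices are illustrative.

from sklearn.model_selection import KFold

# two-fold cross-fitting: nuisance models are fit on one fold and
# evaluated on the other, then the DR scores are averaged over all units
S_mat = MovieLens_CEL.iloc[:, userinfo_index].values
A_vec = MovieLens_CEL['Drama'].values
R_vec = MovieLens_CEL['rating'].values
dr_scores = np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(S_mat):
    mu_cf = GradientBoostingRegressor(max_depth=5).fit(
        np.column_stack([S_mat[train], A_vec[train]]), R_vec[train])
    p_cf = LogisticRegression().fit(S_mat[train], A_vec[train]).predict_proba(S_mat[test])[:, 1]
    m1 = mu_cf.predict(np.column_stack([S_mat[test], np.ones(len(test))]))
    m0 = mu_cf.predict(np.column_stack([S_mat[test], np.zeros(len(test))]))
    dr_scores[test] = (m1 - m0 + A_vec[test]*(R_vec[test] - m1)/p_cf
                       - (1 - A_vec[test])*(R_vec[test] - m0)/(1 - p_cf))
dr_scores.mean()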

References#

  1. Zhang, B., A. A. Tsiatis, E. B. Laber, and M. Davidian (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 (3), 681–694.

  2. Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21(1), C1–C68.