Mimic3 Demo-Ver2#
Mimic3 is a large, open-access, anonymized, single-center database containing comprehensive clinical data on 61,532 critical care admissions collected at a Boston teaching hospital from 2001 to 2012. The dataset used here consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.
Due to privacy concerns, we use a subset of the original Mimic3 data that is publicly available on Kaggle. For illustration purposes, we selected several representative features for the following analysis:
Glucose: glucose values of patients
PaO2_FiO2: The partial pressure of oxygen (PaO2)/fraction of oxygen delivered (FIO2) ratio.
SOFA: Sepsis-related Organ Failure Assessment score to describe organ dysfunction/failure.
iv_input: the volume of fluids that have been administered to the patient.
died_within_48h_of_out_time: whether the patient died within 48 hours of the out time (the end of the stay).
In the next sections, we start with causal discovery learning to learn a causal diagram from the data, and then quantify the effect of the treatment (‘iv_input’) on the outcome (mortality status, denoted by the ‘died_within_48h_of_out_time’ variable in the data) through causal effect learning.
Causal Discovery Learning#
%load_ext autoreload
%autoreload 2
##### Import Packages
from utils import *
from notear import *
from numpy.random import randn
from random import seed as rseed
from numpy.random import seed as npseed
import numpy as np
import pandas as pd
import os
import pickle
import random
import math
import time
from datetime import datetime
import matplotlib.pyplot as plt
from multiprocessing import Pool
from tqdm import tqdm
from functools import partial
os.environ["OMP_NUM_THREADS"] = "1"
mimic3_data = pd.read_csv('mimic3_single_stage.csv')
mimic3_data.loc[mimic3_data['Died within 48H'] == -1.0,'Died within 48H']=0
mimic3_data.head(6)
| | icustayid | Glucose | PaO2 | PaO2_FiO2 | IV Input | SOFA | Died within 48H |
|---|---|---|---|---|---|---|---|
| 0 | 1006 | 152.000000 | 100.200000 | 137.081590 | 2.800000 | 7.600000 | 0.0 |
| 1 | 1204 | 138.794872 | 127.782051 | 430.668956 | 1.153846 | 6.153846 | 1.0 |
| 2 | 4132 | 129.364286 | 123.956461 | 252.883864 | 3.000000 | 4.600000 | 0.0 |
| 3 | 4201 | 145.580087 | 118.083333 | 539.065657 | 1.363636 | 5.818182 | 1.0 |
| 4 | 5170 | 174.525000 | 147.350198 | 394.616727 | 2.437500 | 4.125000 | 1.0 |
| 5 | 6504 | 106.081169 | 88.836364 | 423.030303 | 0.363636 | 5.090909 | 1.0 |
# ----------- Estimated DAG based on NOTEARS
mimic3_data_final = mimic3_data
selected = ['Glucose', 'PaO2_FiO2', 'IV Input', 'SOFA', 'Died within 48H']
sample_demo = mimic3_data_final[selected]
est_mt = notears_linear(np.array(sample_demo), lambda1=0, loss_type='l2',w_threshold=0.1)
# ----------- Refit Associated Matrix under LSEM
est_mt, _ = refit(sample_demo, est_mt, selected)
# ----------- Plot Associated Estimated DAG based on NOTEARS
#plot_net(est_mt, labels_name=selected, file_name='demo_res_net')
#topo_list = np.array(selected)[list(nx.topological_sort(nx.DiGraph(est_mt)))].tolist()
#topo_list.reverse()
#print('Topological order from top to bottom:\n', topo_list)
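The `notears_linear` call above comes from the local `notear` module, whose internals are not shown here. As a rough sketch of the idea behind NOTEARS-style methods, the snippet below evaluates the acyclicity score h(W) = tr(exp(W∘W)) - d, which is zero exactly when the weighted adjacency matrix W describes a DAG, and then applies a `w_threshold`-style cutoff that zeroes small weights. The helper names `acyclicity` and `prune`, the truncated power series used in place of a library matrix exponential, and the 3-node matrices are all illustrative assumptions; they are neither the module's actual code nor estimates from the Mimic3 data.

```python
import numpy as np

def acyclicity(W):
    """Score h(W) = tr(exp(W * W)) - d, approximated by the first d series terms.

    For a DAG the elementwise square W * W is nilpotent, so the truncated sum is
    exact and equals 0. For a cyclic graph the truncation is a positive lower bound.
    """
    d = W.shape[0]
    M = W * W                    # elementwise square: nonnegative edge strengths
    term = np.eye(d)             # M^0 / 0!
    total = np.trace(term)
    for k in range(1, d + 1):
        term = term @ M / k      # M^k / k!
        total += np.trace(term)
    return total - d

def prune(W, w_threshold=0.1):
    """Zero out weights whose absolute value is below the threshold."""
    W = W.copy()
    W[np.abs(W) < w_threshold] = 0.0
    return W

# Hypothetical 3-node weight matrices (made-up numbers).
W_dag = np.array([[0.0, 0.8, 0.05],
                  [0.0, 0.0, -0.6],
                  [0.0, 0.0, 0.0]])
W_cyclic = W_dag.copy()
W_cyclic[2, 0] = 0.5             # extra entry that closes a cycle among the nodes

h_dag = acyclicity(W_dag)        # zero: no cycles
h_cyclic = acyclicity(W_cyclic)  # positive: at least one cycle
W_pruned = prune(W_dag)          # the 0.05 entry is removed, two edges remain
print(h_dag, h_cyclic > 0, np.count_nonzero(W_pruned))
```

In an optimizer, this score would be driven to zero as a constraint while the least-squares loss is minimized; the sketch only evaluates it for fixed matrices.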
Causal Effect Learning#
For simplicity, we split the data into two treatment groups: a “High-IV-Input” group, where the IV input is greater than or equal to \(1\), and a “Low-IV-Input” group, where the IV input is smaller than \(1\). We are interested in whether a higher level of fluid intake decreases the SOFA score and the 48-hour death rate of patients.
Motivated by this problem, we set the “High-IV-Input” group as the treatment group with \(A=1\), and set the “Low-IV-Input” group as the control group with \(A=0\).
data_CEL_selected = mimic3_data.copy()
data_CEL_selected.loc[data_CEL_selected['IV Input']>=1,'IV Input']=1 # binarize the IV input: values >= 1 become 1 (treatment group)
data_CEL_selected.loc[data_CEL_selected['IV Input']<1,'IV Input']=0 # values < 1 become 0 (control group)
data_CEL_selected.head(6)
| | icustayid | Glucose | PaO2 | PaO2_FiO2 | IV Input | SOFA | Died within 48H |
|---|---|---|---|---|---|---|---|
| 0 | 1006 | 152.000000 | 100.200000 | 137.081590 | 1.0 | 7.600000 | 0.0 |
| 1 | 1204 | 138.794872 | 127.782051 | 430.668956 | 1.0 | 6.153846 | 1.0 |
| 2 | 4132 | 129.364286 | 123.956461 | 252.883864 | 1.0 | 4.600000 | 0.0 |
| 3 | 4201 | 145.580087 | 118.083333 | 539.065657 | 1.0 | 5.818182 | 1.0 |
| 4 | 5170 | 174.525000 | 147.350198 | 394.616727 | 1.0 | 4.125000 | 1.0 |
| 5 | 6504 | 106.081169 | 88.836364 | 423.030303 | 0.0 | 5.090909 | 1.0 |
print( "The number of patients in treatment group is ", len(np.where(mimic3_data['IV Input']>=1)[0]), ";\n", "The number of patients in control group is ", len(np.where(mimic3_data['IV Input']<1)[0]),".")
The number of patients in treatment group is 41 ;
The number of patients in control group is 16 .
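The two-step assignment above can also be written as a single vectorized comparison. The sketch below is only an illustration: it uses the six IV Input values from the first rows displayed earlier rather than the full file, and the variable names are made up for this example.

```python
import numpy as np

# IV Input values of the six rows shown in the first table above.
iv_input = np.array([2.8, 1.153846, 3.0, 1.363636, 2.4375, 0.363636])

# A = 1 ("High-IV-Input") when the value is >= 1, otherwise A = 0 ("Low-IV-Input").
A = (iv_input >= 1).astype(float)

n_treated = int(A.sum())
n_control = int(len(A) - A.sum())
print(A, n_treated, n_control)
```

The resulting labels match the binarized `IV Input` column shown for the same six rows.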
Regard ‘Died within 48H’ as the outcome variable#
n = len(data_CEL_selected)
userinfo_index = ['Glucose', 'PaO2_FiO2']
SandA_index = ['Glucose', 'PaO2_FiO2', 'IV Input']
# outcome: Died within 48H (binary)
# treatment: IV Input (binary)
# Glucose, PaO2_FiO2: covariates
SandA = data_CEL_selected.loc[:, SandA_index]
#from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
np.random.seed(0)
S_learner = GradientBoostingClassifier(max_depth=2)
#S_learner = LinearRegression()
#SandA = np.hstack((S.to_numpy(),A.to_numpy().reshape(-1,1)))
S_learner.fit(SandA, data_CEL_selected['Died within 48H'])
GradientBoostingClassifier(max_depth=2)
SandA_all1 = SandA.copy()
SandA_all0 = SandA.copy()
SandA_all1.loc[:,'IV Input']=np.ones(n)
SandA_all0.loc[:,'IV Input']=np.zeros(n)
HTE_S_learner = S_learner.predict_proba(SandA_all1)[:,1] - S_learner.predict_proba(SandA_all0)[:,1]
HTE_S_learner
array([-0.7935056 , -0.05573883, -0.17789437, -0.01291861, -0.00327331,
-0.03616324, -0.15807825, -0.02133172, -0.00817926, -0.01291861,
-0.02133172, -0.03642929, -0.00494878, -0.00934406, -0.00510541,
-0.00866307, -0.70943365, -0.00494878, -0.05460816, -0.17366209,
-0.01714046, -0.01291861, -0.00543908, -0.00765503, -0.84943974,
-0.00934406, -0.00934406, -0.00934406, -0.00543908, -0.35520862,
-0.4361084 , -0.01714046, -0.01714046, -0.00543908, -0.00494878,
-0.00866307, -0.00567393, -0.00327331, -0.02133172, -0.00543908,
-0.00934406, -0.05460816, -0.05460816, -0.00327331, -0.10858887,
-0.54647282, -0.00934406, -0.01714046, -0.01291861, -0.00327331,
-0.00934406, -0.26561005, -0.0139195 , -0.02708104, -0.00327331,
-0.35007447, -0.87203275])
As the estimated treatment effects show, every patient's effect is negative: the fitted S-learner predicts a lower probability of death within 48 hours under high IV input than under low IV input. With only 57 patients, this estimate may reflect selection bias in this small dataset rather than a true benefit, so it should be read with caution. In particular, it does not rule out that unnecessary fluid intake could raise the death rate for some patients.
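The S-learner recipe used here (fit one model on covariates plus treatment, then predict twice with the treatment column forced to 1 and to 0) can be sketched on synthetic data. This is a minimal illustration under stated assumptions: a least-squares linear model with an interaction term stands in for the gradient-boosting classifier, the outcome is continuous rather than binary, and the data and the `features` helper are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 2.0, size=n)          # one synthetic covariate
a = rng.integers(0, 2, size=n)             # synthetic binary treatment
# Synthetic outcome whose true treatment effect is -1.0 + 0.5 * x.
y = 1.0 + 2.0 * x + a * (-1.0 + 0.5 * x) + rng.normal(scale=0.1, size=n)

def features(x, a):
    """Single design matrix: intercept, covariate, treatment, and their interaction."""
    return np.column_stack([np.ones_like(x), x, a, x * a])

# One model fitted on covariates and treatment together (the "S" in S-learner).
coef, *_ = np.linalg.lstsq(features(x, a), y, rcond=None)

# Predict for everyone under A=1 and under A=0, then take the difference.
hte = features(x, np.ones(n)) @ coef - features(x, np.zeros(n)) @ coef
ate = hte.mean()
print(round(ate, 3))
```

Because the treatment enters the single model as a feature, the per-unit effect is simply the difference between two predictions from that same model.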
(S_learner.predict(SandA_all1) - S_learner.predict(SandA_all0))
array([-1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., -1., 0., 0., 0., 0., 0., 0., 0., -1., 0.,
0., 0., 0., 0., -1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., -1., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., -1.])
Specifically, the S-learner's predicted labels change for 6 patients: each is predicted to survive under ‘High-IV-Input’ but to die within 48 hours under ‘Low-IV-Input’. The model therefore advises against switching these 6 patients from the ‘High-IV-Input’ group to the ‘Low-IV-Input’ group.
sum(HTE_S_learner)/len(data_CEL_selected)
-0.11361078792742538
Overall, IV Input is expected to decrease the death-within-48-hours rate of all patients by 11.36 percentage points on average.
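The average printed above is a difference of predicted death probabilities, so its sign gives the direction of the effect and multiplying by 100 gives percentage points. A short check using the printed value (the variable names are illustrative):

```python
# Average of the per-patient effects, copied from the output above.
ate = -0.11361078792742538

ate_pct_points = round(100 * ate, 2)              # probability scale -> percentage points
direction = "decrease" if ate < 0 else "increase"
print(f"Predicted {direction} of {abs(ate_pct_points)} percentage points")
```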
Regard ‘SOFA’ as the outcome variable#
userinfo_index = np.array([1,2])
# outcome: SOFA score (treated as continuous). The smaller, the better
# treatment: iv_input (binary)
# covariates: positional columns 1 and 2 of data_CEL_selected, i.e. Glucose and PaO2 (PaO2_FiO2 is column 3)
data_CEL_selected.head(6)
| | icustayid | Glucose | PaO2 | PaO2_FiO2 | IV Input | SOFA | Died within 48H |
|---|---|---|---|---|---|---|---|
| 0 | 1006 | 152.000000 | 100.200000 | 137.081590 | 1.0 | 7.600000 | 0.0 |
| 1 | 1204 | 138.794872 | 127.782051 | 430.668956 | 1.0 | 6.153846 | 1.0 |
| 2 | 4132 | 129.364286 | 123.956461 | 252.883864 | 1.0 | 4.600000 | 0.0 |
| 3 | 4201 | 145.580087 | 118.083333 | 539.065657 | 1.0 | 5.818182 | 1.0 |
| 4 | 5170 | 174.525000 | 147.350198 | 394.616727 | 1.0 | 4.125000 | 1.0 |
| 5 | 6504 | 106.081169 | 88.836364 | 423.030303 | 0.0 | 5.090909 | 1.0 |
Similarly, we estimate the causal effect of fluid administration on the average SOFA score of patients to see if higher IV input is able to decrease the SOFA score.
#from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
#mu0 = LGBMRegressor(max_depth=2)
#mu1 = LGBMRegressor(max_depth=2)
mu0 = LinearRegression()
mu1 = LinearRegression()
mu0.fit(data_CEL_selected.iloc[np.where(data_CEL_selected['IV Input']==0)[0],userinfo_index],data_CEL_selected.loc[data_CEL_selected['IV Input']==0,'SOFA'] )
mu1.fit(data_CEL_selected.iloc[np.where(data_CEL_selected['IV Input']==1)[0],userinfo_index],data_CEL_selected.loc[data_CEL_selected['IV Input']==1,'SOFA'] )
# estimate the HTE by T-learner
HTE_T_learner = (mu1.predict(data_CEL_selected.iloc[:,userinfo_index]) - mu0.predict(data_CEL_selected.iloc[:,userinfo_index]))
HTE_T_learner
array([-0.20963072, 0.36363165, 0.45134661, 0.13025089, 0.10068802,
0.31679702, -0.36382407, -1.07217217, 1.03274754, 0.28356984,
-0.85172173, -2.90975369, -0.31952385, 0.49851422, 0.1926697 ,
0.08880954, -0.13806398, 0.91018533, -0.38592831, 0.30080782,
0.81897977, 1.22512521, 0.56978246, 1.97367999, -1.06606889,
0.46204494, 0.56211986, 0.36768737, 0.97789444, 0.76346229,
-0.06702501, 0.3791031 , 0.73897488, 0.87357018, 1.85450739,
-0.38658623, -1.55869654, -0.18709482, -1.49089899, 1.05070015,
0.02144515, 0.07582235, 0.02950435, -0.60493756, 1.34331682,
0.13339107, 0.74942471, 0.34136664, 2.20697202, -0.89639628,
0.67296089, 0.52157244, 2.55823332, 0.01110759, -0.05096632,
-0.13785416, -0.00794219])
For 19 of the 57 patients, a higher volume of fluid intake is estimated to decrease the SOFA score, but for the remaining 38 patients the estimated effect is positive, meaning a higher (worse) SOFA score.
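The T-learner used in this section fits one outcome model per treatment arm and takes the difference of their predictions. The sketch below illustrates that recipe on synthetic data with plain NumPy least squares; the data, the true effect, and the helpers `fit_linear` and `predict_linear` are assumptions made for the example and are not part of the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 2))                    # two synthetic covariates
A = rng.integers(0, 2, size=n)                 # synthetic binary treatment
# Synthetic outcome whose true treatment effect is 0.5 + 1.0 * X[:, 0].
Y = (2.0 + X @ np.array([1.0, -0.5])
     + A * (0.5 + 1.0 * X[:, 0])
     + rng.normal(scale=0.1, size=n))

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

mu0 = fit_linear(X[A == 0], Y[A == 0])         # outcome model for the control arm
mu1 = fit_linear(X[A == 1], Y[A == 1])         # outcome model for the treated arm

# Per-unit effect: predicted outcome if treated minus predicted outcome if not.
hte = predict_linear(mu1, X) - predict_linear(mu0, X)
ate = hte.mean()
print(round(ate, 3), int((hte > 0).sum()), int((hte < 0).sum()))
```

Counting positive and negative entries of `hte`, as in the last line, is one simple way to summarize how the estimated effect varies across units.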
sum(HTE_T_learner)/len(data_CEL_selected)
0.23241547432382942
Conclusion: IV Input is expected to increase the SOFA score by 0.232 on average.