Mimic3 Demo-Ver2#

Mimic3 is a large, open-access, anonymized, single-center database consisting of comprehensive clinical data on 61,532 critical care admissions collected at a Boston teaching hospital between 2001 and 2012. The dataset contains 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.

Due to privacy concerns, we utilized a subset of the original Mimic3 data that is publicly available on Kaggle. For illustration purposes, we selected several representative features for the following analysis:

  • Glucose: glucose values of patients

  • PaO2_FiO2: the ratio of the partial pressure of arterial oxygen (PaO2) to the fraction of inspired oxygen (FiO2).

  • SOFA: Sepsis-related Organ Failure Assessment score to describe organ dysfunction/failure.

  • iv_input: the volume of fluids that has been administered to the patient.

  • died_within_48h_of_out_time: the mortality status of the patient, i.e., whether the patient died within 48 hours of the recorded out time.

In the next sections, we will first apply causal discovery learning to learn a causal diagram from the data, and then quantify the effect of the treatment (‘iv_input’) on the outcome (mortality status, denoted by the ‘died_within_48h_of_out_time’ variable in the data) through causal effect learning.

Causal Discovery Learning#

%load_ext autoreload
%autoreload 2

##### Import Packages 
from utils import *
from notear import *
  
from numpy.random import randn
from random import seed as rseed
from numpy.random import seed as npseed

import numpy as np
import pandas as pd
import os
import pickle
import random
import math
import time 

from datetime import datetime

import matplotlib.pyplot as plt

from multiprocessing import Pool
 
from tqdm import tqdm
from functools import partial 
os.environ["OMP_NUM_THREADS"] = "1"
mimic3_data = pd.read_csv('mimic3_single_stage.csv')
# recode 'Died within 48H' from -1 to 0 so that the outcome label is binary
mimic3_data.loc[mimic3_data['Died within 48H'] == -1.0,'Died within 48H']=0
mimic3_data.head(6)
icustayid Glucose PaO2 PaO2_FiO2 IV Input SOFA Died within 48H
0 1006 152.000000 100.200000 137.081590 2.800000 7.600000 0.0
1 1204 138.794872 127.782051 430.668956 1.153846 6.153846 1.0
2 4132 129.364286 123.956461 252.883864 3.000000 4.600000 0.0
3 4201 145.580087 118.083333 539.065657 1.363636 5.818182 1.0
4 5170 174.525000 147.350198 394.616727 2.437500 4.125000 1.0
5 6504 106.081169 88.836364 423.030303 0.363636 5.090909 1.0
# ----------- Estimated DAG based on NOTEARS 

mimic3_data_final = mimic3_data  

selected = ['Glucose', 'PaO2_FiO2', 'IV Input', 'SOFA', 'Died within 48H']

sample_demo = mimic3_data_final[selected]
est_mt = notears_linear(np.array(sample_demo), lambda1=0, loss_type='l2',w_threshold=0.1)
 
# ----------- Refit Associated Matrix under LSEM 

est_mt, _ = refit(sample_demo, est_mt, selected) 

# ----------- Plot Associated Estimated DAG based on NOTEARS 

#plot_net(est_mt, labels_name=selected, file_name='demo_res_net')
#topo_list = np.array(selected)[list(nx.topological_sort(nx.DiGraph(est_mt)))].tolist()
#topo_list.reverse()
#print('Topological order from top to bottom:\n', topo_list)

Figure: estimated causal DAG over the selected variables (MIMIC3.png)
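As a complement to the figure, the estimated edges can also be printed directly from the refitted weight matrix. The sketch below is an addition to the original notebook; it assumes that networkx is available and that a nonzero entry est_mt[i, j] encodes a directed edge from selected[i] to selected[j], which is the convention used by the NOTEARS reference implementation.

import networkx as nx

# list every directed edge of the estimated DAG together with its refitted weight
# (assumes est_mt[i, j] != 0 means an edge selected[i] -> selected[j])
W = np.array(est_mt)
G = nx.DiGraph(W)
for i, j in G.edges():
    print(f"{selected[i]} -> {selected[j]}: weight = {W[i, j]:.3f}")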

Causal Effect Learning#

For simplicity, we split the data into two treatment groups: a “High-IV-Input” group, where the IV input is greater than or equal to \(1\), and a “Low-IV-Input” group, where the IV input is smaller than \(1\). We are interested in whether the higher level of fluid intake is able to decrease the SOFA score and the death rate of patients within 48 hours of administration.

Motivated by this problem, we set the “High-IV-Input” group as the treatment group with \(A=1\), and set the “Low-IV-Input” group as the control group with \(A=0\).

data_CEL_selected = mimic3_data.copy()
data_CEL_selected.loc[data_CEL_selected['IV Input']>=1,'IV Input']=1 # dichotomize IV Input: treatment group (A = 1) if IV Input >= 1
data_CEL_selected.loc[data_CEL_selected['IV Input']<1,'IV Input']=0 # dichotomize IV Input: control group (A = 0) if IV Input < 1

data_CEL_selected.head(6)
icustayid Glucose PaO2 PaO2_FiO2 IV Input SOFA Died within 48H
0 1006 152.000000 100.200000 137.081590 1.0 7.600000 0.0
1 1204 138.794872 127.782051 430.668956 1.0 6.153846 1.0
2 4132 129.364286 123.956461 252.883864 1.0 4.600000 0.0
3 4201 145.580087 118.083333 539.065657 1.0 5.818182 1.0
4 5170 174.525000 147.350198 394.616727 1.0 4.125000 1.0
5 6504 106.081169 88.836364 423.030303 0.0 5.090909 1.0
print( "The number of patients in treatment group is ", len(np.where(mimic3_data['IV Input']>=1)[0]), ";\n", "The number of patients in control group is ", len(np.where(mimic3_data['IV Input']<1)[0]),".")
The number of patients in treatment group is  41 ;
 The number of patients in control group is  16 .

Regard ‘Died within 48H’ as the outcome variable#

n = len(data_CEL_selected)
userinfo_index = ['Glucose', 'PaO2_FiO2']
SandA_index = ['Glucose', 'PaO2_FiO2', 'IV Input']
# outcome: Died within 48H (binary)
# treatment: IV Input (binary)
# Glucose, PaO2_FiO2: covariates
SandA = data_CEL_selected.loc[:, SandA_index]
#from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
# S-learner: fit a single outcome model on the covariates and the treatment jointly
S_learner = GradientBoostingClassifier(max_depth=2)
#S_learner = LinearRegression()
#SandA = np.hstack((S.to_numpy(),A.to_numpy().reshape(-1,1)))
S_learner.fit(SandA, data_CEL_selected['Died within 48H'])
GradientBoostingClassifier(max_depth=2)
# construct counterfactual datasets in which every patient receives the treatment (A = 1) or the control (A = 0)
SandA_all1 = SandA.copy()
SandA_all0 = SandA.copy()
SandA_all1.loc[:,'IV Input']=np.ones(n)
SandA_all0.loc[:,'IV Input']=np.zeros(n)

# HTE = predicted death probability under treatment minus that under control
HTE_S_learner = S_learner.predict_proba(SandA_all1)[:,1] - S_learner.predict_proba(SandA_all0)[:,1]
HTE_S_learner
array([-0.7935056 , -0.05573883, -0.17789437, -0.01291861, -0.00327331,
       -0.03616324, -0.15807825, -0.02133172, -0.00817926, -0.01291861,
       -0.02133172, -0.03642929, -0.00494878, -0.00934406, -0.00510541,
       -0.00866307, -0.70943365, -0.00494878, -0.05460816, -0.17366209,
       -0.01714046, -0.01291861, -0.00543908, -0.00765503, -0.84943974,
       -0.00934406, -0.00934406, -0.00934406, -0.00543908, -0.35520862,
       -0.4361084 , -0.01714046, -0.01714046, -0.00543908, -0.00494878,
       -0.00866307, -0.00567393, -0.00327331, -0.02133172, -0.00543908,
       -0.00934406, -0.05460816, -0.05460816, -0.00327331, -0.10858887,
       -0.54647282, -0.00934406, -0.01714046, -0.01291861, -0.00327331,
       -0.00934406, -0.26561005, -0.0139195 , -0.02708104, -0.00327331,
       -0.35007447, -0.87203275])

As we can see from the estimated treatment effect of each patient, all of the individual effects are negative: a higher volume of fluid intake is predicted to lower the probability of death within 48 hours for every patient in this subset. Given the small sample size, this uniformly beneficial estimate may partly reflect selection bias in the data, so it should be interpreted with caution when deciding how much fluid to administer.
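Before averaging, it can also be helpful to look at the distribution of the individual effect estimates. The minimal sketch below is an addition (not part of the original analysis) and simply summarizes HTE_S_learner with pandas:

# summary statistics of the estimated individual treatment effects on the death probability
print(pd.Series(HTE_S_learner).describe())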

(S_learner.predict(SandA_all1) - S_learner.predict(SandA_all0))
array([-1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  0.,
        0.,  0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0., -1.])

Specifically, for 6 patients the S-learner’s predicted outcome flips from death under ‘Low-IV-Input’ to survival under ‘High-IV-Input’. In other words, the model advises against switching these 6 patients from the ‘High-IV-Input’ group to the ‘Low-IV-Input’ group, since withholding the additional IV fluid would be expected to increase their risk of death.
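As a quick check (an addition, not in the original notebook), the number of patients whose predicted outcome flips between the two counterfactual assignments can be counted directly:

# count patients whose predicted class differs between the all-treated and all-control scenarios
pred_diff = S_learner.predict(SandA_all1) - S_learner.predict(SandA_all0)
print('Number of patients with a flipped prediction:', int(np.sum(pred_diff != 0)))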

sum(HTE_S_learner)/len(data_CEL_selected)
-0.11361078792742538

Overall, switching from ‘Low-IV-Input’ to ‘High-IV-Input’ is expected to decrease the probability of death within 48 hours by about 11.4 percentage points on average across all patients.

Regard ‘SOFA’ as the outcome variable#

userinfo_index = np.array([1, 3])  # column positions of the covariates Glucose and PaO2_FiO2
# outcome: SOFA score (treated as continuous). The smaller, the better
# treatment: iv_input (binary)
# Glucose, PaO2_FiO2: covariates
data_CEL_selected.head(6)
icustayid Glucose PaO2 PaO2_FiO2 IV Input SOFA Died within 48H
0 1006 152.000000 100.200000 137.081590 1.0 7.600000 0.0
1 1204 138.794872 127.782051 430.668956 1.0 6.153846 1.0
2 4132 129.364286 123.956461 252.883864 1.0 4.600000 0.0
3 4201 145.580087 118.083333 539.065657 1.0 5.818182 1.0
4 5170 174.525000 147.350198 394.616727 1.0 4.125000 1.0
5 6504 106.081169 88.836364 423.030303 0.0 5.090909 1.0

Similarly, we estimate the causal effect of fluid administration on the average SOFA score of patients to see whether a higher IV input is able to decrease the SOFA score.

#from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

#mu0 = LGBMRegressor(max_depth=2)
#mu1 = LGBMRegressor(max_depth=2)

# T-learner: fit separate outcome models on the control group (mu0) and the treatment group (mu1)
mu0 = LinearRegression()
mu1 = LinearRegression()

mu0.fit(data_CEL_selected.iloc[np.where(data_CEL_selected['IV Input']==0)[0],userinfo_index],data_CEL_selected.loc[data_CEL_selected['IV Input']==0,'SOFA'] )
mu1.fit(data_CEL_selected.iloc[np.where(data_CEL_selected['IV Input']==1)[0],userinfo_index],data_CEL_selected.loc[data_CEL_selected['IV Input']==1,'SOFA'] )


# estimate the HTE by T-learner
HTE_T_learner = (mu1.predict(data_CEL_selected.iloc[:,userinfo_index]) - mu0.predict(data_CEL_selected.iloc[:,userinfo_index]))
HTE_T_learner
array([-0.20963072,  0.36363165,  0.45134661,  0.13025089,  0.10068802,
        0.31679702, -0.36382407, -1.07217217,  1.03274754,  0.28356984,
       -0.85172173, -2.90975369, -0.31952385,  0.49851422,  0.1926697 ,
        0.08880954, -0.13806398,  0.91018533, -0.38592831,  0.30080782,
        0.81897977,  1.22512521,  0.56978246,  1.97367999, -1.06606889,
        0.46204494,  0.56211986,  0.36768737,  0.97789444,  0.76346229,
       -0.06702501,  0.3791031 ,  0.73897488,  0.87357018,  1.85450739,
       -0.38658623, -1.55869654, -0.18709482, -1.49089899,  1.05070015,
        0.02144515,  0.07582235,  0.02950435, -0.60493756,  1.34331682,
        0.13339107,  0.74942471,  0.34136664,  2.20697202, -0.89639628,
        0.67296089,  0.52157244,  2.55823332,  0.01110759, -0.05096632,
       -0.13785416, -0.00794219])

Although a higher volume of fluid intake is predicted to decrease the SOFA score for some patients, for the majority of patients it is predicted to increase the score, i.e., to worsen organ function.
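To quantify this split, the fraction of patients with a harmful versus beneficial estimated effect can be computed as below (an addition, not part of the original notebook):

# share of patients for whom higher IV input is predicted to raise or lower the SOFA score
print('Share with increased SOFA (harmful):', np.mean(HTE_T_learner > 0))
print('Share with decreased SOFA (beneficial):', np.mean(HTE_T_learner < 0))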

sum(HTE_T_learner)/len(data_CEL_selected)
0.23241547432382942

Conclusion: IV Input is expected to increase the SOFA score by about 0.232 on average.
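As a final sanity check (an addition, not part of the original analysis), the T-learner estimate can be compared with the naive difference in mean SOFA scores between the two binarized treatment groups:

# naive comparison: unadjusted difference in mean SOFA between the High- and Low-IV-Input groups
naive_diff = (data_CEL_selected.loc[data_CEL_selected['IV Input'] == 1, 'SOFA'].mean()
              - data_CEL_selected.loc[data_CEL_selected['IV Input'] == 0, 'SOFA'].mean())
print('Naive difference in mean SOFA:', naive_diff)
print('T-learner ATE estimate:', HTE_T_learner.mean())

A large gap between these two numbers would indicate that adjusting for Glucose and PaO2_FiO2 materially changes the unadjusted comparison of the two groups.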