Single Stage – Paradigm 1

Real Data 1. MovieLens

MovieLens is a movie recommendation website that helps users find movies and collects their ratings. The goal of this single-stage causal effect learning analysis is to infer the causal effect on the rating of a user watching a ‘Drama’ movie (the treatment) versus a ‘Sci-Fi’ movie (the control). This serves as an offline evaluation of how much users like one movie genre relative to the other, and hence gives a general indication of which genre to recommend in order to maximize users’ satisfaction.

Data Pre-processing

# import related packages
import os
import pickle
import numpy as np

from causaldm.learners.CPL4.CMAB import _env_realCMAB as _env
data = _env.get_movielens()
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
data.keys()
dict_keys(['Individual', 'Xs', 'mean_ri', 'standardized_Xs'])
data_ML = data['Individual']
userinfo_index = np.array([3,9,11,12,13,14])

users_index = data_ML.keys()
n = len(users_index) # the number of users
movie_generes = ['Comedy', 'Drama', 'Action', 'Thriller', 'Sci-Fi']

import pandas as pd

# build the final data we'll use in Causal Effect Learning:
# for each genre, stack every user's complete rating records into one DataFrame
data_CEL = {}
for movie_genere in movie_generes:
    data_CEL[movie_genere] = pd.concat(
        [data_ML[user][movie_genere]['complete'] for user in users_index]
    )
data_CEL['Comedy']
user_id movie_id rating age Comedy Drama Action Thriller Sci-Fi gender_M occupation_academic/educator occupation_college/grad student occupation_executive/managerial occupation_other occupation_technician/engineer
4220 48 2355.0 4.0 25.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
14400 48 2918.0 4.0 25.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
16752 48 2791.0 4.0 25.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
20195 48 2797.0 4.0 25.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
21689 48 2321.0 3.0 25.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
393463 5878.0 3299.0 3.0 25.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
395410 5878.0 892.0 5.0 25.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
396058 5878.0 574.0 1.0 25.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
397794 5878.0 1812.0 5.0 25.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
400719 5878.0 3830.0 1.0 25.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

49563 rows × 15 columns

data_CEL_all = pd.concat([data_CEL['Drama'], data_CEL['Sci-Fi']]) 
data_CEL_all = data_CEL_all.drop(columns=['Comedy', 'Action', 'Thriller', 'Sci-Fi'])
#data_CEL_all.to_csv("/Users/alinaxu/Documents/CDM/CausalDM/causaldm/data/MovieLens_CEL.csv")
data_CEL_all
user_id movie_id rating age Drama gender_M occupation_academic/educator occupation_college/grad student occupation_executive/managerial occupation_other occupation_technician/engineer
14 48 1193.0 4.0 25.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
11057 48 919.0 4.0 25.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
25871 48 527.0 5.0 25.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
31166 48 1721.0 4.0 25.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
40383 48 150.0 4.0 25.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
303406 5878.0 3300.0 2.0 25.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
320275 5878.0 1391.0 1.0 25.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
332011 5878.0 185.0 4.0 25.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
382221 5878.0 2232.0 1.0 25.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
397209 5878.0 426.0 3.0 25.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

65642 rows × 11 columns

Final MovieLens Data Selected for Causal Effect Learning (CEL)

After pre-processing, the complete data contains 65,642 movie-rating records from 175 individuals. We set the treatment \(A=1\) when the user chooses a ‘Drama’ movie and \(A=0\) when the movie belongs to ‘Sci-Fi’; this is the ‘Drama’ column of data_CEL_all.

The processed data is saved in ‘causaldm/data/MovieLens_CEL.csv’ and will be directly used in later subsections.
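As a rough illustration of the treatment coding described above, the sketch below computes a naive difference in mean ratings between \(A=1\) and \(A=0\). The toy values and the helper name `naive_mean_difference` are assumptions made for illustration only. This is not the estimator used in later subsections, and it makes no adjustment for user covariates.

```python
import pandas as pd

# Toy records (made-up values) mimicking three columns of data_CEL_all:
# 'Drama' plays the role of the treatment A, 'rating' is the outcome.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "rating":  [4.0, 3.0, 5.0, 2.0, 3.0, 4.0],
    "Drama":   [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

def naive_mean_difference(df, treatment="Drama", outcome="rating"):
    """Mean outcome under A=1 minus mean outcome under A=0 (no covariate adjustment)."""
    treated = df.loc[df[treatment] == 1, outcome]
    control = df.loc[df[treatment] == 0, outcome]
    return treated.mean() - control.mean()

diff = naive_mean_difference(toy)
print(diff)
```

A raw difference like this is only a descriptive starting point, because users who choose ‘Drama’ may differ systematically from those who choose ‘Sci-Fi’.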

Real Data 2. Mimic3

https://www.kaggle.com/datasets/asjad99/mimiciii

Mimic3 is a large, open-access, anonymized, single-center database of comprehensive clinical data from 61,532 critical care admissions collected at a Boston teaching hospital between 2001 and 2012. The dataset used here consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.

In causal effect learning, we try to estimate the treatment effect of a specific intervention (e.g., use of a ventilator) on patients, either for a particular patient given their characteristics and physiological information, or for all patients as a whole.
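These two targets can be written in potential-outcome notation. The symbols here are assumptions introduced for illustration and are not taken from the source: \(S\) denotes the patient's characteristics, \(A \in \{0, 1\}\) the intervention, and \(R(a)\) the outcome that would be observed under \(A=a\).

\[
\text{ATE} = \mathbb{E}\left[R(1) - R(0)\right], \qquad \text{HTE}(s) = \mathbb{E}\left[R(1) - R(0) \mid S = s\right].
\]

The first quantity averages the effect over all patients, while the second conditions on a particular patient's characteristics \(S = s\).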

The original Mimic3 data is loaded from mimic3_sepsis_data.csv. For illustration purposes, we select several representative features for the following analysis.

Data Pre-processing

# import related packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
#from causaldm.data import mimic3_sepsis_data
# Get data
mimic3_data = pd.read_csv("mimic3_sepsis_data.csv")
mimic3_data.head(6)
bloc icustayid charttime gender age elixhauser re_admission died_in_hosp died_within_48h_of_out_time mortality_90d ... input_total input_4hourly output_total output_4hourly cumulated_balance SOFA SIRS vaso_input iv_input reward
0 1 3 7245486000 0 17639.826435 0 0 0 0 1 ... 6527.0000 50.0 13617.0 520.0 -7090.0000 5 1 0.0 2.0 -0.884898
1 1 11 6898241400 1 30766.069028 6 1 0 0 0 ... 0.0000 0.0 0.0 0.0 0.0000 12 0 0.0 0.0 0.383136
2 1 12 5805732000 1 12049.217303 0 0 0 0 0 ... 0.0000 0.0 0.0 0.0 0.0000 4 2 0.0 0.0 0.976040
3 1 14 4264269300 0 30946.970000 2 0 0 0 1 ... 1300.0000 1300.0 340.0 160.0 960.0000 5 2 0.0 4.0 0.125000
4 1 30 5707825200 0 19793.588912 6 0 0 0 0 ... 9552.0000 50.0 6830.0 540.0 2722.0000 6 2 0.0 2.0 0.457625
5 1 33 7214122800 0 24524.747419 5 0 1 1 1 ... 10661.0483 725.0 5746.0 360.0 4915.0483 4 0 0.0 4.0 1.049099

6 rows × 62 columns

selected = ['Glucose','paO2','PaO2_FiO2',  'iv_input', 'SOFA','reward']
n = 5000
mimic3_data_selected = mimic3_data[:n][selected]
mimic3_data_selected
Glucose paO2 PaO2_FiO2 iv_input SOFA reward
0 84.000000 84.000000 168.000000 2.0 5 -0.884898
1 122.000000 59.444444 198.148148 0.0 12 0.383136
2 125.000000 192.000000 690.647482 0.0 4 0.976040
3 110.727273 179.000000 447.499993 4.0 5 0.125000
4 187.000000 125.000000 347.222222 2.0 6 0.457625
... ... ... ... ... ... ...
4995 121.375000 136.787683 206.005547 3.0 4 -1.965110
4996 108.000000 62.333333 143.846153 0.0 11 -0.025000
4997 106.000000 258.500000 923.214286 0.0 7 0.402531
4998 144.000000 376.000000 752.000000 1.0 4 -0.172130
4999 113.000000 108.000000 269.999996 4.0 5 -0.025000

5000 rows × 6 columns

userinfo_index = np.array([0,1,2,4]) # record all the indices of patients' information
SandA = mimic3_data_selected.iloc[:, np.array([0,1,2,3,4])]

# work on a copy so the selected data itself is left unchanged
data_CEL_selected = mimic3_data_selected.copy()
# change the discrete action to binary: only the iv_input column is modified
data_CEL_selected.loc[data_CEL_selected['iv_input'] != 0, 'iv_input'] = 1
data_CEL_selected.head(5)
Glucose paO2 PaO2_FiO2 iv_input SOFA reward
0 84.000000 84.000000 168.000000 1.0 5 -0.884898
1 122.000000 59.444444 198.148148 0.0 12 0.383136
2 125.000000 192.000000 690.647482 0.0 4 0.976040
3 110.727273 179.000000 447.499993 1.0 5 0.125000
4 187.000000 125.000000 347.222222 1.0 6 0.457625

Final Mimic3 Data Selected for Causal Effect Learning (CEL)

After pre-processing, we selected four features as the state variables in CEL, which represent the baseline information of the patients:

  • Glucose: glucose values of patients

  • paO2: The partial pressure of oxygen

  • PaO2_FiO2: The ratio of the partial pressure of oxygen (PaO2) to the fraction of oxygen delivered (FiO2).

  • SOFA: Sepsis-related Organ Failure Assessment score to describe organ dysfunction/failure.

The action variable is iv_input, which denotes the volume of fluids that have been administered to the patient. Additionally, we set all non-zero iv_input values to \(1\) to create a binary action space.

The last column, reward, is the outcome: a score evaluated from several aspects of the patient's status.
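The three roles described above (state, binary action, reward) can be sketched as separate arrays. This is a minimal sketch on made-up toy rows that reuse the column names of the selected data. The names `S`, `A`, `R`, and `state_cols` are assumptions chosen for illustration, not identifiers from the package.

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as the selected Mimic3 data.
toy = pd.DataFrame({
    "Glucose":   [90.0, 120.0, 150.0, 110.0, 130.0, 100.0],
    "paO2":      [80.0, 60.0, 190.0, 170.0, 120.0, 100.0],
    "PaO2_FiO2": [170.0, 200.0, 690.0, 450.0, 350.0, 270.0],
    "iv_input":  [2.0, 0.0, 0.0, 4.0, 1.0, 0.0],
    "SOFA":      [5, 12, 4, 5, 6, 7],
    "reward":    [-0.9, 0.4, 1.0, 0.1, 0.5, 0.0],
})

state_cols = ["Glucose", "paO2", "PaO2_FiO2", "SOFA"]

S = toy[state_cols].to_numpy(dtype=float)           # state: baseline patient information
A = (toy["iv_input"] != 0).astype(int).to_numpy()   # action: 1 if any IV fluid was given, else 0
R = toy["reward"].to_numpy(dtype=float)             # reward: outcome column

print(S.shape, A.tolist(), R.shape)
```

Deriving `A` with a comparison leaves the original `iv_input` column untouched, so the raw fluid volumes remain available if a finer action space is needed later.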