Single Stage – Paradigm 1
Real Data 1. MovieLens
MovieLens is a movie recommendation website that helps users find movies and collects their ratings. The goal of this single-stage causal effect learning study is to infer the causal effect of showing users a ‘Drama’ movie (treatment) versus a ‘Sci-Fi’ movie (control). This serves as an offline evaluation of how much users like one movie genre relative to the other, and hence gives a general indication of which genre to recommend in order to maximize user satisfaction.
Data Pre-processing
# import related packages
import os
import pickle
import numpy as np
from causaldm.learners.CPL4.CMAB import _env_realCMAB as _env
data = _env.get_movielens()
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
data.keys()
dict_keys(['Individual', 'Xs', 'mean_ri', 'standardized_Xs'])
data_ML = data['Individual']
userinfo_index = np.array([3,9,11,12,13,14])
users_index = data_ML.keys()
n = len(users_index) # the number of users
movie_generes = ['Comedy', 'Drama', 'Action', 'Thriller', 'Sci-Fi']
data_CEL = {}
# initialize the final data we'll use in Causal Effect Learning
for i in movie_generes:
    data_CEL[i] = None
import pandas as pd
# stack every user's complete records, genre by genre
for movie_genere in movie_generes:
    for user in users_index:
        data_CEL[movie_genere] = pd.concat([data_CEL[movie_genere], data_ML[user][movie_genere]['complete']])
data_CEL['Comedy']
(index) | user_id | movie_id | rating | age | Comedy | Drama | Action | Thriller | Sci-Fi | gender_M | occupation_academic/educator | occupation_college/grad student | occupation_executive/managerial | occupation_other | occupation_technician/engineer |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4220 | 48 | 2355.0 | 4.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
14400 | 48 | 2918.0 | 4.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
16752 | 48 | 2791.0 | 4.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
20195 | 48 | 2797.0 | 4.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
21689 | 48 | 2321.0 | 3.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
393463 | 5878.0 | 3299.0 | 3.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
395410 | 5878.0 | 892.0 | 5.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
396058 | 5878.0 | 574.0 | 1.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
397794 | 5878.0 | 1812.0 | 5.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
400719 | 5878.0 | 3830.0 | 1.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
49563 rows × 15 columns
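The loop above calls `pd.concat` once per user, which copies the accumulated frame on every iteration. A minimal sketch of an equivalent version that concatenates once per genre is shown below. The nested dictionary `toy_ML` is a hypothetical stand-in that only mimics the `data_ML[user][genre]['complete']` layout; its values are invented for illustration and are not MovieLens data.

```python
import pandas as pd

# Hypothetical stand-in for data_ML: user -> genre -> {'complete': DataFrame}
toy_ML = {
    48: {
        "Drama": {"complete": pd.DataFrame({"user_id": [48, 48], "rating": [4.0, 5.0]})},
        "Sci-Fi": {"complete": pd.DataFrame({"user_id": [48], "rating": [3.0]})},
    },
    5878: {
        "Drama": {"complete": pd.DataFrame({"user_id": [5878], "rating": [2.0]})},
        "Sci-Fi": {"complete": pd.DataFrame({"user_id": [5878, 5878], "rating": [1.0, 4.0]})},
    },
}
genres = ["Drama", "Sci-Fi"]

# Collect every user's frame for a genre, then concatenate once per genre
toy_CEL = {
    genre: pd.concat([toy_ML[user][genre]["complete"] for user in toy_ML])
    for genre in genres
}

print(toy_CEL["Drama"].shape)
```

As in the original loop, the row index of each per-user frame is preserved, so the resulting index is not contiguous.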
data_CEL_all = pd.concat([data_CEL['Drama'], data_CEL['Sci-Fi']])
data_CEL_all = data_CEL_all.drop(columns=['Comedy', 'Action', 'Thriller', 'Sci-Fi'])
#data_CEL_all.to_csv("/Users/alinaxu/Documents/CDM/CausalDM/causaldm/data/MovieLens_CEL.csv")
data_CEL_all
(index) | user_id | movie_id | rating | age | Drama | gender_M | occupation_academic/educator | occupation_college/grad student | occupation_executive/managerial | occupation_other | occupation_technician/engineer |
---|---|---|---|---|---|---|---|---|---|---|---|
14 | 48 | 1193.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
11057 | 48 | 919.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
25871 | 48 | 527.0 | 5.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
31166 | 48 | 1721.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
40383 | 48 | 150.0 | 4.0 | 25.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
303406 | 5878.0 | 3300.0 | 2.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
320275 | 5878.0 | 1391.0 | 1.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
332011 | 5878.0 | 185.0 | 4.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
382221 | 5878.0 | 2232.0 | 1.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
397209 | 5878.0 | 426.0 | 3.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
65642 rows × 11 columns
Final MovieLens Data Selected for Causal Effect Learning (CEL)
After pre-processing, the complete data contains 65,642 movie-watching records from 175 individuals. We set the treatment \(A=1\) when the user chooses a ‘Drama’ movie, and \(A=0\) when the movie belongs to ‘Sci-Fi’; this is the value stored in the ‘Drama’ column.
The processed data is saved in ‘causaldm/data/MovieLens_CEL.csv’ and will be directly used in later subsections.
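As a rough illustration of how the processed table maps onto a treatment and an outcome, the sketch below builds \(A\) from the ‘Drama’ column and uses the rating as the outcome, then computes the unadjusted difference in mean ratings. The frame `toy` is hypothetical (its values are invented, not taken from MovieLens), and this naive contrast ignores confounding by user covariates, so it should not be read as the estimator used in later subsections.

```python
import numpy as np
import pandas as pd

# Hypothetical rows sharing two key columns with the processed table
toy = pd.DataFrame({
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0, 1.0],
    "Drama":  [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],  # 1 = Drama, 0 = Sci-Fi
})

A = toy["Drama"].to_numpy().astype(int)  # treatment indicator
R = toy["rating"].to_numpy()             # outcome

# Unadjusted difference in mean rating between Drama and Sci-Fi rows
naive_diff = R[A == 1].mean() - R[A == 0].mean()
print(naive_diff)
```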
Real Data 2. Mimic3
https://www.kaggle.com/datasets/asjad99/mimiciii
Mimic3 is a large, open-access, anonymized, single-center database containing comprehensive clinical data from 61,532 critical care admissions collected at a Boston teaching hospital between 2001 and 2012. The dataset consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.
In causal effect learning, we estimate the effect of a specific intervention (e.g., use of a ventilator) on patients, either conditional on a particular patient’s characteristics and physiological information, or averaged over all patients.
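These two targets can be written compactly in potential-outcome notation. The notation is an assumption made here for illustration, since this section does not fix one: \(S\) denotes the patient's state, \(A \in \{0, 1\}\) the intervention, and \(R(a)\) the reward that would be observed under action \(a\). The population-level target is the average treatment effect (ATE), and the patient-specific target is the heterogeneous treatment effect (HTE):

\[
\text{ATE} = \mathbb{E}\left[R(1) - R(0)\right], \qquad
\text{HTE}(s) = \mathbb{E}\left[R(1) - R(0) \mid S = s\right].
\]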
The original Mimic3 data was loaded from mimic3_sepsis_data.csv. For illustration purposes, we selected several representative features for the following analysis.
Data Pre-processing
# import related packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
#from causaldm.data import mimic3_sepsis_data
# Get data
mimic3_data = pd.read_csv("mimic3_sepsis_data.csv")
mimic3_data.head(6)
(index) | bloc | icustayid | charttime | gender | age | elixhauser | re_admission | died_in_hosp | died_within_48h_of_out_time | mortality_90d | ... | input_total | input_4hourly | output_total | output_4hourly | cumulated_balance | SOFA | SIRS | vaso_input | iv_input | reward |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 7245486000 | 0 | 17639.826435 | 0 | 0 | 0 | 0 | 1 | ... | 6527.0000 | 50.0 | 13617.0 | 520.0 | -7090.0000 | 5 | 1 | 0.0 | 2.0 | -0.884898 |
1 | 1 | 11 | 6898241400 | 1 | 30766.069028 | 6 | 1 | 0 | 0 | 0 | ... | 0.0000 | 0.0 | 0.0 | 0.0 | 0.0000 | 12 | 0 | 0.0 | 0.0 | 0.383136 |
2 | 1 | 12 | 5805732000 | 1 | 12049.217303 | 0 | 0 | 0 | 0 | 0 | ... | 0.0000 | 0.0 | 0.0 | 0.0 | 0.0000 | 4 | 2 | 0.0 | 0.0 | 0.976040 |
3 | 1 | 14 | 4264269300 | 0 | 30946.970000 | 2 | 0 | 0 | 0 | 1 | ... | 1300.0000 | 1300.0 | 340.0 | 160.0 | 960.0000 | 5 | 2 | 0.0 | 4.0 | 0.125000 |
4 | 1 | 30 | 5707825200 | 0 | 19793.588912 | 6 | 0 | 0 | 0 | 0 | ... | 9552.0000 | 50.0 | 6830.0 | 540.0 | 2722.0000 | 6 | 2 | 0.0 | 2.0 | 0.457625 |
5 | 1 | 33 | 7214122800 | 0 | 24524.747419 | 5 | 0 | 1 | 1 | 1 | ... | 10661.0483 | 725.0 | 5746.0 | 360.0 | 4915.0483 | 4 | 0 | 0.0 | 4.0 | 1.049099 |
6 rows × 62 columns
selected = ['Glucose','paO2','PaO2_FiO2', 'iv_input', 'SOFA','reward']
n = 5000
mimic3_data_selected = mimic3_data[:n][selected]
mimic3_data_selected
(index) | Glucose | paO2 | PaO2_FiO2 | iv_input | SOFA | reward |
---|---|---|---|---|---|---|
0 | 84.000000 | 84.000000 | 168.000000 | 2.0 | 5 | -0.884898 |
1 | 122.000000 | 59.444444 | 198.148148 | 0.0 | 12 | 0.383136 |
2 | 125.000000 | 192.000000 | 690.647482 | 0.0 | 4 | 0.976040 |
3 | 110.727273 | 179.000000 | 447.499993 | 4.0 | 5 | 0.125000 |
4 | 187.000000 | 125.000000 | 347.222222 | 2.0 | 6 | 0.457625 |
... | ... | ... | ... | ... | ... | ... |
4995 | 121.375000 | 136.787683 | 206.005547 | 3.0 | 4 | -1.965110 |
4996 | 108.000000 | 62.333333 | 143.846153 | 0.0 | 11 | -0.025000 |
4997 | 106.000000 | 258.500000 | 923.214286 | 0.0 | 7 | 0.402531 |
4998 | 144.000000 | 376.000000 | 752.000000 | 1.0 | 4 | -0.172130 |
4999 | 113.000000 | 108.000000 | 269.999996 | 4.0 | 5 | -0.025000 |
5000 rows × 6 columns
userinfo_index = np.array([0,1,2,4]) # record all the indices of patients' information
SandA = mimic3_data_selected.iloc[:, np.array([0,1,2,3,4])]
# work on a copy so that mimic3_data_selected is left unchanged
data_CEL_selected = mimic3_data_selected.copy()
# change the discrete action to binary: only the iv_input column is set to 1 where it is non-zero
# (indexing whole rows with iloc[rows, :] = 1 would overwrite every feature and the reward as well)
data_CEL_selected.loc[data_CEL_selected['iv_input'] != 0, 'iv_input'] = 1
data_CEL_selected.head(5)
(index) | Glucose | paO2 | PaO2_FiO2 | iv_input | SOFA | reward |
---|---|---|---|---|---|---|
0 | 84.000000 | 84.000000 | 168.000000 | 1.0 | 5 | -0.884898 |
1 | 122.000000 | 59.444444 | 198.148148 | 0.0 | 12 | 0.383136 |
2 | 125.000000 | 192.000000 | 690.647482 | 0.0 | 4 | 0.976040 |
3 | 110.727273 | 179.000000 | 447.499993 | 1.0 | 5 | 0.125000 |
4 | 187.000000 | 125.000000 | 347.222222 | 1.0 | 6 | 0.457625 |
Final Mimic3 Data Selected for Causal Effect Learning (CEL)
After pre-processing, we selected four features as the state variables in CEL, which represent the baseline information of the patients:
Glucose: glucose values of patients
paO2: The partial pressure of oxygen
PaO2_FiO2: The partial pressure of oxygen (PaO2)/fraction of oxygen delivered (FIO2) ratio.
SOFA: Sepsis-related Organ Failure Assessment score to describe organ dysfunction/failure.
The action variable is iv_input, which denotes the volume of fluids that have been administered to the patient. We set all non-zero iv_input values to \(1\) to create a binary action space.
The last column, reward, is the outcome, which was evaluated from several aspects of the patient's status.
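To make the roles of state, action, and reward concrete, the sketch below splits a frame with these column names into \(S\), \(A\), and \(R\), and fits one linear outcome model of the reward on an intercept, the state variables, and the action by least squares. The coefficient on the action is then a crude regression-adjusted effect estimate. Everything here is an assumption for illustration: the data is synthetic (generated with a known effect of 0.5, not drawn from Mimic3), and the linear model is a simple stand-in, not necessarily the estimator used in later subsections.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in with the same column names as the selected features
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 20, n),
    "paO2": rng.normal(130, 40, n),
    "PaO2_FiO2": rng.normal(300, 100, n),
    "SOFA": rng.integers(0, 15, n),
    "iv_input": rng.integers(0, 5, n).astype(float),
})

# Binarize the action: any non-zero fluid volume becomes 1
toy["iv_input"] = (toy["iv_input"] != 0).astype(int)

# Synthetic reward with a true action effect of 0.5 (assumed for illustration)
toy["reward"] = 0.5 * toy["iv_input"] - 0.05 * toy["SOFA"] + rng.normal(0, 0.1, n)

S = toy[["Glucose", "paO2", "PaO2_FiO2", "SOFA"]].to_numpy(dtype=float)  # state
A = toy["iv_input"].to_numpy(dtype=float)                                # action
R = toy["reward"].to_numpy()                                             # reward

# Design matrix: intercept, state variables, action
X = np.column_stack([np.ones(n), S, A])
coef, *_ = np.linalg.lstsq(X, R, rcond=None)

effect_hat = coef[-1]  # coefficient on the binary action
print(round(effect_hat, 2))
```

Because the synthetic reward is linear in the state and the action, the fitted coefficient should land close to the assumed effect; on real data the estimate is only as trustworthy as the outcome model and the assumption of no unmeasured confounding.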