Single Stage – Paradigm 1

Real Data 1. Movie Lens

Movie Lens is a movie recommendation website that helps users find movies and collects their ratings. The goal of the single-stage causal effect learning analyses is to infer the causal effect of treating users with ‘Drama’ movies versus the control movie genre ‘Sci-Fi’. This serves as an offline evaluation of how much users like or dislike one movie genre relative to the other, and hence indicates which movie genre to recommend so as to maximize users’ satisfaction.

Data Pre-processing

# import related packages
import os
import pickle
import numpy as np

from causaldm.learners.CPL4.CMAB import _env_realCMAB as _env
data = _env.get_movielens()

WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

data.keys()

dict_keys(['Individual', 'Xs', 'mean_ri', 'standardized_Xs'])

data_ML = data['Individual']

userinfo_index = np.array([3,9,11,12,13,14])

users_index = data_ML.keys()
n = len(users_index) # the number of users
movie_generes = ['Comedy', 'Drama', 'Action', 'Thriller', 'Sci-Fi']

import pandas as pd

# initialize the final data we'll use in Causal Effect Learning
data_CEL = {}
for i in movie_generes:
    data_CEL[i] = None

# stack each user's complete records for every genre
for movie_genere in movie_generes:
    for user in users_index:
        data_CEL[movie_genere] = pd.concat([data_CEL[movie_genere], data_ML[user][movie_genere]['complete']])

data_CEL['Comedy']

	user_id	movie_id	rating	age	Comedy	Drama	Action	Thriller	Sci-Fi	gender_M	occupation_academic/educator	occupation_college/grad student	occupation_executive/managerial	occupation_other	occupation_technician/engineer
4220	48	2355.0	4.0	25.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0
14400	48	2918.0	4.0	25.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0
16752	48	2791.0	4.0	25.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0
20195	48	2797.0	4.0	25.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0
21689	48	2321.0	3.0	25.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
393463	5878.0	3299.0	3.0	25.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
395410	5878.0	892.0	5.0	25.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
396058	5878.0	574.0	1.0	25.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
397794	5878.0	1812.0	5.0	25.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
400719	5878.0	3830.0	1.0	25.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0

49563 rows × 15 columns
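
The loop above performs one concatenation per user and genre. As a minimal sketch of an equivalent pattern, the per-user frames can be collected in a list and concatenated once per genre. The nested dictionary `toy_ML` below is a hypothetical stand-in that only mimics the `data_ML[user][genre]['complete']` layout; its values are made up and are not Movie Lens records.

```python
import pandas as pd

# Hypothetical stand-in for data_ML: {user: {genre: {'complete': DataFrame}}}
toy_ML = {
    48: {
        'Drama': {'complete': pd.DataFrame({'user_id': [48, 48], 'rating': [4.0, 5.0]})},
        'Sci-Fi': {'complete': pd.DataFrame({'user_id': [48], 'rating': [3.0]})},
    },
    5878: {
        'Drama': {'complete': pd.DataFrame({'user_id': [5878], 'rating': [2.0]})},
        'Sci-Fi': {'complete': pd.DataFrame({'user_id': [5878, 5878], 'rating': [1.0, 4.0]})},
    },
}
toy_genres = ['Drama', 'Sci-Fi']

# One pd.concat call per genre instead of one per (genre, user) pair
toy_CEL = {
    genre: pd.concat([toy_ML[user][genre]['complete'] for user in toy_ML])
    for genre in toy_genres
}

print(toy_CEL['Drama'])
```

Building the list first avoids repeatedly copying an ever-growing DataFrame, which matters more as the number of users increases.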

data_CEL_all = pd.concat([data_CEL['Drama'], data_CEL['Sci-Fi']]) 
data_CEL_all = data_CEL_all.drop(columns=['Comedy', 'Action', 'Thriller', 'Sci-Fi'])
#data_CEL_all.to_csv("/Users/alinaxu/Documents/CDM/CausalDM/causaldm/data/MovieLens_CEL.csv")
data_CEL_all

	user_id	movie_id	rating	age	Drama	gender_M	occupation_academic/educator	occupation_college/grad student	occupation_executive/managerial	occupation_other	occupation_technician/engineer
14	48	1193.0	4.0	25.0	1.0	1.0	0.0	1.0	0.0	0.0	0.0
11057	48	919.0	4.0	25.0	1.0	1.0	0.0	1.0	0.0	0.0	0.0
25871	48	527.0	5.0	25.0	1.0	1.0	0.0	1.0	0.0	0.0	0.0
31166	48	1721.0	4.0	25.0	1.0	1.0	0.0	1.0	0.0	0.0	0.0
40383	48	150.0	4.0	25.0	1.0	1.0	0.0	1.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...
303406	5878.0	3300.0	2.0	25.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
320275	5878.0	1391.0	1.0	25.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
332011	5878.0	185.0	4.0	25.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
382221	5878.0	2232.0	1.0	25.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
397209	5878.0	426.0	3.0	25.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0

65642 rows × 11 columns

Final Movie Lens Data Selected for Causal Effect Learning (CEL)

After pre-processing, the complete data contains 65,642 movie-watching records from 175 individuals. We set treatment \(A=1\) when the user chooses a ‘Drama’ movie, and \(A=0\) if the movie belongs to ‘Sci-Fi’.

The processed data is saved in ‘causaldm/data/MovieLens_CEL.csv’ and will be directly used in later subsections.
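
As a rough illustration of how the treatment indicator maps onto the processed table, the sketch below builds \(A\) from the `Drama` column and compares mean ratings between the two arms. The column names `rating` and `Drama` follow the table shown above; the toy values are made up for illustration, and the resulting difference in means is only a naive contrast, not a causal estimate, since it does not adjust for user characteristics.

```python
import pandas as pd

# Toy records using the column names of data_CEL_all (values are made up)
toy = pd.DataFrame({
    'user_id': [48, 48, 48, 5878, 5878, 5878],
    'rating':  [4.0, 5.0, 3.0, 2.0, 1.0, 4.0],
    'Drama':   [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
})

# Treatment A = 1 for 'Drama', A = 0 for 'Sci-Fi'
toy['A'] = toy['Drama'].astype(int)

# Naive contrast: mean rating under A = 1 minus mean rating under A = 0
mean_by_arm = toy.groupby('A')['rating'].mean()
naive_diff = mean_by_arm.loc[1] - mean_by_arm.loc[0]

print(mean_by_arm)
print(naive_diff)
```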

Real Data 2. Mimic3

https://www.kaggle.com/datasets/asjad99/mimiciii

Mimic3 is a large, open-access, anonymized, single-center database that contains comprehensive clinical data from 61,532 critical care admissions collected at a Boston teaching hospital from 2001 to 2012. The dataset used here consists of 47 features (including demographics, vitals, and lab test results) on a cohort of sepsis patients who meet the Sepsis-3 definition criteria.

In causal effect learning, we try to estimate the treatment effect of a specific intervention (e.g., use of a ventilator) on the patient, either conditional on a particular patient’s characteristics and physiological information, or averaged over all patients as a whole.
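
To make the two targets concrete, the sketch below contrasts a conditional effect at a given state with the average effect over all units. It uses simulated data rather than Mimic3, and the symbols `S` (state), `A` (binary action), `R` (reward), the linear outcome model with an interaction term, and the helper `hte` are all assumptions chosen for illustration, not the estimators used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data (not Mimic3): one state feature, a binary action, a reward
S = rng.normal(size=n)
A = rng.integers(0, 2, size=n)
# By construction the effect of A at state s is 1 + 0.5 * s, so the true average effect is 1
R = 2.0 + S + A * (1.0 + 0.5 * S) + rng.normal(scale=0.1, size=n)

# Outcome regression R ~ 1 + S + A + A*S fitted by ordinary least squares
X = np.column_stack([np.ones(n), S, A, A * S])
beta, *_ = np.linalg.lstsq(X, R, rcond=None)

def hte(s):
    """Estimated effect at state s: fitted mean under A=1 minus fitted mean under A=0."""
    return beta[2] + beta[3] * s

# Average effect: mean of the conditional effects over the observed states
ate = hte(S).mean()

print(hte(1.0), ate)
```

Because the action is randomized in this simulation, the regression contrast recovers the effect; with observational data such as Mimic3, the same contrast is only meaningful if the state variables capture the relevant confounders.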

The original Mimic3 data was loaded from mimic3_sepsis_data.csv. For illustration purposes, we selected several representative features for the following analysis.

Data Pre-processing

# import related packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
#from causaldm.data import mimic3_sepsis_data

# Get data
mimic3_data = pd.read_csv("mimic3_sepsis_data.csv")
mimic3_data.head(6)

	bloc	icustayid	charttime	gender	age	elixhauser	re_admission	died_in_hosp	died_within_48h_of_out_time	mortality_90d	...	input_total	input_4hourly	output_total	output_4hourly	cumulated_balance	SOFA	SIRS	iv_input	reward
0	1	3	7245486000	0	17639.826435	0	0	0	0	1	...	6527.0000	50.0	13617.0	520.0	-7090.0000	5	1	2.0	-0.884898
1	1	11	6898241400	1	30766.069028	6	1	0	0	0	...	0.0000	0.0	0.0	0.0	0.0000	12	0	0.0	0.383136
2	1	12	5805732000	1	12049.217303	0	0	0	0	0	...	0.0000	0.0	0.0	0.0	0.0000	4	2	0.0	0.976040
3	1	14	4264269300	0	30946.970000	2	0	0	0	1	...	1300.0000	1300.0	340.0	160.0	960.0000	5	2	4.0	0.125000
4	1	30	5707825200	0	19793.588912	6	0	0	0	0	...	9552.0000	50.0	6830.0	540.0	2722.0000	6	2	2.0	0.457625
5	1	33	7214122800	0	24524.747419	5	0	1	1	1	...	10661.0483	725.0	5746.0	360.0	4915.0483	4	0	4.0	1.049099

6 rows × 62 columns

selected = ['Glucose','paO2','PaO2_FiO2',  'iv_input', 'SOFA','reward']
n = 5000
mimic3_data_selected = mimic3_data[:n][selected]
mimic3_data_selected

	Glucose	paO2	PaO2_FiO2	iv_input	SOFA	reward
0	84.000000	84.000000	168.000000	2.0	5	-0.884898
1	122.000000	59.444444	198.148148	0.0	12	0.383136
2	125.000000	192.000000	690.647482	0.0	4	0.976040
3	110.727273	179.000000	447.499993	4.0	5	0.125000
4	187.000000	125.000000	347.222222	2.0	6	0.457625
...	...	...	...	...	...	...
4995	121.375000	136.787683	206.005547	3.0	4	-1.965110
4996	108.000000	62.333333	143.846153	0.0	11	-0.025000
4997	106.000000	258.500000	923.214286	0.0	7	0.402531
4998	144.000000	376.000000	752.000000	1.0	4	-0.172130
4999	113.000000	108.000000	269.999996	4.0	5	-0.025000

5000 rows × 6 columns

userinfo_index = np.array([0,1,2,4]) # record all the indices of patients' information
SandA = mimic3_data_selected.iloc[:, np.array([0,1,2,3,4])]

# work on a copy so that mimic3_data_selected keeps the original dosage levels
data_CEL_selected = mimic3_data_selected.copy()
# change the discrete action to binary: only the iv_input column is overwritten
data_CEL_selected.loc[data_CEL_selected['iv_input'] != 0, 'iv_input'] = 1
data_CEL_selected.head(5)

	Glucose	paO2	PaO2_FiO2	iv_input	SOFA	reward
0	84.000000	84.000000	168.000000	1.0	5	-0.884898
1	122.000000	59.444444	198.148148	0.0	12	0.383136
2	125.000000	192.000000	690.647482	0.0	4	0.976040
3	110.727273	179.000000	447.499993	1.0	5	0.125000
4	187.000000	125.000000	347.222222	1.0	6	0.457625

Final Mimic3 Data Selected for Causal Effect Learning (CEL)

After pre-processing, we selected 4 features as the state variables in CEL, which represent the baseline information of the patients:

Glucose: the glucose value of the patient.
paO2: the partial pressure of oxygen.
PaO2_FiO2: the ratio of the partial pressure of oxygen (PaO2) to the fraction of oxygen delivered (FiO2).
SOFA: the Sepsis-related Organ Failure Assessment score, which describes organ dysfunction/failure.

The action variable is iv_input, which denotes the volume of fluids that have been administered to the patient. We set all non-zero iv_input values to \(1\) to create a binary action space.

The last column, reward, denotes the reward we computed from several aspects of the patient's status.
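
The state, action, and reward described above can be pulled out of the table as separate arrays. The sketch below is one possible way to do this, assuming the column names of `mimic3_data_selected`; the three rows are copied from the first rows of the table displayed earlier, and the names `state_cols`, `S`, `A`, and `R` are introduced here for illustration only.

```python
import pandas as pd

# First three rows of the selected Mimic3 columns, as displayed earlier
toy = pd.DataFrame({
    'Glucose':   [84.0, 122.0, 125.0],
    'paO2':      [84.0, 59.444444, 192.0],
    'PaO2_FiO2': [168.0, 198.148148, 690.647482],
    'iv_input':  [2.0, 0.0, 0.0],
    'SOFA':      [5, 12, 4],
    'reward':    [-0.884898, 0.383136, 0.976040],
})

state_cols = ['Glucose', 'paO2', 'PaO2_FiO2', 'SOFA']

S = toy[state_cols].to_numpy()                      # baseline state, shape (n, 4)
A = (toy['iv_input'] != 0).astype(int).to_numpy()   # binary action: 1 if any fluid was given
R = toy['reward'].to_numpy()                        # observed reward

print(S.shape, A, R)
```

Deriving the binary action from a comparison, rather than overwriting values in place, leaves the original dosage levels in `iv_input` available for later use.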
