MTSS_Comb#
Overview#
Advantage: It is both scalable and robust. Furthermore, it accounts for inter-item heterogeneity.
Disadvantage:
Application Situation: Useful when presenting a list of items, each of which generates a partial outcome (reward). The outcome is continuous, and the feature information is static.
Main Idea#
MTSS_Comb is an instance of the general Thompson Sampling (TS)-based framework, MTSS [1], applied to online combinatorial optimization problems.
Review of MTSS: MTSS [1] is a meta-learning framework designed for large-scale structured bandit problems [2]. Mainly, it is a TS-based algorithm that learns the information-sharing structure while minimizing the cumulative regret. Adapting the TS framework to a problem-specific Bayesian hierarchical model, MTSS simultaneously enables information sharing among items via their features and models the inter-item heterogeneity. Specifically, it assumes that the item-specific parameter \(\theta_i = E[Y_{t}(i)]\) is sampled from a distribution \(g(\theta_i|\boldsymbol{s}_i, \boldsymbol{\gamma})\) instead of being entirely determined by \(\boldsymbol{s}_i\) via a deterministic function. Here, \(g\) is a model parameterized by an unknown vector \(\boldsymbol{\gamma}\). The following is the general feature-based hierarchical model that MTSS considers:
\[
\begin{split}
\boldsymbol{\gamma} &\sim Q(\boldsymbol{\gamma}),\\
\theta_i &\sim g(\theta_i|\boldsymbol{s}_i, \boldsymbol{\gamma}), \quad \forall i,\\
Y_{t}(i) &\sim f(Y_{t}(i)|\theta_i),
\end{split}
\]
where \(Q(\boldsymbol{\gamma})\) is the prior distribution for \(\boldsymbol{\gamma}\) and \(f\) is the distribution of the observed outcome given \(\theta_i\). Overall, MTSS is a general framework that subsumes a wide class of practical problems, is scalable to large systems, and is robust to the specification of the generalization model.
Review of MTSS_Comb: In this tutorial, as an example, we focus on combinatorial semi-bandits with Gaussian outcomes, \(Y_{i, t}\), and consider using a linear mixed model (LMM) as the generalization model to share information. Specifically, the full model is as follows:
\[
\begin{split}
\theta_i &\sim \mathcal{N}(\boldsymbol{s}_i^{T}\boldsymbol{\gamma}, \sigma_1^2), \quad \forall i,\\
Y_{i, t} &\sim \mathcal{N}(\theta_i, \sigma_2^2),
\end{split}
\]
where it is typically assumed that \(\sigma_1\) and \(\sigma_2\) are known. We choose the prior \(\boldsymbol{\gamma} \sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\gamma}}, {\boldsymbol{\Sigma}}_{\boldsymbol{\gamma}})\) with known parameters. It is worth noting that many other outcome distributions (e.g., Bernoulli) and model assumptions (e.g., Gaussian process) can be formulated similarly, depending on the application. Users can directly modify the code for posterior updating to adapt to different model assumptions. Further, for simplicity, we consider the most basic size constraint, under which the action space includes all possible subsets of size \(K\). Therefore, the optimization step of finding the optimal subset \(A_{t}\) reduces to selecting the \(K\) items with the highest attractiveness factors. Of course, users are welcome to modify the optimization function to satisfy more complex constraints.
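To make the generative assumptions concrete, below is a minimal simulation sketch of this LMM. It is not part of the package; the dimensions, the seed, the standard-normal prior, and the use of \(\sigma_1, \sigma_2\) as standard deviations are illustrative assumptions.

import numpy as np

# A minimal sketch of the assumed LMM data-generating process (illustrative values only).
rng = np.random.default_rng(42)
L, p, K = 50, 5, 10                        # number of candidate items, feature dimension, subset size
sigma_1, sigma_2 = 1.0, 1.0                # treated here as standard deviations (assumption)

S = rng.normal(size=(L, p))                # item features s_i (rows of S)
gamma = rng.normal(size=p)                 # gamma drawn from a standard normal prior
theta = rng.normal(S @ gamma, sigma_1)     # theta_i ~ N(s_i^T gamma, sigma_1^2)

A = rng.choice(L, size=K, replace=False)   # a subset of K items
Y = rng.normal(theta[A], sigma_2)          # semi-bandit feedback Y_{i,t} for each chosen item
tot_R = Y.sum()                            # total reward is the sum of the partial outcomes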
Algorithm Details#
Under the LMM assumption, the posteriors can be derived explicitly following Bayes’ theorem. At each round \(t\), given the feedback \(\mathcal{H}_{t}\) received from previous rounds, there are two major steps: posterior sampling and combinatorial optimization. Specifically, the posterior sampling step is decomposed into four sub-steps: 1. updating the posterior distribution of \(\boldsymbol{\gamma}\), \(P(\boldsymbol{\gamma}|\mathcal{H}_{t})\); 2. sampling a \(\tilde{\boldsymbol{\gamma}}\) from \(P(\boldsymbol{\gamma}|\mathcal{H}_{t})\); 3. updating the posterior distribution of \(\boldsymbol{\theta}\) conditional on \(\tilde{\boldsymbol{\gamma}}\), \(P(\boldsymbol{\theta}|\tilde{\boldsymbol{\gamma}},\mathcal{H}_{t})\); 4. sampling \(\tilde{\boldsymbol{\theta}}\) from \(P(\boldsymbol{\theta}|\tilde{\boldsymbol{\gamma}},\mathcal{H}_{t})\). Then, the action \(A_{t}\) is selected greedily as \(A_t = \arg\max_{a \in \mathcal{A}} E(R_t(a) \mid \tilde{\boldsymbol{\theta}})\). Under the simple size constraint, \(A_{t}\) is the list of \(K\) items with the highest \(\tilde{\theta}_{i}\). Note that \(\tilde{\boldsymbol{\gamma}}\) can be sampled in a batch mode to further facilitate computationally efficient online deployment.
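As a reference point for the two update steps, the following sketch works out the conjugate computations for one round under the LMM above. It is an illustrative re-derivation with hypothetical helper names, not the package's implementation: the posterior of \(\boldsymbol{\gamma}\) follows from Gaussian conditioning with \(\boldsymbol{\theta}\) marginalized out, and the posterior of each \(\theta_i\) given \(\tilde{\boldsymbol{\gamma}}\) is a standard normal-normal update.

import numpy as np

def sample_gamma_posterior(S, item_ids, y, mu_gamma, Sigma_gamma, sigma_1, sigma_2, rng):
    # Sample gamma from P(gamma | H_t) by Gaussian conditioning, with theta marginalized out.
    X = S[item_ids]                                   # one feature row per observation
    n = len(y)
    Z = np.zeros((n, S.shape[0]))
    Z[np.arange(n), item_ids] = 1.0                   # maps observations to items (shared theta_i)
    Sigma_y = X @ Sigma_gamma @ X.T + sigma_1**2 * (Z @ Z.T) + sigma_2**2 * np.eye(n)
    G = Sigma_gamma @ X.T @ np.linalg.inv(Sigma_y)
    post_mean = mu_gamma + G @ (y - X @ mu_gamma)
    post_cov = Sigma_gamma - G @ X @ Sigma_gamma
    return rng.multivariate_normal(post_mean, post_cov)

def sample_theta_posterior(S, item_ids, y, gamma_tilde, sigma_1, sigma_2, rng):
    # Sample each theta_i from P(theta_i | gamma_tilde, H_t): a normal-normal conjugate update.
    precision = np.full(S.shape[0], 1 / sigma_1**2)
    weighted_sum = (S @ gamma_tilde) / sigma_1**2
    for i, obs in zip(item_ids, y):                   # accumulate the observations of each item
        precision[i] += 1 / sigma_2**2
        weighted_sum[i] += obs / sigma_2**2
    return rng.normal(weighted_sum / precision, np.sqrt(1 / precision))

# One round, given a history (item_ids, y) of chosen items and their partial outcomes:
# gamma_tilde = sample_gamma_posterior(S, item_ids, y, mu_gamma, Sigma_gamma, sigma_1, sigma_2, rng)
# theta_tilde = sample_theta_posterior(S, item_ids, y, gamma_tilde, sigma_1, sigma_2, rng)
# A_t = np.argsort(theta_tilde)[-K:]                  # the K items with the largest theta_tilde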
Key Steps#
For round \(t = 1,2,\cdots\):
Update \(P(\boldsymbol{\gamma}|\mathcal{H}_{t})\);
Sample \(\tilde{\boldsymbol{\gamma}} \sim P(\boldsymbol{\gamma}|\mathcal{H}_{t})\);
Update \(P(\boldsymbol{\theta}|\tilde{\boldsymbol{\gamma}},\mathcal{H}_{t})\);
Sample \(\tilde{\boldsymbol{\theta}} \sim P(\boldsymbol{\theta}|\tilde{\boldsymbol{\gamma}},\mathcal{H}_{t})\);
Take the action \(A_{t}\) w.r.t. \(\tilde{\boldsymbol{\theta}}\) such that \(A_t = \arg\max_{a \in \mathcal{A}} E(R_t(a) \mid \tilde{\boldsymbol{\theta}})\);
Receive reward \(R_{t}\).
*Notations can be found in either the introduction of the chapter “Structured Bandits” or the introduction of the Combinatorial Semi-Bandit problems.
Demo Code#
Import the learner.#
import numpy as np
from causaldm.learners.CPL4.Structured_Bandits.Combinatorial_Semi import MTSS_Comb
Generate the Environment#
Here, we imitate an environment based on the Adult dataset. The length of horizon, \(T\), is specified as \(500\).
from causaldm.learners.CPL4.Structured_Bandits.Combinatorial_Semi import _env_realComb as _env
env = _env.CombSemi_env(T = 500, seed = 0)
Specify Hyperparameters#
K: number of items to be recommended at each round
L: total number of candidate items
p: number of features (if an intercept is considered, p includes the intercept as well)
sigma_1: \(\sigma_1\)
sigma_2: \(\sigma_2\)
prior_gamma_mean: mean of the Gaussian prior of \(\boldsymbol{\gamma}\)
prior_gamma_cov: the covariance matrix of the Gaussian prior of \(\boldsymbol{\gamma}\)
Xs: feature information \(\boldsymbol{S}\) (Note: if an intercept is considered, \(\boldsymbol{S}\) should include a column of ones)
update_freq: frequency to update the posterior distribution of \(\boldsymbol{\gamma}\) (i.e., update every update_freq steps)
seed: random seed
sigma_1 = 1
sigma_2 = 1
L = env.L
T = 100
K = 10
p = env.p
gamma_prior_mean = np.zeros(env.p)
gamma_prior_cov = np.identity(env.p)
Xs = env.Phi
update_freq = 1
seed = 0
MTSS_agent = MTSS_Comb.MTSS_Semi(sigma_1 = sigma_1, sigma_2 = sigma_2, L=L, T = T, K = K,
gamma_prior_mean = gamma_prior_mean, gamma_prior_cov = gamma_prior_cov,
Xs = Xs, update_freq = update_freq, seed = seed)
Recommendation and Interaction#
We first observe the feature information \(\boldsymbol{S}\) via S = env.Phi. (Note: if an intercept is considered, S should include a column of ones.)
Starting from t = 0, for each step t, there are three steps:
Recommend an action (a set of \(K\) items)
A = MTSS_agent.take_action(S)
Get the reward of each item recommended from the environment
R, _, tot_R = env.get_reward(A, t)
Update the posterior distribution
MTSS_agent.receive_reward(t, A, R, S)
S = env.Phi
t = 0
A = MTSS_agent.take_action(S)
R, _, tot_R = env.get_reward(A, t)
MTSS_agent.receive_reward(t, A, R, S)
t, A, R, tot_R
(0,
array([ 686, 2132, 1114, 689, 1523, 1645, 1733, 2671, 1611, 2099],
dtype=int64),
array([2.1668, 1.9462, 2.5764, 2.4307, 1.9444, 1.9867, 1.846 , 1.504 ,
2.3613, 1.6928]),
20.45535406270607)
Interpretation: For step 0, the agent decides to send the advertisement to 10 potential customers (686, 2132, 1114, 689, 1523, 1645, 1733, 2671, 1611, 2099), and then receives a total reward of \(20.46\).
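The single-step interaction above extends directly to the full horizon. Below is a minimal sketch that runs the agent for all \(T\) rounds and tracks the cumulative total reward, using only the calls already introduced (take_action, get_reward, receive_reward). It assumes a freshly initialized agent and environment, since the demo above has already consumed step t = 0.

# Run the full horizon and track the cumulative total reward,
# assuming env and MTSS_agent have been re-created as above.
cum_reward = 0
cum_rewards = []
S = env.Phi
for t in range(T):
    A = MTSS_agent.take_action(S)           # recommend K items
    R, _, tot_R = env.get_reward(A, t)      # observe partial outcomes and the total reward
    MTSS_agent.receive_reward(t, A, R, S)   # update the posterior with the new feedback
    cum_reward += tot_R
    cum_rewards.append(cum_reward)
cum_rewards[-1]                             # cumulative reward after T rounds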
References#
[1] Wan, R., Ge, L., & Song, R. (2022). Towards Scalable and Robust Structured Bandits: A Meta-Learning Framework. arXiv preprint arXiv:2202.13227.

[2] Van Parys, B., & Golrezaei, N. (2020). Optimal learning for structured bandits. Available at SSRN 3651397.