CascadeLinTS#

Overview#

  • Advantage: It is scalable because it leverages feature information, and in practice it outperforms algorithms based on other frameworks, such as UCB.

  • Disadvantage: It is susceptible to model misspecification.

  • Application Situation: Useful when presenting a ranked list of items, of which at most one is selected at each interaction. The outcome is binary.

Main Idea#

Motivated by the observation that most real-world applications involve a large number of candidate items, Zong et al. (2016) proposed using the feature information that is widely available to improve learning efficiency. Utilizing the feature information of each item \(i\), CascadeLinTS [1] characterizes \(\theta_{i}=E[W_t(i)]\) by assuming that

(139)#\[\begin{equation} \theta_{i} = logistic(\boldsymbol{s}_{i,t}^T \boldsymbol{\gamma}), \end{equation}\]

where \(logistic(c) = \frac{1}{1 + \exp(-c)}\), \(\boldsymbol{s}_{i,t}\) is the feature vector of item \(i\) at round \(t\), and \(\boldsymbol{\gamma}\) is an unknown coefficient vector shared across items.

Similar to the Thompson Sampling algorithm for generalized linear bandits [2], we approximate the posterior distribution of \(\boldsymbol{\gamma}\) by its Laplace approximation. Specifically, we approximate the posterior of \(\boldsymbol{\gamma}\) as:

(140)#\[\begin{equation} \begin{split} \tilde{\boldsymbol{\gamma}}^{t} &\sim \mathcal{N}\Big(\hat{\boldsymbol{\gamma}}_{t}, \alpha^2 \boldsymbol{H}_{t}^{-1}\Big),\\ \boldsymbol{H}_{t} &= \sum_{t}\mu'(\boldsymbol{S}_{t}^{T}\hat{\boldsymbol{\gamma}}_{t})\boldsymbol{S}_{t}\boldsymbol{S}_{t}^{T}, \end{split} \end{equation}\]

where \(\hat{\boldsymbol{\gamma}}_{t}\) is the maximum likelihood estimate of \(\boldsymbol{\gamma}\) based on the interaction history up to round \(t\), \(\mu'(\cdot)\) is the derivative of the mean (logistic) function, and \(\alpha\) is a pre-specified constant controlling the degree of exploration.
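To make the Laplace approximation concrete, below is a minimal NumPy sketch of how \(\tilde{\boldsymbol{\gamma}}^{t}\) could be drawn. The names S_hist (features of the items shown so far) and gamma_hat (an MLE of \(\boldsymbol{\gamma}\), e.g., from a logistic regression on the observed clicks) are illustrative assumptions, not the internals of the causaldm implementation.

import numpy as np

def logistic(x):
    # logistic mean function: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sample_gamma(S_hist, gamma_hat, alpha=1.0, rng=None):
    # Draw gamma_tilde from the Laplace approximation in (140):
    # N(gamma_hat, alpha^2 * H^{-1}) with H = sum_i mu'(s_i^T gamma_hat) s_i s_i^T.
    rng = np.random.default_rng() if rng is None else rng
    mu = logistic(S_hist @ gamma_hat)
    w = mu * (1.0 - mu)                       # mu'(.) for the logistic link
    H = (S_hist * w[:, None]).T @ S_hist
    cov = alpha ** 2 * np.linalg.inv(H)
    return rng.multivariate_normal(gamma_hat, cov)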

Key Steps#

For round \(t = 1,2,\cdots\):

  1. Approximate \(P(\boldsymbol{\gamma}|\mathcal{H}_{t})\) by the Laplace approximation;

  2. Sample \(\tilde{\boldsymbol{\gamma}} \sim P(\boldsymbol{\gamma}|\mathcal{H}_{t})\);

  3. Update \(\tilde{\boldsymbol{\theta}}\) with \(\tilde{\theta}_{i} = logistic(\boldsymbol{s}_{i,t}^T \tilde{\boldsymbol{\gamma}})\) for each item \(i\);

  4. Take the action \(A_{t}\) w.r.t. \(\tilde{\boldsymbol{\theta}}\) such that \(A_t = \arg\max_{a \in \mathcal{A}} E(R_t(a) \mid \tilde{\boldsymbol{\theta}})\), i.e., recommend the \(K\) items with the largest \(\tilde{\theta}_{i}\);

  5. Receive reward \(R_{t}\).

*Notations can be found in either the introduction of the chapter “Structured Bandits” or the introduction of the cascading bandit problems.
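As a concrete illustration of the steps above, the following sketch performs one round, reusing the logistic and sample_gamma helpers from the sketch in the Main Idea section; S denotes the \(L \times p\) feature matrix of all candidate items. It is an illustrative outline under those assumptions, not the package's implementation.

import numpy as np

def one_round(S, S_hist, gamma_hat, K, alpha=1.0, rng=None):
    # Steps 1-2: approximate the posterior and sample gamma_tilde from it.
    gamma_tilde = sample_gamma(S_hist, gamma_hat, alpha, rng)
    # Step 3: item-level attraction probabilities theta_tilde_i = logistic(s_i^T gamma_tilde).
    theta_tilde = logistic(S @ gamma_tilde)
    # Step 4: under the cascade model, the greedy choice is the K items with the largest theta_tilde.
    A_t = np.argsort(-theta_tilde)[:K]
    return A_t, theta_tilde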

Demo Code#

Import the learner.#

import numpy as np
from causaldm.learners.CPL4.Structured_Bandits.Cascade import CascadeLinTS

Generate the Environment#

Here, we simulate an environment based on the Yelp dataset. The number of items recommended at each round, \(K\), is specified as \(3\).

from causaldm.learners.CPL4.Structured_Bandits.Cascade import _env_realCascade as _env
env = _env.Cascading_env(K = 3, seed = 0)

Specify Hyperparameters#

  • K: number of items to be recommended at each round

  • L: total number of candidate items

  • p: number of features (if the intercept is considered, p includes the intercept as well)

  • alpha: degree of exploration (default = 1)

  • retrain_freq: frequency to train the generalized linear model (i.e., update every retrain_freq steps)

  • seed: random seed

K = env.K
L = env.L
p = env.p
alpha = 1
retrain_freq = 1
seed = 0
LinTS_agent = CascadeLinTS.CascadeLinTS(K = K, L = L, p = p, alpha = alpha, 
                                        retrain_freq = retrain_freq, seed = seed)

Recommendation and Interaction#

We first observe the feature information \(\boldsymbol{S}\) by S = env.Phi. (Note: if an intercept is considered, S should include a column of ones.) Starting from t = 0, each step t consists of three parts:

  1. Recommend an action (an ordered list of restaurants) A = LinTS_agent.take_action(S)

  2. Get the reward from the environment (i.e., \(W\), \(E\), and \(R\)) W,E,R = env.get_reward(A)

  3. Update the posterior distribution LinTS_agent.receive_reward(A,W,E,t,S)

t = 0
S = env.Phi
A = LinTS_agent.take_action(S)
W,E,R = env.get_reward(A)
LinTS_agent.receive_reward(A,W,E,t,S)
A, W, E, R
(array([1301, 2087, 1123], dtype=int64),
 array([0., 0., 0.]),
 array([1., 1., 1.]),
 0.0)

Interpretation: For step 0, the agent decides to display three top restaurants, the first of which is restaurant 1301, the second is restaurant 2087, and the third is restaurant 1123. Unfortunately, the customer does not show any interest in any of the recommended restaurants. As a result, the agent receives a zero reward at round \(0\).
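To run the interaction for multiple steps, the single-step demo above can be wrapped in a loop. The sketch below reuses the env and LinTS_agent objects created earlier and tracks the cumulative reward; the horizon T = 100 is an arbitrary illustrative choice.

T = 100
S = env.Phi
cumulative_reward = 0.0
for t in range(T):
    A = LinTS_agent.take_action(S)             # recommend an ordered list of K restaurants
    W, E, R = env.get_reward(A)                # observe clicks W, examinations E, and reward R
    LinTS_agent.receive_reward(A, W, E, t, S)  # update the posterior with the new feedback
    cumulative_reward += R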

References#

[1] Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., & Kveton, B. (2016). Cascading bandits for large-scale recommendation problems. arXiv preprint arXiv:1603.05359.

[2] Kveton, B., Zaheer, M., Szepesvari, C., Li, L., Ghavamzadeh, M., & Boutilier, C. (2020, June). Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics (pp. 2066-2076). PMLR.