LinUCB#
Overview#
Advantage: It is more scalable and efficient than UCB by utilizing features.
Disadvantage:
Application Situation: discrete action space, Gaussian or Bernoulli reward space
Main Idea#
Suppose there are \(K\) options, and the action space is \(\mathcal{A} = \{0,1,\cdots, K-1\}\). LinUCB [1] uses feature information to guide exploration by assuming a linear model between the expected potential reward and the features. Specifically, for the Gaussian bandits, we assume that
\[
R_t(a) = \boldsymbol{x}_a^T \boldsymbol{\gamma} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2).
\]
As for the Bernoulli bandits, we assume that
\[
R_t(a) \sim \text{Bernoulli}\big(\text{logistic}(\boldsymbol{x}_a^T \boldsymbol{\gamma})\big),
\]
where \(\text{logistic}(c) \equiv 1 / \{1 + \exp(-c)\}\). At each round \(t\), the upper confidence bound of the linear predictor \(\boldsymbol{x}_a^T \boldsymbol{\gamma}\) is calculated from the current estimate \(\hat{\boldsymbol{\gamma}}\) and the matrix \(\boldsymbol{V}\) as
\[
U_a^t = \boldsymbol{x}_a^T \hat{\boldsymbol{\gamma}} + \alpha \sqrt{\boldsymbol{x}_a^T \boldsymbol{V}^{-1} \boldsymbol{x}_a},
\]
where \(\alpha\) controls the degree of exploration. Finally, using the estimated upper confidence bounds, \(A_t = \arg \max_{a \in \mathcal{A}} U_a^t\) would be selected.
Key Steps#
Initialize \(\hat{\boldsymbol{\gamma}}=\boldsymbol{0}\) and \(\boldsymbol{V} = I\), and specify \(\alpha\);
For \(t = 0, 1,\cdots, T\):
Calculate the upper confidence bound \(U_a^t\) for each arm \(a \in \mathcal{A}\);
Select action \(A_t\) as the arm with the maximum \(U_a^t\);
Receive the reward \(R_t\), and update \(\hat{\boldsymbol{\gamma}}\) and \(\boldsymbol{V}\).
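To make these steps concrete, below is a minimal NumPy sketch of the Gaussian-reward version of this loop. It is an illustrative re-implementation rather than the causaldm LinUCB.LinUCB_Gaussian class itself; the feature matrix X, the reward function pull, and the horizon T are hypothetical placeholders.
import numpy as np

def linucb_gaussian(X, pull, T, alpha=0.1):
    # X: (K, p) array of arm features; pull(a): hypothetical function returning a noisy reward for arm a
    K, p = X.shape
    V = np.eye(p)              # V = I
    b = np.zeros(p)            # running sum of R_t * x_{A_t}
    gamma_hat = np.zeros(p)    # current estimate of gamma
    for t in range(T):
        V_inv = np.linalg.inv(V)
        # U_a^t = x_a^T gamma_hat + alpha * sqrt(x_a^T V^{-1} x_a), for every arm a
        ucb = np.array([x @ gamma_hat + alpha * np.sqrt(x @ V_inv @ x) for x in X])
        A = int(np.argmax(ucb))            # A_t = argmax_a U_a^t
        R = pull(A)                        # receive the reward
        V += np.outer(X[A], X[A])          # V <- V + x_A x_A^T
        b += R * X[A]                      # b <- b + R_t * x_A
        gamma_hat = np.linalg.solve(V, b)  # gamma_hat = V^{-1} b
    return gamma_hat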
Demo Code#
Import the learner.#
import numpy as np
from causaldm.learners.CPL4.CMAB import LinUCB
Generate the Environment#
Here, we simulate an environment based on the MovieLens data.
from causaldm.learners.CPL4.CMAB import _env_realCMAB as _env
env = _env.Single_Contextual_Env(seed = 0, Binary = False)
Specify Hyperparameters#
K: number of arms
p: number of features per arm
alpha: rate of exploration
seed: random seed
exploration_T: number of rounds to do random exploration at the beginning
alpha = .1
K = env.K
p = env.p
seed = 42
exploration_T = 10
LinUCB_Gaussian_agent = LinUCB.LinUCB_Gaussian(alpha = alpha, K = K, p = p, seed = seed, exploration_T = exploration_T)
Recommendation and Interaction#
Starting from t = 0, for each step t, there are four steps:
Observe the feature information
X, feature_info = env.get_Phi(t)
Recommend an action
A = LinUCB_Gaussian_agent.take_action(X)
Get the reward from the environment
R = env.get_reward(t,A)
Update the estimates \(\hat{\boldsymbol{\gamma}}\) and \(\boldsymbol{V}\)
LinUCB_Gaussian_agent.receive_reward(t,A,R,X)
t = 0
X, feature_info = env.get_Phi(t)
A = LinUCB_Gaussian_agent.take_action(X)
R = env.get_reward(t,A)
LinUCB_Gaussian_agent.receive_reward(t,A,R,X)
t,A,R,feature_info
(0, 4, 4.0, (25.0, 1.0, 'college/grad student'))
Interpretation: At step 0, the LinUCB agent encounters a male user who is a 25-year-old college/grad student. Given this information, the agent recommends a Sci-Fi movie (arm 4) and receives a rating of 4 from the user.
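To run the agent for more than a single step, the same four calls can simply be wrapped in a loop. Below is a minimal sketch, assuming the same env and LinUCB_Gaussian_agent objects created above and a hypothetical horizon T:
T = 100  # hypothetical horizon
for t in range(T):
    X, feature_info = env.get_Phi(t)                  # observe the features
    A = LinUCB_Gaussian_agent.take_action(X)          # recommend an action
    R = env.get_reward(t, A)                          # receive the reward
    LinUCB_Gaussian_agent.receive_reward(t, A, R, X)  # update the estimates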
Demo Code for Bernoulli Bandit#
The steps are similar to those shown above for the Gaussian bandit.
env = _env.Single_Contextual_Env(seed = 0, Binary = True)
K = env.K
p = env.p
seed = 42
alpha = 1 # exploration rate
retrain_freq = 1 # frequency of retraining the GLM model
exploration_T = 10
LinUCB_GLM_agent = LinUCB.LinUCB_GLM(K = K, p = p , alpha = alpha, retrain_freq = retrain_freq,
seed = seed, exploration_T = exploration_T)
t = 0
X, feature_info = env.get_Phi(t)
A = LinUCB_GLM_agent.take_action(X)
R = env.get_reward(t,A)
LinUCB_GLM_agent.receive_reward(t,A,R,X)
t,A,R,feature_info
(0, 3, 0, (25.0, 1.0, 'college/grad student'))
Interpretation: At step 0, the LinUCB agent encounters a male user who is a 25-year-old college/grad student. Given this information, the agent recommends a Thriller movie (arm 3) and receives a reward of 0 from the user.
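As with the Gaussian agent, the single-step calls above can be wrapped in a loop to run the Bernoulli agent over a longer horizon. A minimal sketch, assuming the same env and LinUCB_GLM_agent objects created above and a hypothetical horizon T:
T = 100  # hypothetical horizon
for t in range(T):
    X, feature_info = env.get_Phi(t)             # observe the features
    A = LinUCB_GLM_agent.take_action(X)          # recommend an action
    R = env.get_reward(t, A)                     # receive the binary reward
    LinUCB_GLM_agent.receive_reward(t, A, R, X)  # update the GLM estimates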
References#
[1] Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011, June). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 208-214). JMLR Workshop and Conference Proceedings.