UCB#
Overview#
Advantage: There is no need to specify any hyperparameters. The algorithm explicitly quantifies the uncertainty of its reward estimates to balance exploration and exploitation.
Disadvantage: Inefficient when the number of arms is large.
Application Situation: discrete action space, binary/Gaussian reward space
Main Idea#
As the name suggests, the UCB algorithm estimates the upper confidence bound \(U_{a}^{t}\) of the mean of the potential reward of arm \(a\), \(R_t(a)\), based on the observations, and then chooses the action with the highest estimate. The class of UCB-based algorithms was first introduced by Auer et al. [1]. Generally, at each round \(t\), \(U_{a}^{t}\) is calculated as the sum of the estimated reward (exploitation) and the estimated confidence radius (exploration) of arm \(a\) based on previous observations. Then, \(A_{t}\) is selected as
\[A_t = \arg\max_{a} U_a^{t}.\]
As an example, UCB1 [1] estimates the confidence radius as \(\sqrt{\frac{2\log(t)}{\text{# times arm } a \text{ played so far}}}\). In this way, either an arm with a large average reward or an arm that has received little exploration will be selected. Note that this algorithm supports both binary and continuous rewards.
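For intuition, here is a small numerical sketch of the index computation; the counts and sample means below are made-up numbers purely for illustration.
import numpy as np

# Hypothetical state at round t = 10 (illustrative numbers only).
t = 10
sample_mean = np.array([0.7, 0.5, 0.3])  # estimated mean reward of each arm
pull_count = np.array([8, 4, 2])         # times each arm has been played so far

# UCB index = estimated reward (exploitation) + confidence radius (exploration).
ucb_index = sample_mean + np.sqrt(2 * np.log(t) / pull_count)

# Arm 2 has the smallest estimated mean but the largest index because it has
# been explored the least, so it is the arm selected at this round.
print(ucb_index, np.argmax(ucb_index))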
Algorithm Details#
Suppose there are \(K\) options, and the action space is \(\mathcal{A} = \{0,1,\cdots, K-1\}\). The UCB1 algorithm starts by initializing the estimated upper confidence bound \(U_a^{0}\) and the pull count \(C_a^{0}\) of each action \(a\) to 0. At each round \(t\), we greedily select an action \(A_t\) as
\[A_t = \arg\max_{a \in \mathcal{A}} U_a^{t}.\]
After observing the reward corresponding to the selected action \(A_t\), we first update its pull count accordingly. Then, we update the upper confidence bound of \(A_t\) as the sum of its sample-mean reward (exploitation) and a confidence radius (exploration); the exact update is given in the key steps below.
Key Steps#
Initialize \(\boldsymbol{U}^0\) and \(\boldsymbol{C}^0\) as 0 for each of the \(K\) arms;
For \(t = 0, 1, \cdots, T\):
Select the action \(A_t\) with the maximum \(U_a^t\);
Receive the reward \(R_t\), and update \(C\) and \(U\) with
\[\begin{align} C_{A_{t}}^{t+1} &= C_{A_{t}}^{t} + 1 \\ U_{A_{t}}^{t+1} &= \frac{1}{C_{A_{t}}^{t+1}}\sum_{t'=0}^{t}R_{t'}I(A_{t'}=A_{t}) + \sqrt{\frac{2\log(t+1)}{C_{A_{t}}^{t+1}}} \end{align}\]
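The following is a minimal standalone sketch of these key steps (not the causaldm implementation; the class name, the initial round-robin pass over the arms, and the method signatures are illustrative assumptions).
import numpy as np

class UCB1Sketch:
    """Minimal sketch of the UCB1 key steps above (illustrative only)."""
    def __init__(self, K):
        self.K = K
        self.C = np.zeros(K)            # pull counts
        self.reward_sum = np.zeros(K)   # running sum of rewards per arm
        self.U = np.zeros(K)            # upper confidence bounds

    def take_action(self, t):
        # Common convention (an assumption here): pull each arm once first
        # so that the confidence radius is well defined for every arm.
        if t < self.K:
            return t
        return int(np.argmax(self.U))

    def receive_reward(self, t, a, r):
        self.C[a] += 1
        self.reward_sum[a] += r
        # U_a = sample mean (exploitation) + confidence radius (exploration)
        self.U[a] = self.reward_sum[a] / self.C[a] \
            + np.sqrt(2 * np.log(t + 1) / self.C[a])
Inside an interaction loop, such a sketch would be used as A = agent.take_action(t) followed by agent.receive_reward(t, A, R) after the reward is observed.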
Demo Code#
Import the learner.#
import numpy as np
from causaldm.learners.CPL4.MAB import UCB
Generate the Environment#
Here, we simulate an environment based on the MovieLens data.
from causaldm.learners.CPL4.MAB import _env_realMAB as _env
env = _env.Single_Gaussian_Env(seed = 42)
Specify Hyperparameters#
K: # of arms
UCB_agent = UCB.UCB1(env.K)
Recommendation and Interaction#
Starting from \(t = 0\), each step \(t\) consists of three parts:
Recommend an action
A = UCB_agent.take_action()
Get the reward from the environment
R = env.get_reward(A)
Update the upper confidence bound
UCB_agent.receive_reward(t,A,R)
t = 0
A = UCB_agent.take_action()
R = env.get_reward(A)
UCB_agent.receive_reward(t,A,R)
t, A, R
(0, 0, 2)
Interpretation: At step 0, the UCB agent recommended a Comedy movie (arm 0) and received a rating of 2 from the environment.
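The single interaction above extends naturally to a longer run. Below is a sketch of the recommend-observe-update loop, assuming the env and UCB_agent objects created above are still in scope and using the same API (take_action, get_reward, receive_reward); the horizon T is an arbitrary choice for illustration.
T = 100                  # illustrative horizon
rewards = []
for t in range(1, T):    # step t = 0 was already run above
    A = UCB_agent.take_action()
    R = env.get_reward(A)
    UCB_agent.receive_reward(t, A, R)
    rewards.append(R)
print(np.mean(rewards))  # average observed rating over the run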
Demo Code for Bernoulli Bandit#
The steps are similar to those previously performed with a Gaussian Bandit.
env = _env.Single_Bernoulli_Env(seed=42)
UCB_agent = UCB.UCB1(env.K)
t = 0
A = UCB_agent.take_action()
R = env.get_reward(A)
UCB_agent.receive_reward(t,A,R)
t, A, R
(0, 0, 0)
Interpretation: At step 0, the UCB agent recommended a Comedy movie (arm 0) and received a reward of 0 from the environment.
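As with the Gaussian bandit, the interaction can be run for many rounds. The sketch below, again assuming the objects created above and an illustrative horizon, additionally tracks how often each arm is pulled, which shows UCB concentrating on arms with higher estimated success probabilities while still occasionally exploring the others.
T = 500                  # illustrative horizon
pulls = np.zeros(env.K)  # how often each arm is chosen
for t in range(1, T):    # step t = 0 was already run above
    A = UCB_agent.take_action()
    R = env.get_reward(A)
    UCB_agent.receive_reward(t, A, R)
    pulls[A] += 1
print(pulls)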
References#
[1] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256.