# Multi-Armed Bandits (MAB)
The multi-armed bandit (MAB) [1] is the most classical version of the bandit problem, in which an agent sequentially selects an item (arm) from a set of candidates and then receives a random reward for the selected item. MAB has been extensively studied and widely applied to different areas, including healthcare, recommender systems, and finance, to name a few. See [1] for a detailed review of MAB and [2] for a survey of practical applications. Since the reward distributions are typically unknown in real-world applications, the central task of a MAB algorithm is to learn the distributions from the feedback received and find the optimal item, so as to maximize the cumulative reward or, equivalently, to minimize the cumulative regret. For example, in the motivating Recommender Systems example, suppose there are a few movie genres available to be recommended; MAB algorithms aim to find and recommend the optimal movie genre whenever a user visits, with the ultimate objective of optimizing overall user satisfaction. This chapter focuses on the MAB problem by illustrating a group of classical algorithms for tackling the well-known exploration-exploitation trade-off, including

- $\epsilon$-greedy,
- Thompson Sampling (TS),
- Upper Confidence Bounds (UCB).
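As a first concrete look at how such a rule trades off exploration and exploitation, below is a minimal Python sketch of the $\epsilon$-greedy selection step (the function name `epsilon_greedy_action` and its arguments are illustrative, not taken from any particular package): with probability $\epsilon$ the agent explores a uniformly random arm, and otherwise it exploits the arm with the highest sample mean.

```python
import numpy as np

def epsilon_greedy_action(sample_means, counts, epsilon=0.1, rng=None):
    """Pick an arm: explore uniformly with probability epsilon, otherwise exploit."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts)
    K = len(sample_means)
    # Play each arm once before trusting the sample means.
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    if rng.random() < epsilon:
        return int(rng.integers(K))       # explore: uniformly random arm
    return int(np.argmax(sample_means))   # exploit: empirically best arm
```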
## Problem Setting
Let $T$ be the total number of rounds and $K$ be the number of arms (items to be recommended). At each round $t = 1, \dots, T$, the agent selects an arm $A_t \in \{0, 1, \dots, K-1\}$ and then receives a random reward $R_t$ drawn from the unknown reward distribution of the selected arm, with mean $r_a = E[R_t \mid A_t = a]$. The goal is to maximize the expected cumulative reward $E[\sum_{t=1}^{T} R_t]$ or, equivalently, to minimize the cumulative regret $T \max_a r_a - E[\sum_{t=1}^{T} R_t]$.
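To make this protocol concrete, here is a minimal simulation loop in Python, a sketch assuming Gaussian rewards with a hypothetical noise level `sigma`; the helper `run_bandit` and its interface are illustrative rather than the book's code. It lets an arbitrary selection rule pick an arm each round, draws a reward, updates the sample means, and accumulates the expected regret.

```python
import numpy as np

def run_bandit(true_means, select_arm, T=1000, sigma=1.0, seed=42):
    """Simulate T rounds of a Gaussian MAB and return the cumulative regret.

    true_means are the mean rewards r_a (unknown to the agent), and
    select_arm(sample_means, counts, t) is any arm-selection rule.
    """
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K, dtype=int)   # number of pulls of each arm so far
    sample_means = np.zeros(K)        # running estimates of r_a
    best_mean = max(true_means)
    regret = 0.0
    for t in range(1, T + 1):
        a = select_arm(sample_means, counts, t)
        reward = rng.normal(true_means[a], sigma)     # R_t from the chosen arm
        counts[a] += 1
        sample_means[a] += (reward - sample_means[a]) / counts[a]
        regret += best_mean - true_means[a]           # expected regret of this pull
    return regret
```

For instance, `run_bandit([3.2, 3.6, 3.1], lambda m, c, t: epsilon_greedy_action(m, c), T=5000)` would run the $\epsilon$-greedy sketch above on a three-armed toy problem.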
## Graphical Data Structure

## Real Data
**1. MovieLens**
MovieLens is a website that helps users find movies they like and where they rate the movies recommended to them. The MovieLens 1M dataset contains observations collected in an online movie recommendation experiment and is widely used to generate data for online bandit simulation studies. The goal of the simulation studies below is to learn the reward distributions of different movie genres and hence recommend the optimal movie genre to users, so as to optimize cumulative user satisfaction. In other words, every time a user visits the website, the agent recommends a movie genre ($a$) from five candidates:

- Comedy: $a = 0$,
- Drama: $a = 1$,
- Action: $a = 2$,
- Thriller: $a = 3$,
- Sci-Fi: $a = 4$.

Therefore, $K = 5$. The user's rating of the recommended genre is then used to define the reward, and we consider two types of reward.
**Gaussian Bandit**: the reward $R$ is a numerical variable taking values in $\{1, 2, 3, 4, 5\}$, where 1 is the least satisfied and 5 is the most satisfied. In the following table, we summarize the empirical distribution of ratings for each genre.
| Genre | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Comedy(0) | .081 | .153 | .311 | .309 | .146 |
| Drama(1) | .045 | .114 | .288 | .350 | .204 |
| Action(2) | .096 | .170 | .312 | .287 | .135 |
| Thriller(3) | .076 | .160 | .313 | .305 | .146 |
| Sci-Fi(4) | .104 | .176 | .300 | .278 | .143 |
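One simple way to turn this table into a simulation environment is to sample each reward directly from the empirical rating distribution of the recommended genre. The sketch below does exactly that; the array name `RATING_PROBS` and the helper `sample_rating` are illustrative, not part of the dataset or any library.

```python
import numpy as np

# Empirical rating distributions from the table above
# (rows: genres 0-4; columns: P(rating = 1), ..., P(rating = 5)).
RATING_PROBS = np.array([
    [.081, .153, .311, .309, .146],   # Comedy (0)
    [.045, .114, .288, .350, .204],   # Drama (1)
    [.096, .170, .312, .287, .135],   # Action (2)
    [.076, .160, .313, .305, .146],   # Thriller (3)
    [.104, .176, .300, .278, .143],   # Sci-Fi (4)
])
# Renormalize in case the rounded entries do not sum exactly to one.
RATING_PROBS /= RATING_PROBS.sum(axis=1, keepdims=True)

def sample_rating(arm, rng=None):
    """Draw a rating in {1, ..., 5} for the recommended genre `arm`."""
    rng = rng or np.random.default_rng()
    return int(rng.choice([1, 2, 3, 4, 5], p=RATING_PROBS[arm]))

# Mean satisfaction per genre implied by the table, i.e. the targets r_a
# that a Gaussian-bandit algorithm is trying to identify.
mean_ratings = RATING_PROBS @ np.array([1, 2, 3, 4, 5])
```

Under these empirical distributions the implied mean ratings are roughly 3.29 for Comedy and 3.55 for Drama, so Drama would be the optimal arm for the Gaussian bandit.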
**Bernoulli Bandit**: the reward $R$ is a binary variable, with $R = 1$ if the rating is higher than 3 and $R = 0$ otherwise. In the following table, we summarize the empirical probability of receiving a reward of 1 for each genre.
| Comedy(0) | Drama(1) | Action(2) | Thriller(3) | Sci-Fi(4) |
|---|---|---|---|---|
| .455 | .553 | .421 | .451 | .420 |
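Because the reward here is Bernoulli, Thompson Sampling with conjugate Beta priors is a natural fit. The following is a minimal sketch that treats the probabilities above as the ground truth (unknown to the agent); the function name and horizon are illustrative assumptions.

```python
import numpy as np

# Empirical success probabilities P(rating > 3) from the table above.
SUCCESS_PROBS = np.array([.455, .553, .421, .451, .420])

def thompson_sampling_bernoulli(T=5000, seed=0):
    """Beta-Bernoulli Thompson Sampling over the five movie genres."""
    rng = np.random.default_rng(seed)
    K = len(SUCCESS_PROBS)
    alpha = np.ones(K)   # Beta(1, 1) priors: 1 + number of successes
    beta = np.ones(K)    # 1 + number of failures
    for _ in range(T):
        # Sample a plausible mean reward for each arm from its posterior
        # and recommend the genre with the largest sampled value.
        theta = rng.beta(alpha, beta)
        a = int(np.argmax(theta))
        reward = float(rng.random() < SUCCESS_PROBS[a])   # Bernoulli feedback
        alpha[a] += reward
        beta[a] += 1.0 - reward
    return alpha / (alpha + beta)   # posterior mean reward of each genre

# The posterior means should concentrate around SUCCESS_PROBS, with most
# pulls eventually going to Drama (arm 1), the genre with the highest rate.
posterior_means = thompson_sampling_bernoulli()
```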
## References
[1] Slivkins, A. (2019). Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.
[2] Bouneffouf, D. and Rish, I. (2019). A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040.