Structured Bandit (Slate Recommendation)#
This chapter focuses on slate recommendation problems, where the agent needs to recommend a slate of items at each round. Here, an item can be a product, a web page, a movie, etc., depending on the application.
Problem Setting#
Let \(T\) be the total number of rounds and \(N\) the total number of items available to recommend. At each round \(t = 1, \dots, T\), the agent chooses a slate of items. In other words, an action \(a\) is defined as a subset of items, and the action space \(\mathcal{A}\) consists of all subsets satisfying some problem-specific constraints. After the action is taken, we receive the corresponding stochastic observations \(\boldsymbol{Y}_t\) from the environment, which in turn determine the reward \(R_t\). We assume that the potential observation \(\boldsymbol{Y}_t(a)\) is generated by a domain model \(f\), and that the potential reward \(R_t(a)\) is determined through a deterministic function \(f_r\). Mathematically, we consider this popular and general class of bandit problems [1].
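Concretely, the generative structure at each round \(t\) can be summarized as below. This display is a reconstruction from the definitions above, writing \(\boldsymbol{\theta}\) for the unknown parameters of the domain model and letting \(f_r\) depend on both the action and its observation; the exact argument lists are our assumptions rather than formulas quoted from [1].

\[
\boldsymbol{Y}_t(a) \sim f(a; \boldsymbol{\theta}),
\qquad
R_t(a) = f_r\big(a, \boldsymbol{Y}_t(a)\big),
\qquad
(\boldsymbol{Y}_t, R_t) = \big(\boldsymbol{Y}_t(A_t),\, R_t(A_t)\big),
\]

where \(A_t \in \mathcal{A}\) is the slate chosen at round \(t\).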
This problem setting subsumes many popular bandit problems, such as
Dynamic Assortment Optimization: Multinomial Logit Bandits [2] aim to find the most profitable subset of items to offer, especially when substitution effects exist;
Online Learning to Rank: Cascading Bandits [3] are used to characterize how a user interacts with an ordered list of \(K\) items;
Online Combinatorial Optimization: Combinatorial Semi-Bandits [4], which have numerous applications, including maximum weighted matching, ad allocation, and news page optimization;
and many others.
Note that, for any positive integer \(N\), we denote the set \(\{1, \dots, N\}\) by \([N]\), and we denote the cardinality of a set \(A\) by \(|A|\). \(\mathcal{H}_{t}\) denotes the history of observations, containing all tuples of actions and the corresponding feedback received in previous rounds, excluding round \(t\); specifically, \(\mathcal{H}_{t}=\{(A_{i},\boldsymbol{Y}_{i},R_{i})\}_{i=1}^{t-1}\). If feature information is available and used, we denote by \(\boldsymbol{s}_i\) the \(d\)-dimensional vector of item-specific features, and the history becomes \(\mathcal{H}_{t}=\{(A_{i},\boldsymbol{Y}_{i},R_{i},\boldsymbol{S}_i)\}_{i=1}^{t-1}\), where \(\boldsymbol{S}_i\) collects the relevant item features at round \(i\).
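Purely as an illustrative data structure (the names below, such as `Round` and `history`, are our own and not from any package used in this chapter), the history \(\mathcal{H}_t\) can be represented as a growing list of per-round tuples:

```python
from typing import List, NamedTuple, Optional
import numpy as np

class Round(NamedTuple):
    action: List[int]                       # the recommended slate A_i, a subset of [N]
    feedback: np.ndarray                    # the stochastic observation Y_i
    reward: float                           # the reward R_i
    features: Optional[np.ndarray] = None   # item features S_i, if available and used

history: List[Round] = []   # H_1 is empty; H_t collects rounds 1, ..., t-1

# After round i, append the observed tuple (A_i, Y_i, R_i[, S_i]).
history.append(Round(action=[3, 17, 42], feedback=np.array([0, 1, 0]), reward=1.0))
```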
Real Data#
1. MovieLens
MovieLens is a website that helps users find movies they like and on which they rate the recommended movies. The MovieLens 1M dataset contains observations collected in an online movie recommendation experiment and is widely used to generate data for online bandit simulation studies. Here, in contrast to the setting used for the simulation studies in the Multi-Armed Bandit Chapter, we treat each movie as an item and recommend a slate of several movies at each round of interaction. Using low-rank matrix factorization, features and a true utility \(\in [0,1]\) are retrieved from the dataset for each movie. For further information on the data preparation, see [6].
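For intuition only, the sketch below shows one way five-dimensional movie features and a utility bounded in \([0,1]\) could be extracted from a user-by-movie ratings matrix via truncated SVD. The matrix `ratings`, the rescaling step, and all variable names are illustrative assumptions; they are not claimed to reproduce the exact preprocessing pipeline of [6].

```python
import numpy as np

# Hypothetical user-by-movie ratings matrix; in practice it would be built from
# the MovieLens 1M ratings file (and would be sparse, with missing entries).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(500, 1000)).astype(float)

n_factors = 5  # five-dimensional movie features, as used in this chapter

# Low-rank factorization via truncated SVD: ratings ~ U_k diag(s_k) Vt_k.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
movie_features = (Vt[:n_factors, :] * s[:n_factors, None]).T   # shape (1000, 5)

# One simple way to define a "true utility" in [0, 1] per movie: rescale the
# average rating predicted by the rank-5 reconstruction to the unit interval.
low_rank = U[:, :n_factors] @ np.diag(s[:n_factors]) @ Vt[:n_factors, :]
avg_pred = low_rank.mean(axis=0)
true_utility = (avg_pred - avg_pred.min()) / (avg_pred.max() - avg_pred.min())
```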
With the same aim of learning the reward distributions of different movies and then recommending the best slate of movies to maximize cumulative user satisfaction, we use the same pre-processed dataset for the following three simulation studies, whose settings and numerical objectives are described below (an illustrative simulation sketch covering all three feedback models follows the list). This chapter focuses on the 1,000 movies with the most reviews, each of which has five-dimensional features.
Dynamic Assortment Optimization: The ultimate goal is to discover the slate of movies with the highest probability that at least one of the recommended movies appeals to the user. When a user visits the recommender system, the agent recommends a slate of movies (\(A_t\)), and the user either clicks on one of the movies (\(R_t=1\)) or leaves without clicking (\(R_t = 0\)).
Online Learning to Rank: Similarly, the ultimate goal is to discover the optimal ranked slate of movies with the highest probability that at least one of the recommended movies appeals to the user. When a user visits the recommender system, the agent recommends a ranked slate of movies (\(A_t\)), and the user browses the recommended list from top to bottom, clicking on a movie (\(R_t=1\)) as soon as one attracts them and stopping there, or leaving without clicking (\(R_t = 0\)) if no movie in the list attracts them.
Online Combinatorial Optimization: We now consider the ultimate goal of finding the set of movies with the highest total utility, consisting of one comedy movie, one drama movie, one action movie, one thriller movie, and one movie from another genre. When a user visits the recommender system, the agent recommends a slate of movies (\(A_t\)), and the user rates each of the recommended movies, indicating each movie's real-time utility. The received reward \(R_t\) is then defined as the sum of the movies' ratings.
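To make the three feedback mechanisms concrete, here is a minimal, self-contained simulation sketch. It assumes every movie has a known utility `utility[i]` in \([0,1]\) (as produced by the preprocessing above) and uses standard textbook forms of the multinomial-logit choice model, the cascade browsing model, and an additive noisy-rating model; it is not claimed to match the exact simulators used in [6] or later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(42)
n_items = 1000
utility = rng.uniform(0.0, 1.0, size=n_items)  # assumed per-movie utilities in [0, 1]

def mnl_feedback(slate):
    """Dynamic assortment: the user picks at most one movie following a
    multinomial logit model; choosing the 'no-click' option gives R_t = 0."""
    v = np.exp(utility[slate])                     # attraction weights of offered movies
    probs = np.append(v, 1.0) / (v.sum() + 1.0)    # last entry = leave without clicking
    choice = rng.choice(len(slate) + 1, p=probs)
    return 0.0 if choice == len(slate) else 1.0

def cascade_feedback(ranked_slate):
    """Online learning to rank: the user scans the ranked list top to bottom and
    clicks the first movie that attracts them (probability = its utility)."""
    for item in ranked_slate:
        if rng.random() < utility[item]:
            return 1.0                             # clicked, stop browsing
    return 0.0                                     # reached the end without a click

def semi_bandit_feedback(slate, noise_sd=0.1):
    """Online combinatorial optimization: the user rates every recommended movie;
    the reward is the sum of the (noisy) ratings. The one-movie-per-genre
    constraint on the slate is the agent's responsibility, not modeled here."""
    ratings = utility[slate] + rng.normal(0.0, noise_sd, size=len(slate))
    return float(ratings.sum())

slate = rng.choice(n_items, size=5, replace=False)
print(mnl_feedback(slate), cascade_feedback(slate), semi_bandit_feedback(slate))
```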
Reference#
[1] Russo, D., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. (2017). A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038.
[2] Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. (2017). Thompson sampling for the MNL-bandit. In Conference on Learning Theory, pages 76–78. PMLR.
[3] Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. (2015). Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776. PMLR.
[4] Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159. PMLR.
[6] Wan, R., Ge, L., and Song, R. (2022). Towards scalable and robust structured bandits: A meta-learning framework. arXiv preprint arXiv:2202.13227.