Doubly Robust Online Policy Evaluator#

Evaluating the performance of an ongoing policy plays a vital role in many areas, such as medicine and economics, by providing crucial guidance on early stopping of an online experiment and timely feedback from the environment. Policy evaluation in online learning has therefore attracted increasing attention, with the goal of inferring the mean outcome of the optimal policy (i.e., its value) in real time. We introduce the doubly robust online policy evaluator to infer the value under the estimated optimal policy in online learning.

Application Situations:

  1. Dependent data generated in the online environment;

  2. Unknown optimal policy or known fixed policy;

  3. An exploration-exploitation trade-off in the data collection.

Advantage of the Evaluator:

  1. Double protection of the consistency of the value estimator for the unknown optimal/target policy;

  2. A Wald-type confidence interval for the value estimator;

  3. Flexibility and ease of implementation in practice.

Main Idea#

Overview#

We focus on online policy evaluation for contextual linear bandits with a deterministic policy for illustration. We first explicitly characterize the probability of exploration in online policy optimization under three commonly used bandit algorithms: Upper Confidence Bound (UCB), Thompson Sampling (TS), and \(\epsilon\)-Greedy (EG). We then propose the doubly robust interval estimation (DREAM) method to infer the mean outcome of the optimal online policy. DREAM provides double protection of the consistency of the proposed value estimator: the estimator is consistent for the true value provided that either the model of the probability of exploitation or the model of the conditional mean outcome is correctly specified. Under standard assumptions for inferring the online sample mean, we show that the value estimator under DREAM is asymptotically normal, and we provide a Wald-type confidence interval.

(Figure) Upper panel: the architecture of offline policy evaluation, with offline context-action-outcome triples \(\{(x_t,a_t,r_t)\}\) stored in the buffer to learn the value under a target policy \(\pi^*\) from data generated by a behavior policy \(\pi_b\). Lower panel: the architecture of DREAM for policy evaluation in online learning, where the context-action-outcome triple at time \(t\), \((x_t,a_t,r_t)\), is stored in the buffer to update the bandit policy \(\pi_t\) and, at the same time, to evaluate its performance.

Framework#

Given a context \(\boldsymbol{x}_t\), we choose an action \(a_t\) based on a policy \(\pi(\cdot)\), defined as a deterministic function that maps the context space to the action space, \(\pi: \mathcal{X} \to \mathcal{A}\). We then receive a reward \(r_t\), for all \(t \in \mathcal{T}\). Denote the historical observations prior to time step \(t\) by \(\mathcal{H}_{t-1}=\{\boldsymbol{x}_i,a_i,r_i\}_{1\leq i\leq t-1}\). Define the potential reward \(R^*(A=a)\) as the reward we would observe had the action been \(A=a\). Then the value (Dudík et al., 2011) of a given policy \(\pi(\cdot)\) is

\[\begin{equation*} V(\pi)\equiv \mathbb{E}_{\boldsymbol{X} \sim P_{\mathcal{X}}}\left[\mathbb{E}\{R^*(A=\pi(\boldsymbol{X} ))\}\right] \overset{\text{(SUTVA, ignorability)}}{=}\mathbb{E}_{\boldsymbol{X} \sim P_{\mathcal{X}}}\left[\mathbb{E}\{R\mid\boldsymbol{X}, A =\pi(\boldsymbol{X} )\}\right] =\mathbb{E}_{\boldsymbol{X} \sim P_{\mathcal{X}}}\left[\mu\{ \boldsymbol{X} , \pi(\boldsymbol{X} )\}\right] . \end{equation*}\]

We define the optimal policy as \(\pi^*(\boldsymbol{x}) \equiv \arg\max_{a\in \mathcal{A}} \mu(\boldsymbol{x},a), \forall \boldsymbol{x} \in \mathcal{X}\), which selects the optimal action based on the conditional mean outcome function \(\mu(\boldsymbol{x},a)\equiv\mathbb{E}(R\mid \boldsymbol{X}=\boldsymbol{x}, A=a)\) given a context \(\boldsymbol{x}\). Thus, the optimal value is \(V^*\equiv V(\pi^*)=\mathbb{E}_{\boldsymbol{X} \sim P_{\mathcal{X}}}\left[\mu\{ \boldsymbol{X} , \pi^*(\boldsymbol{X} )\}\right]\). In the rest of this section, to simplify the exposition, we focus on two actions, i.e., \(\mathcal{A}=\{0,1\}\). Then the optimal policy is given by

\[\begin{equation*} \pi^*(\boldsymbol{x}) \equiv \arg\max_{a\in \mathcal{A}} \mu(\boldsymbol{x},a) = \mathbb{I}\{\mu(\boldsymbol{x},1)>\mu(\boldsymbol{x},0)\}, \quad \forall \boldsymbol{x} \in \mathcal{X}. \end{equation*}\]

Our goal is to infer the value under the optimal policy \(\pi^*\) from the online data generated sequentially by some bandit algorithm. Since the optimal policy is unknown, we estimate it from the online data as \(\widehat{\pi}_t\) based on \(\mathcal{H}_{t}\).
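As a concrete illustration, the short sketch below shows how the estimated optimal (greedy) policy \(\widehat{\pi}_t\) can be computed under a linear conditional mean outcome model \(\mu(\boldsymbol{x},a)=\boldsymbol{x}^\top\boldsymbol{\beta}_a\), fitted by ridge regression separately for each arm. This is a minimal sketch with our own (hypothetical) function and variable names, not part of any package API.

```python
import numpy as np

def fit_mu_hat(H, d, lam=1.0):
    """Ridge estimate of the conditional mean outcome mu(x, a) = x' beta_a,
    fitted separately for each arm from the history H = [(x, a, r), ...]."""
    beta = {}
    for arm in (0, 1):
        X = np.array([x for x, a, r in H if a == arm]).reshape(-1, d)
        R = np.array([r for x, a, r in H if a == arm])
        beta[arm] = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ R)
    return beta

def pi_hat(x, beta):
    """Estimated optimal (greedy) policy: I{mu_hat(x, 1) > mu_hat(x, 0)}."""
    return int(x @ beta[1] > x @ beta[0])

# Example: a toy history with d = 2 dimensional contexts.
H = [(np.array([1.0, 0.5]), 1, 0.8), (np.array([0.2, -1.0]), 0, 0.1)]
beta = fit_mu_hat(H, d=2)
print(pi_hat(np.array([0.4, 0.3]), beta))
```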

Probability of Exploration#

We next quantify the probability of exploring non-optimal actions at each time step. Define the exploration indicator as \(\mathbb{I}\{a_t\not = \widehat{\pi}_t(\boldsymbol{x}_t )\}\), indicating whether the action taken by the bandit algorithm differs from the estimated optimal action that exploits the historical information. Here, \(\widehat{\pi}_t\) can be viewed as the greedy policy at time step \(t\). Thus the probability of exploration is defined as

\[\begin{equation*} \kappa_t(\boldsymbol{x}_t)\equiv Pr\{a_t\not = \widehat{\pi}_t(\boldsymbol{x}_t )\} = \mathbb{E} [ \mathbb{I}\{a_t\not = \widehat{\pi}_t(\boldsymbol{x}_t )\}], \end{equation*}\]

where the expectation in the last term is taken with respect to \(a_t \in \mathcal{A}\) and the history \(\mathcal{H}_{t-1}\).
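For some bandit algorithms, \(\kappa_t\) has a simple closed form. The sketch below (our own illustration; the names are hypothetical) covers two such cases: \(\epsilon\)-greedy with two arms, where, under the convention that the exploration step samples uniformly over both arms, the drawn action differs from the greedy one with probability \(\epsilon/2\); and Gaussian Thompson Sampling with independent arm-wise posteriors, where exploration occurs exactly when the sampled reward gap reverses the sign of the estimated gap \(\widehat{\Delta}=\boldsymbol{x}^\top(\widehat{\boldsymbol{\beta}}_1-\widehat{\boldsymbol{\beta}}_0)\), which happens with probability \(\Phi(-|\widehat{\Delta}|/s)\) for the posterior standard deviation \(s\) of the gap.

```python
import numpy as np
from scipy.stats import norm

def kappa_eg(eps, n_arms=2):
    """epsilon-greedy: with probability eps an arm is drawn uniformly at
    random, and it differs from the greedy arm w.p. (n_arms - 1)/n_arms."""
    return eps * (n_arms - 1) / n_arms          # = eps / 2 for two arms

def kappa_ts(x, beta_hat, Sigma):
    """Gaussian Thompson Sampling with independent arm-wise posteriors
    N(beta_hat[a], Sigma[a]): exploration occurs when the sampled gap
    x'(theta_1 - theta_0) has the opposite sign of the estimated gap."""
    gap = x @ beta_hat[1] - x @ beta_hat[0]
    sd = np.sqrt(x @ Sigma[1] @ x + x @ Sigma[0] @ x)
    return norm.cdf(-abs(gap) / sd)
```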

Doubly Robust Online Policy Evaluator#

In contrast to the probability of exploration, we define the probability of exploitation as

\[\begin{equation*} Pr\{a_t = \widehat{\pi}_t(\boldsymbol{x}_t )\} = 1-{\kappa}_t (\boldsymbol{x}_t ) = \mathbb{E}[ \mathbb{I}\{a_t = \widehat{\pi}_t(\boldsymbol{x}_t )\}]. \end{equation*}\]

Following the doubly robust value estimator in Dudík et al. (2011), we propose the following doubly robust mean outcome estimator as the value estimator under the optimal policy:

(159)#\[\begin{equation}\label{dr_est} \widehat{V}_T={1\over T }\sum_{t=1}^T \left( {\mathbb{I}\{a_t=\widehat{\pi}_t(\boldsymbol{x}_t )\} \over 1-\widehat{\kappa}_{t}(\boldsymbol{x}_t )} \Big[ r_t -\widehat{\mu}_{t-1}\{\boldsymbol{x}_t ,\widehat{\pi}_t(\boldsymbol{x}_t )\}\Big] + \widehat{\mu}_{t-1}\{\boldsymbol{x}_t ,\widehat{\pi}_t(\boldsymbol{x}_t ) \} \right), \end{equation}\]

where \(T\) is the current (or termination) time, and \(1-\widehat{\kappa}_{t}(\boldsymbol{x}_t )\) is the estimated probability that the chosen action \(a_t\) matches the estimated optimal action given \(\boldsymbol{x}_t\), i.e., the probability of exploitation. Our value estimator provides double protection of consistency to the true value: it is consistent provided that either the model of the probability of exploitation or the model of the conditional mean outcome is correctly specified.
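The sketch below translates Equation (159) into code. It takes, for each time step, the exploitation indicator, the estimated exploitation probability \(1-\widehat{\kappa}_t(\boldsymbol{x}_t)\), the observed reward, and the plug-in prediction \(\widehat{\mu}_{t-1}\{\boldsymbol{x}_t,\widehat{\pi}_t(\boldsymbol{x}_t)\}\), and returns \(\widehat{V}_T\) together with a Wald-type confidence interval. As a simplifying assumption, the standard error is the sample standard deviation of the per-step doubly robust terms divided by \(\sqrt{T}\); this plug-in choice is for illustration and is not necessarily the exact variance estimator derived in Cai et al. (2021).

```python
import numpy as np
from scipy.stats import norm

def dream_estimate(match, exploit_prob, reward, mu_hat_opt, alpha=0.05):
    """Doubly robust value estimate (Equation (159)) with a Wald-type CI.

    match        : 1{a_t = pi_hat_t(x_t)} for t = 1..T
    exploit_prob : estimated 1 - kappa_t(x_t)
    reward       : observed r_t
    mu_hat_opt   : mu_hat_{t-1}(x_t, pi_hat_t(x_t))
    """
    match = np.asarray(match, dtype=float)
    exploit_prob = np.asarray(exploit_prob, dtype=float)
    reward = np.asarray(reward, dtype=float)
    mu_hat_opt = np.asarray(mu_hat_opt, dtype=float)

    terms = match / exploit_prob * (reward - mu_hat_opt) + mu_hat_opt
    T = len(terms)
    v_hat = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(T)        # plug-in standard error
    z = norm.ppf(1 - alpha / 2)
    return v_hat, (v_hat - z * se, v_hat + z * se)
```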

Demo Code#

In the following, we illustrate how to apply the doubly robust online policy evaluator to data collected online under UCB, TS, and EG, respectively.

1. Policy Evaluation under UCB#
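Below is a self-contained sketch on simulated data (not tied to a real dataset or to any package API; all names and parameters are our own) that runs LinUCB on a two-armed contextual linear bandit and forms the DREAM estimate with its Wald-type interval along the way. Since the UCB action is deterministic given the history, we estimate the exploitation probability with a crude smoothed running frequency of the matching indicator, purely for illustration; because the outcome model is correctly specified here, the double robustness of the estimator still applies. The explicit characterization of the exploration probability under UCB is given in Cai et al. (2021).

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, ucb_c = 3, 2000, 1.0, 1.0
beta_true = {0: np.array([0.3, -0.2, 0.1]), 1: np.array([0.1, 0.4, -0.3])}

# Per-arm ridge statistics for the working outcome model mu(x, a) = x' beta_a.
G = {a: lam * np.eye(d) for a in (0, 1)}     # X'X + lam * I
b = {a: np.zeros(d) for a in (0, 1)}         # X'r
match, exploit_prob, rewards, mu_opt = [], [], [], []
n_match = 0

for t in range(1, T + 1):
    x = rng.normal(size=d)
    beta_hat = {a: np.linalg.solve(G[a], b[a]) for a in (0, 1)}

    # LinUCB action: estimated mean reward plus an exploration bonus.
    ucb = {a: x @ beta_hat[a] + ucb_c * np.sqrt(x @ np.linalg.solve(G[a], x))
           for a in (0, 1)}
    a_t = int(ucb[1] > ucb[0])

    # Estimated optimal (greedy) action and its plug-in prediction mu_hat_{t-1}.
    pi_t = int(x @ beta_hat[1] > x @ beta_hat[0])
    mu_opt.append(x @ beta_hat[pi_t])

    # Exploitation indicator; the exploitation probability is estimated here by
    # a smoothed running frequency of past matches (illustrative stand-in only).
    exploit_prob.append(max((n_match + 1.0) / (t + 1.0), 0.1))
    m = int(a_t == pi_t)
    match.append(m)
    n_match += m

    # Observe the reward and update the ridge statistics of the chosen arm.
    r = x @ beta_true[a_t] + rng.normal(scale=0.5)
    rewards.append(r)
    G[a_t] += np.outer(x, x)
    b[a_t] += r * x

# DREAM value estimate, Equation (159), with a Wald-type 95% interval.
terms = (np.array(match) / np.array(exploit_prob)
         * (np.array(rewards) - np.array(mu_opt)) + np.array(mu_opt))
v_hat, se = terms.mean(), terms.std(ddof=1) / np.sqrt(T)
print(f"V_hat = {v_hat:.3f}, 95% CI = ({v_hat - 1.96 * se:.3f}, {v_hat + 1.96 * se:.3f})")
```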

2. Policy Evaluation under TS#
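The Thompson Sampling demo follows the same simulated setup (again with hypothetical names and parameters of our own choosing). With independent Gaussian posteriors per arm, the exploitation probability \(1-\kappa_t(\boldsymbol{x}_t)\) is available in closed form as \(\Phi(|\widehat{\Delta}|/s)\), which we plug into the estimator directly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, T, lam, sigma2 = 3, 2000, 1.0, 0.25
beta_true = {0: np.array([0.3, -0.2, 0.1]), 1: np.array([0.1, 0.4, -0.3])}

G = {a: lam * np.eye(d) for a in (0, 1)}     # X'X + lam * I
b = {a: np.zeros(d) for a in (0, 1)}         # X'r
match, exploit_prob, rewards, mu_opt = [], [], [], []

for t in range(T):
    x = rng.normal(size=d)
    beta_hat = {a: np.linalg.solve(G[a], b[a]) for a in (0, 1)}
    Sigma = {a: sigma2 * np.linalg.inv(G[a]) for a in (0, 1)}

    # Thompson Sampling: draw coefficients from each arm's Gaussian posterior.
    theta = {a: rng.multivariate_normal(beta_hat[a], Sigma[a]) for a in (0, 1)}
    a_t = int(x @ theta[1] > x @ theta[0])

    # Greedy policy, plug-in prediction, and the closed-form exploitation
    # probability 1 - kappa_t(x_t) = Phi(|gap| / sd).
    gap = x @ beta_hat[1] - x @ beta_hat[0]
    sd = np.sqrt(x @ Sigma[1] @ x + x @ Sigma[0] @ x)
    pi_t = int(gap > 0)
    mu_opt.append(x @ beta_hat[pi_t])
    exploit_prob.append(norm.cdf(abs(gap) / sd))
    match.append(int(a_t == pi_t))

    # Observe the reward and update the chosen arm's statistics.
    r = x @ beta_true[a_t] + rng.normal(scale=np.sqrt(sigma2))
    rewards.append(r)
    G[a_t] += np.outer(x, x)
    b[a_t] += r * x

terms = (np.array(match) / np.array(exploit_prob)
         * (np.array(rewards) - np.array(mu_opt)) + np.array(mu_opt))
v_hat, se = terms.mean(), terms.std(ddof=1) / np.sqrt(T)
print(f"V_hat = {v_hat:.3f}, 95% CI = ({v_hat - 1.96 * se:.3f}, {v_hat + 1.96 * se:.3f})")
```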

3. Policy Evaluation under EG#
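The \(\epsilon\)-greedy demo is the simplest case (simulated data, hypothetical names and parameters): the exploitation probability is known exactly and equals \(1-\epsilon/2\) for two arms, under the convention that the exploration step samples uniformly over both arms, so no estimation of \(\kappa_t\) is needed.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, lam, eps = 3, 2000, 1.0, 0.1
beta_true = {0: np.array([0.3, -0.2, 0.1]), 1: np.array([0.1, 0.4, -0.3])}

G = {a: lam * np.eye(d) for a in (0, 1)}     # X'X + lam * I
b = {a: np.zeros(d) for a in (0, 1)}         # X'r
match, rewards, mu_opt = [], [], []
exploit_prob = 1 - eps / 2     # known: explore w.p. eps, uniform over 2 arms

for t in range(T):
    x = rng.normal(size=d)
    beta_hat = {a: np.linalg.solve(G[a], b[a]) for a in (0, 1)}

    # Greedy (estimated optimal) action and the epsilon-greedy action.
    pi_t = int(x @ beta_hat[1] > x @ beta_hat[0])
    a_t = rng.integers(2) if rng.random() < eps else pi_t

    mu_opt.append(x @ beta_hat[pi_t])
    match.append(int(a_t == pi_t))

    # Observe the reward and update the chosen arm's statistics.
    r = x @ beta_true[a_t] + rng.normal(scale=0.5)
    rewards.append(r)
    G[a_t] += np.outer(x, x)
    b[a_t] += r * x

terms = (np.array(match) / exploit_prob
         * (np.array(rewards) - np.array(mu_opt)) + np.array(mu_opt))
v_hat, se = terms.mean(), terms.std(ddof=1) / np.sqrt(T)
print(f"V_hat = {v_hat:.3f}, 95% CI = ({v_hat - 1.96 * se:.3f}, {v_hat + 1.96 * se:.3f})")
```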

References#

[1] Cai, H., Shen, Y., & Song, R. (2021). Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. arXiv preprint arXiv:2110.15501.

[2] Chen, H., Lu, W., & Song, R. (2021). Statistical inference for online decision making: In a contextual bandit setting. Journal of the American Statistical Association, 116(533), 240-255.

[3] Chen, H., Lu, W., & Song, R. (2021). Statistical inference for online decision making via stochastic gradient descent. Journal of the American Statistical Association, 116(534), 708-719.

[4] Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.