Fitted-Q Evaluation#
The most straightforward approach to OPE is the direct method (DM). As the name suggests, methods in this category first posit a model for either the environment or the Q-function, then learn the model by treating the task as a regression (or classification) problem, and finally compute the value of the target policy via a plug-in estimator according to the definition of \(\eta^\pi\). The Q-function-based and the environment-based approaches are also referred to as model-free and model-based, respectively.
Among the many model-free DM estimators, we will focus on the most classic one, fitted-Q evaluation (FQE) [LVY19], which has been observed to perform consistently well in a large-scale empirical study [VLJY19].
Advantages:
Conceptually simple and easy to implement
Good numerical results when the model class is chosen appropriately
Appropriate application situations:
Due to its potential bias, FQE generally performs well in problems where
The model class can be chosen appropriately
Main Idea#
Q-function. The Q-function-based approach aims to directly learn the state-action value function (referred to as the Q-function)
of the policy \(\pi\) that we aim to evaluate.
The final estimator can then be constructed by plugging \(\widehat{Q}^{\pi}\) into the definition \(\eta^{\pi} = \mathbb{E}_{s \sim \mathbb{G}, a \sim \pi(\cdot|s)} Q^{\pi}(a, s)\).
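As a concrete illustration of this plug-in step, the sketch below approximates \(\eta^{\pi}\) by a Monte Carlo average over sampled initial states. It is only a sketch and not part of any package: `Q_hat`, `pi`, `S0`, and `n_actions` are hypothetical names, where `Q_hat(states, actions)` is assumed to return estimated Q-values and `pi(states)` the target policy's action-probability matrix.

```python
# A minimal plug-in sketch (illustrative only; names are hypothetical).
import numpy as np

def plug_in_value(Q_hat, pi, S0, n_actions):
    """Monte Carlo approximation of eta^pi = E_{s ~ G, a ~ pi(.|s)} Q^pi(a, s)."""
    m = len(S0)
    # Estimated Q-values of every candidate action at each sampled initial state.
    q0 = np.column_stack([Q_hat(S0, np.full(m, a)) for a in range(n_actions)])
    # Average over actions (weighted by pi) and over initial states (drawn from G).
    return np.mean(np.sum(pi(S0) * q0, axis=1))
```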
Bellman equations. The Q-learning-type evaluation is commonly based on the Bellman equation for the Q-function of a given policy \(\pi\):

\[
Q^{\pi}(a, s) = \mathbb{E} \Big[ R_t + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot|S_{t+1})} Q^{\pi}(a', S_{t+1}) \,\Big|\, A_t = a, S_t = s \Big]. \tag{1}
\]
FQE. FQE is mainly motivated by the fact that the true value function \(Q^\pi\) is the unique solution to the Bellman equation (1), whose right-hand side is a contraction mapping. Therefore, we can apply a fixed-point method: starting from an initial estimate \(\widehat{Q}^{0}\), FQE iteratively solves the following optimization problem,

\[
\widehat{Q}^{\ell} = \arg\min_{Q} \sum_{(s, a, r, s') \in \mathcal{D}} \Big\{ r + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot|s')} \widehat{Q}^{\ell-1}(a', s') - Q(a, s) \Big\}^{2},
\]

where \(\mathcal{D}\) denotes the set of observed transition tuples, for \(\ell = 1, 2, \cdots\), until convergence. The final estimator is denoted by \(\widehat{Q}_{FQE}\).
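To make the recursion concrete, here is a minimal FQE sketch; it is only an illustration, not the CausalDM implementation. It assumes the offline data has been arranged into transition arrays `S`, `A`, `R`, `S_next`, that `pi(states)` returns the target policy's action probabilities as a matrix, and it uses a scikit-learn regressor as the (hypothetical) model class.

```python
# A minimal FQE sketch (illustrative only; not the CausalDM implementation).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def one_hot_features(states, actions, n_actions):
    """Regression features: the state concatenated with a one-hot encoded action."""
    return np.hstack([states, np.eye(n_actions)[actions]])

def fqe(S, A, R, S_next, pi, n_actions, gamma=0.9, n_iter=50):
    """Fitted-Q evaluation: repeatedly regress the Bellman target on (S, A)."""
    n = len(R)
    Q = None                              # corresponds to widehat{Q}^0 = 0
    for _ in range(n_iter):
        if Q is None:
            target = R.copy()             # first target reduces to the reward
        else:
            # E_{a' ~ pi(.|s')} widehat{Q}^{l-1}(a', s') for every transition
            q_next = np.column_stack(
                [Q.predict(one_hot_features(S_next, np.full(n, a), n_actions))
                 for a in range(n_actions)]
            )
            target = R + gamma * np.sum(pi(S_next) * q_next, axis=1)
        Q = GradientBoostingRegressor().fit(one_hot_features(S, A, n_actions), target)
    return Q
```

The returned regressor plays the role of \(\widehat{Q}_{FQE}\); feeding it to the plug-in step sketched earlier, e.g. `plug_in_value(lambda s, a: Q.predict(one_hot_features(s, a, n_actions)), pi, S0, n_actions)`, yields the final value estimate.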
Demo [TODO]#
# After we publish the package, we can directly import it
# TODO: explore a more efficient way
# we can hide this cell later
import os

# Change the working directory so that the local CausalDM package can be imported.
os.getcwd()
os.chdir('..')
os.chdir('../CausalDM')
References#
- LVY19
Hoang M Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738, 2019.
- VLJY19
Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.