Fitted-Q Iteration#
Main Idea#
Q-function. The Q-function-based approach aims to directly learn the state-action value function (referred to as the Q-function) of either the policy \(\pi\) that we aim to evaluate or the optimal policy \(\pi^*\).
Bellman optimality equation. Q-learning-type policy learning is commonly based on the Bellman optimality equation, which characterizes the optimal policy \(\pi^*\). Specifically, \(Q^*\) is the unique solution of

\[
Q^*(s, a) = \mathbb{E}\big[ R_t + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a \big]. \qquad (2)
\]
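To see why the contraction property matters, the sketch below iterates the Bellman optimality backup on a small toy MDP (all numbers hypothetical) until it reaches its fixed point; at convergence the iterate satisfies the Bellman optimality equation up to numerical error.

```python
import numpy as np

# Toy MDP (hypothetical numbers): 2 states, 2 actions, known dynamics.
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s'] transition probs
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                   # R[s, a] expected rewards
              [0.0, 2.0]])

def bellman_optimality_backup(Q):
    # (T Q)(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) max_{a'} Q(s', a')
    return R + gamma * P @ Q.max(axis=1)

Q = np.zeros((n_s, n_a))
for _ in range(500):        # iterate the gamma-contraction to its fixed point
    Q = bellman_optimality_backup(Q)

# At the fixed point, Q solves the Bellman optimality equation (2).
residual = np.abs(Q - bellman_optimality_backup(Q)).max()
print(residual)             # close to machine precision
```

Because the backup is a \(\gamma\)-contraction in the sup-norm, the error shrinks by a factor of at least \(\gamma\) per iteration regardless of the starting point.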
FQI. Similar to FQE, the fitted-Q iteration (FQI) algorithm [EGW05] is also popular due to its simple form and good numerical performance. It is mainly motivated by two facts: the optimal value function \(Q^*\) is the unique solution to the Bellman optimality equation (2), and the right-hand side of (2) defines a contraction mapping. Therefore, we can apply a fixed-point method: starting from an initial estimate \(\widehat{Q}^{0}\), FQI iteratively solves the regression problem

\[
\widehat{Q}^{\ell} = \arg\min_{Q} \sum_{(s, a, r, s')} \Big( r + \gamma \max_{a'} \widehat{Q}^{\ell-1}(s', a') - Q(s, a) \Big)^2,
\]

where the sum is over the observed transitions \((s, a, r, s')\), for \(\ell = 1, 2, \cdots\), until convergence. The final estimate is denoted as \(\widehat{Q}_{FQI}\).
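The iteration above can be sketched in a few lines of numpy. Everything below is a toy illustration, not the package's implementation: the dataset, dynamics, reward, and the per-action quadratic feature map are all hypothetical stand-ins (EGW05 instead fit tree-based regressors at each iteration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset of transitions (s, a, r, s'):
# 1-d continuous states in [-1, 1] and two discrete actions.
n, n_actions, gamma = 2000, 2, 0.5
S = rng.uniform(-1, 1, size=n)
A = rng.integers(n_actions, size=n)
R = -np.abs(S) + 0.5 * (A == (S > 0))           # toy reward
S_next = np.clip(S + (2 * A - 1) * 0.1, -1, 1)  # toy dynamics

def features(s, a):
    # One quadratic feature block per action: a simple linear-in-features
    # stand-in for the tree-based regressor used in EGW05.
    phi = np.zeros((np.size(s), 3 * n_actions))
    base = np.stack([np.ones_like(s), s, s**2], axis=1)
    for act in range(n_actions):
        mask = (a == act)
        phi[mask, 3 * act:3 * act + 3] = base[mask]
    return phi

X = features(S, A)
w = np.zeros(X.shape[1])                        # initial estimate Q^0 = 0
for _ in range(50):                             # FQI iterations
    # max over a' of the previous iterate Q^{l-1}(s', a')
    q_next = np.stack([features(S_next, np.full(n, a)) @ w
                       for a in range(n_actions)], axis=1)
    y = R + gamma * q_next.max(axis=1)          # Bellman optimality targets
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # fitted-Q regression step

# Greedy policy induced by the final estimate
def greedy_action(s):
    q = [features(np.atleast_1d(s), np.array([a])) @ w
         for a in range(n_actions)]
    return int(np.argmax(q))
```

Each pass regresses the Bellman targets onto the features and feeds the new fit back into the targets; with a flexible regressor (e.g. the extremely randomized trees of EGW05) the same two-step loop applies unchanged.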
Demo [TODO]#
# After we publish the package, we can directly import it
# TODO: explore a more efficient way
# we can hide this cell later
import os

# Move into the package root only if it exists, so the cell is safe to
# re-run and does not fail when launched from a different directory.
if os.path.isdir('../CausalDM'):
    os.chdir('../CausalDM')
print(os.getcwd())
References#
- EGW05
Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.