Fitted-Q Iteration#
Main Idea#
Q-function. The Q-function-based approach aims to directly learn the state-action value function (referred to as the Q-function), either of the policy \(\pi\) that we aim to evaluate or of the optimal policy \(\pi = \pi^*\).
Bellman optimality equation. Q-learning-type policy learning is based on the Bellman optimality equation, which characterizes the optimal policy \(\pi^*\) and is commonly used in policy optimization. Specifically, \(Q^*\) is the unique solution of

\[
Q^*(s, a) \;=\; \mathbb{E} \Big[ R_t + \gamma \max_{a' \in \mathcal{A}} Q^*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a \Big], \quad \forall (s, a). \tag{2}
\]
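As a quick illustration of the fixed-point property behind (2), the sketch below repeatedly applies the Bellman optimality operator to a Q-table on a hypothetical two-state, two-action MDP until it stops changing. The transition probabilities, rewards, discount \(\gamma = 0.9\), and tolerance are made-up illustrative choices, not part of any particular example in the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[s, a, s'] = transition probability, r[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Bellman optimality operator: (TQ)(s, a) = r(s, a) + gamma * E[max_a' Q(S', a')].
    Q_new = r + gamma * P @ Q.max(axis=1)
    if np.max(np.abs(Q_new - Q)) < 1e-10:  # contraction => convergence to the fixed point Q*
        break
    Q = Q_new

print(Q)  # the fixed point satisfies Q* = TQ*
```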
FQI. Similar to FQE, the fitted-Q iteration (FQI) [EGW05] algorithm is popular due to its simple form and good numerical performance. It is mainly motivated by the fact that the optimal value function \(Q^*\) is the unique solution to the Bellman optimality equation (2), and that the right-hand side of (2) defines a contraction mapping. We can therefore apply a fixed-point method: given observed transition tuples \(\{(S_i, A_i, R_i, S_i')\}_{i=1}^{n}\) and an initial estimate \(\widehat{Q}^{0}\), FQI iteratively solves the optimization problem

\[
\widehat{Q}^{\ell} = \arg\min_{Q} \sum_{i=1}^{n} \Big( R_i + \gamma \max_{a' \in \mathcal{A}} \widehat{Q}^{\,\ell-1}(S_i', a') - Q(S_i, A_i) \Big)^2,
\]
for \(\ell=1,2,\cdots\), until convergence. The final estimate is denoted as \(\widehat{Q}_{FQI}\).
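To make the iteration concrete, the following is a minimal sketch of FQI on a batch of transitions, using a tree-based regressor in the spirit of [EGW05]. The function names (`fitted_q_iteration`, `featurize`), the one-hot action encoding, the discount \(\gamma = 0.9\), and the fixed number of iterations (in place of a formal convergence check) are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(S, A, R, S_next, n_actions, gamma=0.9, n_iter=50):
    """Minimal FQI sketch on a batch of observed transitions.

    S, S_next: (n, d) state arrays; A: (n,) integer actions; R: (n,) rewards.
    Returns the final fitted regressor approximating Q_FQI.
    """
    n = S.shape[0]

    def featurize(states, actions):
        # Regressor input: state features concatenated with a one-hot action.
        return np.hstack([states, np.eye(n_actions)[actions]])

    X = featurize(S, A)
    q_model = None
    for _ in range(n_iter):
        if q_model is None:
            # First iteration with Q^0 = 0: targets reduce to immediate rewards.
            targets = R
        else:
            # Bellman optimality target: R + gamma * max_a' Q^{l-1}(S', a').
            q_next = np.column_stack([
                q_model.predict(featurize(S_next, np.full(n, a)))
                for a in range(n_actions)
            ])
            targets = R + gamma * q_next.max(axis=1)
        # Solve the least-squares problem over a tree-based function class.
        q_model = RandomForestRegressor(n_estimators=100).fit(X, targets)
    return q_model
```

The estimated greedy policy can then be read off as \(\widehat{\pi}(s) = \arg\max_{a} \widehat{Q}_{FQI}(s, a)\), e.g. by predicting the fitted model at each candidate action and taking the maximizer.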
References#
Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.