PAC Model-free Reinforcement Learning
Abstract
For a Markov Decision Process with finite state (size <i>S</i>) and action spaces (size <i>A</i> per state), we propose a new algorithm: Delayed Q-learning. We prove it is PAC, achieving near-optimal performance on all but Õ(<i>SA</i>) timesteps while using <i>O(SA)</i> space, improving on the Õ(<i>S</i><sup>2</sup><i>A</i>) bounds of the best previous algorithms. This result proves that efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience; neither resets nor parallel sampling is used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much lower than that of previous PAC algorithms.
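The abstract does not spell out the update rule, so the following is only a minimal sketch of how a delayed, model-free update might look, not a definitive statement of the algorithm. It assumes rewards in [0, 1], a discount factor gamma, a batch size m (the number of samples collected before an update is attempted), and a tolerance eps1; the class and attribute names are hypothetical. The point it illustrates is that every table is indexed by (state, action) only, so storage grows as <i>O(SA)</i> and no transition model is kept.

```python
class DelayedQSketch:
    """Illustrative delayed-update Q-learner; names and parameters are assumptions."""

    def __init__(self, n_states, n_actions, gamma=0.9, m=5, eps1=0.05):
        self.gamma, self.m, self.eps1 = gamma, m, eps1

        def table(value):
            # One entry per (state, action) pair: S * A entries in total.
            return [[value] * n_actions for _ in range(n_states)]

        # Optimistic start: the largest possible return if rewards lie in [0, 1].
        self.Q = table(1.0 / (1.0 - gamma))
        self.U = table(0.0)             # running sum of update targets
        self.count = table(0)           # samples gathered since the last attempt
        self.last_attempt = table(0)    # timestep of the last attempted update
        self.learn = table(True)        # whether (s, a) is still gathering samples
        self.t = 0                      # current timestep
        self.t_star = 0                 # timestep of the most recent change to Q

    def act(self, s):
        # Greedy action with respect to the current Q estimates.
        row = self.Q[s]
        return max(range(len(row)), key=lambda a: row[a])

    def observe(self, s, a, r, s_next):
        """Process one transition from a single continuing thread of experience."""
        self.t += 1
        if self.learn[s][a]:
            self.U[s][a] += r + self.gamma * max(self.Q[s_next])
            self.count[s][a] += 1
            if self.count[s][a] == self.m:
                avg = self.U[s][a] / self.m
                if self.Q[s][a] - avg >= 2 * self.eps1:
                    # Successful update: lower Q to the batch average plus a bonus.
                    self.Q[s][a] = avg + self.eps1
                    self.t_star = self.t
                elif self.last_attempt[s][a] >= self.t_star:
                    # Nothing has changed since the last attempt: stop sampling.
                    self.learn[s][a] = False
                self.last_attempt[s][a] = self.t
                self.U[s][a] = 0.0
                self.count[s][a] = 0
        elif self.last_attempt[s][a] < self.t_star:
            # Some Q value changed after the last attempt: resume sampling.
            self.learn[s][a] = True

    def table_entries(self):
        # Total number of stored values across all five tables.
        tables = (self.Q, self.U, self.count, self.last_attempt, self.learn)
        return sum(len(row) for tab in tables for row in tab)
```

In this sketch each transition touches a single (state, action) entry plus one maximum over the actions of the next state, which is one way to read the claim of low per-experience computation. The value of m that a PAC guarantee would require is not given in the abstract and is left as a free parameter here.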