At the Deep Reinforcement Learning Symposium at NIPS this year I had the pleasure of shaking hands with Dimitri Bertsekas, whose work has been foundational to the mathematical theory of reinforcement learning. I still turn to Neuro-Dynamic Programming (Bertsekas and Tsitsiklis, 1996) when searching for tools to explain sample-based algorithms such as TD. Earlier this summer I read parts of Dynamic Programming and Optimal Control: Volume 2, which is full of gems that could be grown into a full-blown NIPS or ICML paper (try it yourself). You can imagine my excitement at finally meeting the legend. So when he candidly asked me, "What's deep reinforcement learning, anyway?" it was clear that "Q-learning with neural networks" wasn't going to cut it.
Here's my take on the matter. The novelty in deep reinforcement learning isn't the combination of deep networks with RL methods; one of the oft-cited early successes of RL, TD-Gammon, was all about the MLP. Nor is it really in its chief algorithms: experience replay is 25 years old. Instead, it seems to me that the distinctive characteristic of deep reinforcement learning is its emphasis on perceptually complex, messy environments such as the Arcade Learning Environment, richly textured 3D mazes, and vision-based robotics. To paraphrase Rich Sutton: deep reinforcement learning is not a solution, it's a collection of problems.
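Experience replay, the 25-year-old idea mentioned above (it goes back to Lin's early-90s work), is simple enough to fit in a few lines. Here is a minimal sketch; the class name, capacity, and toy transitions are my own illustrative choices, not any particular paper's implementation:

```python
import random
from collections import deque

# A minimal experience replay buffer: store transitions as the agent
# acts, then train on random minibatches drawn from the stored past.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        # deque with maxlen discards the oldest transitions once full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation between
        # consecutive transitions, which stabilizes learning
        return random.sample(self.buffer, batch_size)

# Illustrative usage with dummy integer states.
buf = ReplayBuffer(capacity=100)
for t in range(50):
    buf.add(t, 0, 0.0, t + 1, False)
batch = buf.sample(8)
```

The same mechanism, scaled up and paired with a target network, is a core ingredient of DQN.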
Many of the challenges posed by the characteristic environments of deep RL have roots in early reinforcement learning research. The omnipresence of partial observability in maze-like environments is driving us to study agent architectures with memory (recall Andrew McCallum's U-Tree experiments). The need for temporal abstraction, well evidenced by the enthusiasm for hierarchical reinforcement learning at NIPS this year, led to a flurry of approaches being developed in the late 90s. It's surprising, then, that the field has given relatively little attention to the classic literature, ranging from the design of the method of temporal differences as a purely predictive algorithm, to the wealth of results on exploration in Markov Decision Processes produced in the mid-2000s. It's unlikely that these prior methods, taken as-is, will scale to our new domains. Yet, as the saying goes, those who cannot remember the past are condemned to repeat it, and I think the community will benefit from a modern exposure to that literature.
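That early view of temporal differences as a purely predictive method is easy to make concrete: TD(0) estimates the value of a fixed behavior without ever selecting actions. Below is a sketch on a toy 5-state random walk with terminal states at both ends; the environment, step size, and episode count are my own illustrative choices:

```python
import random

# Tabular TD(0) prediction on a 5-state random walk.
# States 1..5 are non-terminal; 0 and 6 are terminal.
# Reward is 1 for reaching the right terminal, 0 otherwise.
N_STATES = 5
V = [0.0] * (N_STATES + 2)  # value estimates; terminals stay at 0

def run_episode(alpha=0.1, gamma=1.0):
    s = (N_STATES + 1) // 2  # start in the middle state
    while s not in (0, N_STATES + 1):
        s_next = s + random.choice([-1, 1])  # fixed random behavior
        r = 1.0 if s_next == N_STATES + 1 else 0.0
        # Bootstrap: the target uses the current estimate of the next state
        target = r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])  # update along the TD error
        s = s_next

random.seed(0)
for _ in range(5000):
    run_episode()
```

For this chain the true values are i/6 for state i, so after training V[3] should sit near 0.5, with the estimates increasing from left to right.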
The aim of this blog is really twofold: first, to take a journey through some of the less appreciated results in reinforcement learning and decision making, and present them in a more accessible way; second, to provide, when the occasion arises, simple proofs or illustrations of notions from RL folklore, filling in the gaps, if you will, left by past authors. As a prerequisite, reading the classics is highly recommended. In particular, a new edition of Sutton and Barto's seminal book is due to be released soon, replete with intuition and inspiration. And, of course, you'll find theoretical echoes of this blog in Dimitri Bertsekas's new book.
[update, 03 Feb 2019: stay tuned for a follow-up post]
Reinforcement Learning: An Introduction. Sutton and Barto (2nd edition; draft available online).
Dynamic Programming and Optimal Control: Volume 2. Bertsekas (2012).
Neuro-Dynamic Programming. Bertsekas and Tsitsiklis (1996).
Algorithms for Reinforcement Learning. Szepesvári (2010).