Reinforcement Learning (RL) is concerned with the problem of an agent trying to maximize a scalar reward signal through interaction with its environment [1]. During this process, no supervision is provided, i.e., the agent is never directly told which action is best at a given moment in time. Building on advances in deep learning, the field has celebrated successes in complex tasks such as playing various video games [2] and board games like chess and Go [3]. The Deep Q-Network of [2] is a model-free RL algorithm, i.e., no explicit function modeling the dynamics of the system is approximated. Model-based methods, in contrast, learn a dynamics model of the environment from observed data.

While model-based methods are generally considered more sample efficient, they suffer from model bias, i.e., they inherently assume that the learned dynamics model resembles the real environment sufficiently accurately [4]. Model bias is especially an issue when only a few samples and no informative prior knowledge about the task to be learned are available. Methods like PILCO [5] therefore use probabilistic function approximators such as Gaussian processes [6], which incorporate uncertainty into the model approximation, to cope with this bias and allow for more robust long-term planning. Bayesian neural networks [7] are an alternative to GP function approximators that also incorporate uncertainty but scale better with the number of available data points. Nonetheless, uncertainty estimation becomes difficult or even infeasible for harder system dynamics such as complex locomotion or grasping. For such dynamics, model-free algorithms offer the alternative of directly, and therefore only, learning the quantity of interest, i.e., (the parameters of) a policy for optimal behavior within the environment. A real-world sports analogy is baseball, where the performance of a pitcher\footnote{In baseball, the pitcher is the player who throws the ball toward the catcher while trying to retire the batter standing in between.} is judged by the actual throw of the ball rather than by their reasoning about the physical dynamics of the throw.

The methods presented in the following are policy gradient methods [8,9], a subclass of model-free RL algorithms. Policy gradient methods rely on gradient-based optimization of parametrized policies with respect to the expected return, i.e., the long-term cumulative reward. Mathematically, the RL scenario is formalized as a Markov Decision Process (MDP); a standard formulation is sketched below. Open-source implementations of such methods are provided by libraries like ChainerRL [19] and garage [20] (the successor to the rllab framework). As has been observed, state-of-the-art policy gradient methods arguably differ only partially from other competitive methods of the same algorithmic class, i.e., the methodological variance within this family is arguably low. A given method can therefore benefit from combining it with additional techniques that have been empirically shown to support learning (e.g., entropy regularization) or to improve computational efficiency (e.g., clipping). The work in [21], for instance, performs an ablation study of several such extensions that improve upon a baseline version of a model-free algorithm.
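For concreteness, a standard MDP formulation and the resulting policy gradient objective can be sketched as follows; the notation is a common convention in the spirit of [1, 8, 9] rather than a formulation taken verbatim from any single reference.

```latex
% Minimal sketch of the standard formalization (common notation, not verbatim from the cited works).
\begin{align*}
  \text{MDP:} \quad & (\mathcal{S}, \mathcal{A}, p, r, \gamma), \qquad
    p(s_{t+1} \mid s_t, a_t), \quad r_t = r(s_t, a_t), \quad \gamma \in [0, 1) \\
  \text{Objective:} \quad & J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \geq 0} \gamma^t \, r_t\right] \\
  \text{Policy gradient:} \quad & \nabla_\theta J(\theta)
    = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \geq 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right],
\end{align*}
```

where $\hat{A}_t$ is an estimate of the advantage at time step $t$ (in the simplest REINFORCE [8] case, the return from $t$ onwards).

The additional techniques mentioned above can likewise be illustrated with a minimal sketch of a clipped surrogate loss with an entropy bonus, in the spirit of PPO [11]; the function and argument names below are assumptions made for this example and are not taken from ChainerRL, garage, or any of the cited papers.

```python
# Illustrative sketch (PyTorch): clipped surrogate policy gradient loss with an
# entropy bonus, in the spirit of PPO [11]. All names are assumptions made for
# this example; this is not code from any cited library or paper.
import torch


def clipped_pg_loss(new_log_probs: torch.Tensor,
                    old_log_probs: torch.Tensor,
                    advantages: torch.Tensor,
                    entropy: torch.Tensor,
                    clip_eps: float = 0.2,
                    entropy_coef: float = 0.01) -> torch.Tensor:
    """Negative clipped surrogate objective plus entropy regularization."""
    # Probability ratio between the current and the data-collecting policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipping keeps the update close to the old policy without an explicit
    # trust-region constraint, which is cheaper than second-order methods.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()
    # The entropy bonus discourages premature collapse to a deterministic policy.
    return -(surrogate + entropy_coef * entropy.mean())
```

In practice, the advantages passed to such a loss would typically come from a learned value baseline, e.g., via generalized advantage estimation [9].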
References
- R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
- D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
- J. G. Schneider, “Exploiting model uncertainty estimates for safe dynamic control learning,” in Advances in neural information processing systems, pp. 1047–1053, 1997.
- M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on machine learning (ICML-11), pp. 465–472, 2011.
- C. E. Rasmussen, “Gaussian processes in machine learning,” in Summer School on Machine Learning, pp. 63–71, Springer, 2003.
- C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” arXiv preprint arXiv:1505.05424, 2015.
- R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
- J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, pp. 1889–1897, 2015.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning, 2014.
- T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
- M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in Advances in neural information processing systems, pp. 5048–5058, 2017.
- S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” arXiv preprint arXiv:1802.09477, 2018.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
- Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, pp. 1329–1338, 2016.
- J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, p. eaau5872, 2019.
- Y. Fujita, T. Kataoka, P. Nagarajan, and T. Ishikawa, “ChainerRL: A deep reinforcement learning library,” arXiv preprint arXiv:1912.03905, 2019.
- The garage contributors, “Garage: A toolkit for reproducible reinforcement learning research,” 2019.
- M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Footnotes
- While a policy ultimately provides an action, an alternative formulation following existing literature formalizes a stochastic policy as $\pi_\theta(a \mid s)$ or as a mapping $\pi_\theta\colon \mathcal{S} \to \Theta_{\mathcal{A}}$, where $\Theta_{\mathcal{A}}$ is the parameter space of the action distribution, e.g. a Gaussian distribution with $\theta_{\mathcal{A}} = (\mu, \sigma^2)$, where $\mu$ is the mean action and $\sigma^2$ the corresponding variance.
- In the RL literature, the advantage function $A^\pi(s, a)$ for any policy $\pi$ is commonly defined as the difference between the state-action value $Q^\pi(s, a)$ and the state value $V^\pi(s)$, that is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
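To make the two definitions above concrete, the following is a minimal sketch of a diagonal Gaussian policy that outputs the parameters of the action distribution, together with an advantage computed as the difference between a state-action value and a state value; all class and variable names are illustrative assumptions rather than code from the cited works.

```python
# Illustrative sketch (PyTorch): a diagonal Gaussian policy that outputs the
# parameters (mu, sigma) of the action distribution, plus an advantage computed
# as A(s, a) = Q(s, a) - V(s). All names here are assumptions for this example.
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, action_dim)
        )
        # State-independent log standard deviation, a common simple parametrization.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        # The network returns the action distribution, not an action directly.
        return Normal(self.mean_net(state), self.log_std.exp())


policy = GaussianPolicy(state_dim=3, action_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()                  # a ~ pi_theta(. | s)
log_prob = dist.log_prob(action).sum()  # enters the policy gradient estimator

# Advantage as in the second footnote, with placeholder critic outputs.
q_value, v_value = torch.tensor(1.3), torch.tensor(1.0)
advantage = q_value - v_value
```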