Reinforcement Learning (RL) is concerned with the problem of an agent trying to maximize a scalar reward signal through interaction with its environment [1]. No supervision is provided during this process, i.e., the agent is never directly told the best action at a given moment in time. Building on advances in Deep Learning, the field has celebrated success in complex tasks, e.g., playing various video games [2] and board games such as chess and Go [3]. The Deep Q-Network (DQN) used by [2] is considered a model-free RL algorithm, i.e., no explicit function modeling the dynamics of the system is approximated. Model-based methods, on the other hand, learn a dynamics model of the environment from observed data. While model-based methods are arguably more sample efficient, they suffer from model bias, i.e., they inherently assume that the learned dynamics model resembles the real environment sufficiently accurately [4]. Model bias is especially an issue when only few samples and no informative prior knowledge about the task to be learned are available. Therefore, methods like [5] that use probabilistic function approximators such as Gaussian Processes (GPs) [6], which can incorporate uncertainty into the model approximation, can cope with this bias and allow for more robust long-term planning. Bayesian Neural Networks [7] are an alternative to GP function approximators that also incorporate uncertainty but scale better with the number of available data points. Nonetheless, uncertainty estimation becomes difficult or even infeasible for harder system dynamics such as complex locomotion or grasping. For such dynamics, model-free algorithms offer the alternative of directly, and therefore only, learning the quantity of interest, i.e., (the parameters of) a policy for optimal behavior within the environment. A real-world sports analogy is baseball, where the performance of a pitcher\footnote{In baseball, the pitcher is the player who throws the ball towards the catcher while trying to retire the batter standing in between.} is judged by the actual throw of the ball rather than by their reasoning over the physical dynamics of the throw.

The methods presented in the following are policy gradient methods [8,9], a subclass of model-free RL algorithms. Policy gradient methods rely on gradient-based optimization of parametrized policies with respect to the expected return, i.e., the long-term cumulative reward. Mathematically, the RL scenario is formalized as a Markov Decision Process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, D, R)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $D: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the dynamics function of the system, and $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the scalar reward signal. The optimization problem of finding the optimal policy $\pi_{\boldsymbol{\theta}^{*}}: \mathcal{S} \to \mathcal{A}$ is formalized as

$$\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} \, \mathbb{E}\left[\sum_{t=0}^{\infty} r_t\right] \tag{1}$$

where $r_t = R(s_t, a_t = \pi_{\boldsymbol{\theta}}(s_t))$, with $s \in \mathcal{S}$ and $a \in \mathcal{A}$ being the state and the agent's action, respectively. Policy gradient methods try to solve the problem formulated in Equation 1 through estimation of the gradient $g := \nabla_{\boldsymbol{\theta}} \mathbb{E}\left[\sum_{t=0}^{\infty} r_t\right]$. The policy gradient methods presented in this section can formally be generalized to a policy gradient $g$ that satisfies

$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \phi_t \, \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)\right] \tag{2}$$

where, with a slight abuse of notation, $\pi_{\boldsymbol{\theta}}(a_t \mid s_t)$ refers to the probability$^{1}$ of taking action $a_t$ from state $s_t$ at timestep $t \in \mathbb{N}$, and $\phi_t$ can take on various forms, e.g., $\phi_t = \sum_{t=0}^{\infty} r_t$ corresponds to the vanilla policy gradient, i.e., REINFORCE [8].
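To make Equation 2 concrete, the following minimal sketch estimates the REINFORCE gradient from a single sampled trajectory, using the total trajectory return as $\phi_t$. The linear softmax policy over discrete actions, the toy trajectory, and all function and variable names are illustrative assumptions, not part of any of the cited methods or libraries.

```python
import numpy as np

def grad_log_softmax_policy(theta, s, a):
    """Gradient of log pi_theta(a|s) for a linear softmax policy.

    theta: (n_actions, n_features) weight matrix
    s:     (n_features,) state feature vector
    a:     index of the action that was taken
    """
    logits = theta @ s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -np.outer(probs, s)   # -pi(a'|s) * s for every action a'
    grad[a] += s                 # +s for the action actually taken
    return grad

def reinforce_gradient(theta, trajectory):
    """Monte Carlo estimate of g = E[sum_t phi_t * grad log pi(a_t|s_t)],
    with phi_t chosen as the (undiscounted) total return of the trajectory."""
    phi = sum(r for _, _, r in trajectory)   # total return of the trajectory
    g = np.zeros_like(theta)
    for s, a, _ in trajectory:
        g += phi * grad_log_softmax_policy(theta, s, a)
    return g

# Usage with a made-up 2-action, 3-feature trajectory of (state, action, reward).
theta = np.zeros((2, 3))
trajectory = [(np.array([1.0, 0.0, 0.5]), 0, 1.0),
              (np.array([0.0, 1.0, 0.5]), 1, 0.0)]
theta += 1e-2 * reinforce_gradient(theta, trajectory)   # one gradient ascent step
```

In practice this estimate is averaged over many trajectories, which is exactly where the variance issue discussed next becomes apparent.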
The vanilla policy gradient is arguably the simplest form of a policy gradient method, and its main disadvantage stems from the high variance of the gradient estimate. Subsequently developed methods based on the policy gradient theorem have therefore introduced mathematical structures that help control this variance. In the following, arguably state-of-the-art policy gradient methods, which will be used for the systematic experimentation phase presented in the following chapter, are introduced in more detail.

Trust Region Policy Optimization (TRPO) [10] improves training stability by restricting how much the policy may change during each update step. This optimization constraint is enforced by introducing a trust region, formalized via a Kullback-Leibler (KL) divergence between the previous policy and the new candidate policy. That is, Equation 1 is adapted and extended to

$$\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} \, J_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}) \quad \text{subject to} \quad D_{\mathrm{KL}}(\boldsymbol{\theta}_{\text{old}}, \boldsymbol{\theta}) \leq \delta \tag{3}$$

where $\delta \in \mathbb{R}$ and $J := \mathbb{E}\left[\psi_t(\boldsymbol{\theta}) A_t\right]$, with $A_t$ being an advantage function$^{2}$ estimate and

$$\psi_t(\boldsymbol{\theta}) = \frac{\pi_{\boldsymbol{\theta}}(a_t \mid s_t)}{\pi_{\boldsymbol{\theta}_{\text{old}}}(a_t \mid s_t)} \tag{4}$$

being the probability ratio for an action $a_t$ under the two considered policies; $J$ thus represents a (surrogate) objective function rather than, as previously, the expected return alone. The authors further prove monotonic improvement for an algorithm that repeatedly optimizes a local approximation to the expected return of the policy with a KL divergence penalty, and show that their practical approximation to this principle, which incorporates the KL divergence as a constraint instead, achieves good empirical results.

Proximal Policy Optimization (PPO) [11] simplifies the notion introduced by TRPO by reformulating the objective so as to directly control the range of values the ratio $\psi$ can take on, i.e.,

$$J(\boldsymbol{\theta}) = \mathbb{E}\left[\min\left(\psi_t(\boldsymbol{\theta}) A_t,\; c_{\epsilon}(\psi_t(\boldsymbol{\theta})) A_t\right)\right] \tag{5}$$

where $c_{\epsilon}: \mathbb{R} \to [1-\epsilon, 1+\epsilon]$ is a clipping function with $\epsilon \in \mathbb{R}$. Empirically, PPO is arguably more sample efficient than TRPO, while also being the simpler method. TRPO and PPO belong to the same family of policy gradient algorithms and both implement stochastic policies, i.e., the policy is a function that outputs the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ of a Gaussian distribution (i.e., $a \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$). Actions are thus sampled from that distribution at runtime (or, alternatively, the mean of the distribution is simply used).

Another family of policy gradient methods makes use of deterministic policies [12]. Deterministic policies can be seen as a special case of the more general stochastic setting. Since stochasticity is useful or even crucial for exploration during training, deterministic policies need additional mechanisms to ensure adequate exploration. However, their simpler form allows for higher computational efficiency. The Deep Deterministic Policy Gradient (DDPG) [13] method belongs to this family of algorithms. It combines concepts from deterministic policy gradients with DQN [2], which helps stabilize training through target network updates and experience replay. Intuitively, experience replay corresponds to a task memory or history of the agent's past interactions from which it samples during learning. Improving the experience replay procedure, i.e., changing the way information is stored and accessed, can significantly change the task performance of the agent, as has been shown by [14]; a minimal sketch of such a buffer is given below. Another method in the family of deterministic policy gradients that tries to improve upon concepts from DDPG is Twin Delayed DDPG (TD3) [15].
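As referenced above, the following minimal sketch illustrates the task-memory intuition behind experience replay: transitions are stored in a fixed-capacity buffer and sampled uniformly at random during learning. The capacity, the transition format, and all names are illustrative assumptions, not the buffer of any cited implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of past transitions with uniform sampling."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are discarded once the capacity is exceeded.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; approaches such as [14] change exactly this step
        # by altering how transitions are stored and accessed.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: store transitions while interacting, sample mini-batches for updates.
buffer = ReplayBuffer(capacity=1000)
for t in range(64):
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=32)
```

Because updates are computed from this stored history rather than only from transitions of the current policy, such a buffer is the practical ingredient that lets DDPG, TD3, and the later-discussed SAC operate off-policy.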
The TD3 authors argue that overestimation of the value function, which guides the learning of optimal policies, leads to sub-optimal policies. Since they further conclude that this issue can be identified within training procedures, they introduce improvements via clipping of target values, less frequent updates of the policy compared to the value function, and smoothing (regularization) of the target policy.

Another actor-critic method, Soft Actor-Critic (SAC) [16], alters the original objective of Equation 1 by introducing a maximum entropy regularization, i.e., the objective becomes

$$J(\boldsymbol{\theta}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \left( r_t + \alpha \, \mathcal{H}\big(\pi_{\boldsymbol{\theta}}(\cdot \mid s_t)\big) \right)\right] \tag{6}$$

where $\mathcal{H}$ is the entropy measure and the free parameter $\alpha$ is often referred to as the temperature. The introduction of the entropy term leads to higher exploration (a numerical sketch of this entropy bonus is given at the end of this section). Like DDPG and TD3, SAC is an off-policy algorithm, i.e., it estimates the value function using actions generated not by the current policy but by a separate one; unlike them, however, it uses stochastic policies, as the on-policy methods TRPO and PPO do.

The six algorithms presented in this section have been widely applied and benchmarked on various RL tasks [17], including robotics [18]. Open-source implementations of these methods are provided by the Python libraries ChainerRL [19] and garage [20] (the successor to the rllab framework). As has been observed, each of the mentioned state-of-the-art methods arguably differs only partially from any other competitive method of the same algorithmic class, i.e., the methodological variance within the family of policy gradient methods is arguably low. A method can therefore benefit from combining additional techniques that have empirically been shown to support learning (e.g., entropy regularization) or to improve computational efficiency (e.g., clipping). The work in [21] performs an ablation study on various such factors that improve upon baseline versions of the policy gradient class of model-free algorithms.
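As referenced above, the following sketch gives a small numerical illustration of the entropy bonus in Equation 6 for a diagonal Gaussian policy, whose entropy is available in closed form. The toy rewards, the log standard deviations, and all names are illustrative assumptions; actual SAC implementations additionally learn soft value functions and often adjust the temperature $\alpha$ automatically.

```python
import numpy as np

def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian with standard deviations
    exp(log_std): H = sum_i (log sigma_i + 0.5 * log(2 * pi * e))."""
    return np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e))

def soft_return(rewards, log_stds, alpha=0.2):
    """sum_t ( r_t + alpha * H(pi_theta(.|s_t)) ) for one sampled trajectory."""
    return sum(r + alpha * gaussian_entropy(ls)
               for r, ls in zip(rewards, log_stds))

# Two timesteps of a 2-dimensional action space: the wider policy at the second
# step (larger log-std) receives a larger entropy bonus, rewarding exploration.
rewards = [1.0, 0.5]
log_stds = [np.array([-0.5, -0.5]), np.array([0.0, 0.0])]
print(soft_return(rewards, log_stds))
```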

References

Footnotes

  1. While a policy ultimately provides an action, an alternative formulation following existing literature formalizes a stochastic policy as $\pi_{\boldsymbol{\theta}}: \mathcal{S} \times \mathcal{A} \to [0,1]$ or $\pi_{\boldsymbol{\theta}}: \mathcal{S} \to \Psi$, where $\Psi$ is the parameter space of the action distribution, e.g., a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\psi}}, \boldsymbol{\Sigma}_{\boldsymbol{\psi}})$ with $\boldsymbol{\psi} \in \Psi$, where $\boldsymbol{\mu}_{\boldsymbol{\psi}}$ is the mean action and $\boldsymbol{\Sigma}_{\boldsymbol{\psi}}$ the corresponding covariance.

  2. In RL literature, the advantage function $A: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ for any policy $\pi_{\boldsymbol{\theta}}$ is commonly defined as the difference between the state-action value $Q(s,a)$ and the state value $V(s)$, that is, $A := Q - V$.