Policy Gradient Methods and Proximal Policy Optimization Algorithms

Original paper · Schulman et al., 2017

PPO has become a staple of policy gradient methods in reinforcement learning over the past couple of years and, given its usage as a feedback-loop optimizer in modern LLMs, I've been meaning to dive into it for some time now. I'm quite familiar with RL from uni courses/projects, but unfortunately only the deterministic Q-learning side, with very little focus on policy gradient methods. This post draws on a few different RL resources in combination with the original PPO paper, all of which I thoroughly recommend:

I'm going to cover some general Deep RL topics that I've come across in my reading, then gradually move towards policy gradient methods and finally PPO-specific thoughts.

Model-free vs Model-based RL

The main branching point for an RL algorithm is whether the agent has access to (or learns) a model of its environment. In this case a model is a function that predicts the environment's state transitions and rewards. An agent that has access to such a model can see large benefits by thinking ahead and distilling the results of planning into a learned policy; a famous example of this is AlphaZero. However, a ground-truth model of the environment is hard to come by, and instead such a model usually has to be inferred from experience. This means the agent has to avoid exploiting any bias in the learned model, otherwise it will perform terribly when placed in the real environment. The alternative to this approach is model-free RL algorithms, which tend to be easier to tune and implement but forgo the potential gains in sample efficiency.

Q-Learning vs Policy Optimization

When it can be applied (given its deterministic nature and inability to learn stochastic policies), Q-learning is substantially more sample efficient than policy optimization. It gains a lot of its strength from being able to reuse data more efficiently than policy optimization can. However, policy optimization methods have seen more practical usage because they handle continuous actions more naturally, and because we directly optimize for the thing we want rather than implicitly improving agent performance through an action-value function.

Policy gradients

In its simplest form, a policy gradient method is an algorithm to maximize the expected return, where the return is the cumulative reward over a trajectory. We achieve this by iteratively optimizing our policy through gradient ascent

$$\theta_{k+1} = \theta_k + \alpha \left. \nabla_{\theta} J(\pi_{\theta}) \right|_{\theta_k}.$$

We can imagine this as pushing the policy in the direction which maximizes the expected return. Note that this requires gradient ascent as opposed to descent, since we are maximizing. The gradient of policy performance, $\nabla_{\theta} J(\pi_{\theta})$, is called the policy gradient, and algorithms that optimize the policy this way are called policy gradient algorithms. Now we just need a way to represent the policy gradient numerically, which means deriving an analytical expression for the gradient of policy performance. This turns out to be an expectation over all possible state-action trajectories, but because that is infeasible to compute, we instead approximate it by letting the agent perform a number of episodes and forming a sample estimate. This is the foundation of a policy gradient algorithm.
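As a rough sketch of what that sample estimate looks like in practice, here is one gradient-ascent step on a batch of collected samples, written in PyTorch. The names (`policy_net`, `optimizer`, the batch tensors) are illustrative assumptions rather than anything from the paper, and a discrete action space is assumed:

```python
import torch
from torch.distributions import Categorical

# Minimal sketch of one policy-gradient ascent step (REINFORCE-style),
# assuming a discrete action space. `policy_net` maps observations to action
# logits; `obs`, `actions`, `returns` are tensors collected from sampled episodes.

def policy_gradient_step(policy_net, optimizer, obs, actions, returns):
    logits = policy_net(obs)                                   # [N, num_actions]
    log_probs = Categorical(logits=logits).log_prob(actions)   # log pi_theta(a_t | s_t)

    # Sample estimate of the policy gradient: mean of grad-log-prob * return.
    # We minimize the negative so that the optimizer step is gradient *ascent* on J.
    loss = -(log_probs * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```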

Baselines in policy gradients

The Spinning Up series from OpenAI broadened my understanding of how to formulate the policy gradient. In its general form the policy gradient is expressed as

$$\nabla_{\theta} J(\pi_{\theta}) = E_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \Phi_t\right]$$

where $\Phi_t$ could take many different forms involving $R(\tau)$. In the reward-to-go policy gradient we make sure to only reinforce actions on the basis of their consequences, rather than the sum of all rewards ever obtained. In addition we can introduce a baseline function (importantly, one not dependent on the action), which reduces the variance of the estimate without biasing it.
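To make the reward-to-go weights and the baseline concrete, here is a hedged sketch in NumPy for a single trajectory. The function names are mine, and the assumption that `baselines` comes from some learned value function is an illustration, not a prescription:

```python
import numpy as np

# Sketch of two choices of Phi_t for a single trajectory: the reward-to-go
# weights, and the same weights with a (state-dependent) baseline subtracted.
# `baselines` would come from e.g. a learned value function.

def reward_to_go(rewards, gamma=0.99):
    # Phi_t = sum over t' >= t of gamma^(t'-t) * r_t':
    # each action is only credited with the rewards that come after it.
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def reward_to_go_with_baseline(rewards, baselines, gamma=0.99):
    # Subtracting a baseline that does not depend on the action leaves the
    # gradient estimate unbiased but can greatly reduce its variance.
    return reward_to_go(rewards, gamma) - np.asarray(baselines)
```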

PPO

In reference to the section above, PPO employs the value function as its baseline in the policy gradient. In addition to this it provides two major contributions that are of interest to me: the Clipped Surrogate Objective, and the use of multiple epochs of stochastic gradient ascent to perform each policy update.

TRPO is an RL algorithm which stabilizes training by limiting how much the policy can change at each gradient step. To implement this, TRPO added a bunch of bells and whistles, all to guarantee a monotonic improvement of its policy. PPO aims to build these stabilizing properties directly into the objective function. It does this with a clipping function that only credits probability-ratio changes within 1 ± ε of the old policy, so the objective stops rewarding improvements beyond that range, while still allowing the policy to completely revert "backwards" if the previous policy update resulted in a worse policy.
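Below is a minimal sketch of what the clipped objective might look like in PyTorch, written as a loss to be minimized so that a standard optimizer performs ascent on the surrogate. The function name and tensor shapes are assumptions; the `clip_eps` default of 0.2 follows the value suggested in the paper:

```python
import torch

# Hedged sketch of the Clipped Surrogate Objective (L^CLIP), written as a loss
# to minimize. `log_probs` come from the current policy, `old_log_probs` from
# the data-collecting policy (no gradient), `advantages` are advantage estimates.

def clipped_surrogate_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs - old_log_probs)     # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The min makes this a pessimistic bound: the objective stops rewarding
    # ratio changes outside [1 - eps, 1 + eps], but a step that made the
    # policy worse is still penalized in full, so it can be reverted.
    return -torch.min(unclipped, clipped).mean()     # negate so minimizing == ascent
```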

Thanks to this Clipped Surrogate Objective we are also able to optimize over multiple passes of the data, reducing data inefficiency, which is a big issue in RL. For vanilla policy gradient methods this usually fails because they take steps that are too large on the local samples and wreck the policy. With the clipped surrogate objective, however, we run our policy on N different actors, generating N trajectories. We then split these into minibatches and train for K epochs over the samples. In the beginning our policy will be equal to the old policy, so the ratio is 1, nothing is clipped, and we are guaranteed to learn something from these samples. As we update the policy over the epochs, the objective starts hitting the clip limits and the gradients for those samples go to zero, effectively stopping the update.
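Putting the pieces together, a rough sketch of that multi-epoch, minibatched update might look like the following. It reuses the `clipped_surrogate_loss` sketch from above; `policy_net`, `optimizer`, and the batch tensors are again illustrative assumptions, and a real implementation would also fit the value function and typically add an entropy bonus, as the paper's combined objective does:

```python
import torch
from torch.distributions import Categorical

# Rough sketch of the multi-epoch, minibatched PPO update on one batch of
# collected data, reusing clipped_surrogate_loss from the previous sketch.

def ppo_update(policy_net, optimizer, obs, actions, old_log_probs, advantages,
               epochs=4, minibatch_size=64):
    n = obs.shape[0]
    for _ in range(epochs):                            # K epochs over the same samples
        perm = torch.randperm(n)
        for start in range(0, n, minibatch_size):
            idx = perm[start:start + minibatch_size]
            dist = Categorical(logits=policy_net(obs[idx]))
            log_probs = dist.log_prob(actions[idx])
            loss = clipped_surrogate_loss(log_probs, old_log_probs[idx], advantages[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # On the first pass the ratio is ~1 (new policy == old policy), so nothing
    # is clipped; as the policy drifts over the epochs, more samples hit the
    # clip range and contribute zero gradient, which caps the overall update.
```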