To improve the policy in a reinforcement learning setup, it is customary to apply policy gradient methods, whose usual variants are online and on-policy. To take advantage of off-policy data as well, the paper seeks to combine online policy gradient calculations with off-policy Q-learning. Drawing inspiration from the replay buffer, the proposed technique makes a connection between the Q-values estimated from the action preferences of the policy and the regularized fixed points of policy gradient methods. Known as Policy Gradient and Q-learning (PGQL), this technique has been tested on the full suite of Atari games and benchmarked against Asynchronous Advantage Actor-Critic (A3C) and Q-learning. The numerical results reported in the paper demonstrate both improved data efficiency and more stable performance.
Prior to this work, most model-free reinforcement learning algorithms fell into one of two families: 1) action-value fitting, where the Q-values are fitted, and 2) policy gradient, where the policy is represented explicitly and improved by updating it in the direction of the gradient of expected performance.
Q-learning is a model-free method that finds the Q-values of the optimal policy without changing the policy used to generate the trajectories, where each Q-value represents the “quality”, or expected return, of a given state-action pair. This makes it off-policy: learning can be performed after the fact, independently of the agent that made the original policy decisions. Policy gradient methods, by contrast, are normally implemented on-policy, as they require an estimate of the action-values of the current policy.
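To make the distinction concrete, here is a minimal sketch of the tabular Q-learning update; the learning rate and discount factor values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """One off-policy Q-learning update on a tabular Q array.

    Q[s, a] moves toward the bootstrapped target r + gamma * max_b Q[s_next, b],
    regardless of which policy actually selected action a.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += lr * (target - Q[s, a])
    return Q
```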
This paper combines policy gradient and Q-learning, but doing so requires developing a formal connection between the two. First, the authors derive a link between the fixed point of the policy gradient algorithm and its Q-values. Then they show that any regularized actor-critic method (which includes policy gradient) can be seen as an action-value fitting method when the Q-values are parametrized in a certain way.
We begin with some definitions. Given a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$ per state, a policy $\pi$ maps each state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$ to $\pi(s, a)$, the probability of choosing that action in that state. The goal is to maximize the expected discounted return, where the discount factor $\gamma \in [0, 1)$ sets how much the agent will prioritize long-term over short-term rewards. We will also make use of the Bellman operator,

$$(\mathcal{T}^{\pi} Q)(s, a) = \mathbb{E}\left[ r(s, a) + \gamma\, Q(s', b) \right],$$

where the expectation is taken over the next state $s'$, the reward $r(s, a)$, and the action $b$ drawn from the policy $\pi(s', \cdot)$.
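As an illustration, the following sketch applies this Bellman operator to a tabular Q-function; the transition tensor P, reward matrix R, and policy matrix pi are hypothetical inputs introduced here for demonstration:

```python
import numpy as np

def bellman_operator(Q, P, R, pi, gamma=0.99):
    """Apply the Bellman operator T^pi to a tabular Q-function.

    Q:  (S, A) array of current Q-values
    P:  (S, A, S) array of transition probabilities P(s' | s, a)
    R:  (S, A) array of expected rewards r(s, a)
    pi: (S, A) array of action probabilities pi(s, a)
    """
    # Expected value of each next state under pi: sum_b pi(s', b) * Q(s', b)
    V_next = (pi * Q).sum(axis=1)        # shape (S,)
    # (T^pi Q)(s, a) = r(s, a) + gamma * E_{s'}[ V_next(s') ]
    return R + gamma * P @ V_next        # shape (S, A)
```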
The authors show that the error between the true Q-values $Q^{\pi}$ and the Q-values $\tilde{Q}^{\pi}$ estimated from following a regularized policy gradient is orthogonal to the gradient of the policy parametrization, and thus $\tilde{Q}^{\pi}$ is a locally optimal estimate of the Q-values. This is beneficial because it means that following a regularized policy gradient implicitly approximates the optimal Q-values.
Separately, the authors show that actor-critic methods can be interpreted as action-value fitting methods: choosing a suitable parametrization leads to identical updates in both approaches, and for a small regularization penalty $\alpha$, the Bellman residuals are sufficiently small that the Q-values converge to the fixed point of the Bellman equation, which corresponds to the optimal Q-values.
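Concretely, the connection rests on the paper's parametrization, restated here in simplified notation, where $H^{\pi}(s)$ is the entropy of the policy at state $s$ and $V(s)$ is a state-value baseline. At the regularized fixed point the policy is a Boltzmann distribution over its own induced Q-value estimates:

$$\tilde{Q}^{\pi}(s, a) = \alpha \left( \log \pi(s, a) + H^{\pi}(s) \right) + V(s), \qquad \pi(s, a) \propto \exp\!\left( \tilde{Q}^{\pi}(s, a) / \alpha \right).$$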
The main contribution of the work is the hybrid policy gradient and Q-learning technique, called “PGQL”. The method combines the two types of reinforcement learning: action-value fitting (Q-learning) and the policy gradient technique. In particular, O’Donoghue et al. demonstrate that PGQL can outperform both the Asynchronous Advantage Actor-Critic (A3C) algorithm, developed by Google DeepMind in 2016, and Q-learning.
For context, A3C is a deep reinforcement learning technique that, unlike Q-learning, uses multiple agents. Each agent asynchronously interacts with its own copy of the environment using its own copy of the network parameters, periodically synchronizing with a global network. As an actor-critic method, A3C combines the strengths of action-value fitting and policy gradient methods: simply put, it predicts both a value function (how good a certain state is) and a policy (the probability of each action).
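As a rough illustration of that combination, the following sketch computes an A3C-style objective for one rollout; it is a minimal intuition aid, not DeepMind's implementation, and the entropy weight is an assumed setting:

```python
import numpy as np

def a3c_loss(logits, values, actions, returns, beta=0.01):
    """A3C-style objective for one rollout, returned as a scalar.

    logits:  (T, A) policy logits at each step
    values:  (T,)   value predictions V(s_t)
    actions: (T,)   actions actually taken
    returns: (T,)   n-step discounted returns R_t
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                  # softmax policy
    log_pi = np.log(probs[np.arange(len(actions)), actions])   # log pi(a_t | s_t)
    advantage = returns - values                               # how much better than expected
    policy_loss = -(log_pi * advantage).mean()   # policy gradient term
    value_loss = 0.5 * (advantage ** 2).mean()   # value regression term
    entropy = -(probs * np.log(probs)).sum(axis=1).mean()
    return policy_loss + value_loss - beta * entropy  # entropy bonus aids exploration
```

In practice the advantage is treated as a constant when differentiating the policy term; the scalar form above is only meant to show the three ingredients.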
The paper refers to the Bellman equation, which specifies how the agent can update its state-value estimate by combining the reward at its current state $s$ with the state-value of the next state $s'$.
It is also shown that at the fixed point, the Bellman residuals will be small for a small regularization penalty parameter $\alpha$, which plays a role analogous to the learning rate in a gradient descent algorithm. To exploit this, O’Donoghue et al. combine two updates to the policy: a regularized policy gradient update and a Q-learning update. According to the authors, the combined update can be interpreted as an actor-critic algorithm with an optimizing critic, which therefore minimizes the Bellman residuals.
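Schematically, the combined PGQL update is a weighted mixture of the two updates; this restates the paper's update in simplified notation, with $\eta \in [0, 1]$ weighting the Q-learning term and $\mathcal{T}^{*}$ denoting the optimal Bellman operator:

$$\Delta\theta \;\propto\; (1 - \eta)\, \Delta\theta_{\text{actor-critic}} \;+\; \eta\, \mathbb{E}\!\left[ \left( \mathcal{T}^{*} \tilde{Q} - \tilde{Q} \right) \nabla_{\theta}\, \tilde{Q} \right].$$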
The authors propose the following implementation scheme. One or more agents interact with the environment and perform on-policy updates given the states visited and the rewards received there; the agent(s) use an actor-critic algorithm to update shared parameters. As an agent receives new data from the environment, it stores it in a shared memory buffer. Samples drawn from this buffer then allow a separate learner to update the same policy parameters off-policy (the Q-learning step), as sketched below.
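Below is a minimal single-threaded sketch of that scheme. The environment API and the helpers `sample_action`, `actor_critic_update`, and `q_learning_update` are hypothetical placeholders introduced for illustration, not the paper's code:

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # shared memory buffer

def pgql_loop(env, params, num_steps, batch_size=32):
    s = env.reset()
    for _ in range(num_steps):
        # --- on-policy phase: act, then apply the actor-critic update ---
        a = sample_action(params, s)                 # a ~ pi(s, .)  (placeholder)
        s_next, r, done = env.step(a)                # assumed env interface
        actor_critic_update(params, s, a, r, s_next)
        replay_buffer.append((s, a, r, s_next, done))

        # --- off-policy phase: Q-learning on samples from the buffer ---
        if len(replay_buffer) >= batch_size:
            batch = random.sample(replay_buffer, batch_size)
            q_learning_update(params, batch)         # fits Q~ derived from pi and V

        s = env.reset() if done else s_next
```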
There are two main advantages to this approach:
First, the critic can accumulate the Monte Carlo return over many time periods, so the influence of future rewards can be distributed backwards in time. Second, the memory buffer can prioritize and replay the most important past experiences. Q-learning from replayed, prioritized samples can also act as a regularizer, helping ensure that the policy still approximately satisfies the Bellman equation.
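One simple way to realize such prioritization (a generic sketch, not the paper's specific scheme) is to sample transitions with probability proportional to a stored priority, for example the magnitude of each transition's last Bellman residual:

```python
import numpy as np

def sample_prioritized(transitions, priorities, batch_size=32):
    """Sample transitions with probability proportional to priority.

    transitions: list of (s, a, r, s_next, done) tuples
    priorities:  nonnegative scores, e.g. |Bellman residual| per transition
    """
    p = np.asarray(priorities, dtype=float) + 1e-6   # avoid zero probabilities
    p /= p.sum()
    idx = np.random.choice(len(transitions), size=batch_size, p=p)
    return [transitions[i] for i in idx]
```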
Experiments: Grid World and Atari
The work discusses the PGQL implementation in two settings: Grid World and Atari.
First, Grid World is a “toy” 4 by 6 grid environment, shown in panel (a) of the image below. The agent always begins on the square labelled “S”. The game ends when the agent reaches “T”, where it receives a reward of 1.
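A minimal version of such an environment might look as follows; the start and terminal positions are illustrative assumptions, since the text only specifies the 4 by 6 size, the start square “S”, and the terminal reward of 1:

```python
class GridWorld:
    """Toy 4x6 grid: start at S, episode ends with reward 1 at T."""
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, start=(3, 0), terminal=(0, 5)):
        self.shape = (4, 6)
        self.start, self.terminal = start, terminal

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        dr, dc = self.MOVES[action]
        # Clamp the move so the agent stays inside the grid.
        r = min(max(self.pos[0] + dr, 0), self.shape[0] - 1)
        c = min(max(self.pos[1] + dc, 0), self.shape[1] - 1)
        self.pos = (r, c)
        done = self.pos == self.terminal
        return self.pos, (1.0 if done else 0.0), done
```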
In addition to PGQL, O’Donoghue et al. let two other agents, a TD actor-critic and Q-learning, learn the same grid world (pictured above). Each agent always started at “S” and tried to learn to reach the end point “T”. Plot (b) shows the true expected performance of each policy. The algorithms work as follows: the actor-critic uses a value function to generate the estimate of the Q-values that serves as the critic, while Q-learning uses data from a buffer of prior experience to update its parametrized Q-values at each step. The experiment found that the actor-critic agent failed to learn a path to “T” at step sizes larger than 1, whereas both PGQL and Q-learning continued to learn as the step size increased. O’Donoghue et al. argue that this observation can be explained by the stabilizing effect of using replay. Overall, PGQL clearly outperforms the two other algorithms.
Second, the authors used a neural network to parametrize the policy and performed testing on the full suite of Atari benchmarks, consisting of 57 different games. The figure below demonstrates how O’Donoghue et al. designed the policy network. PGQL’s architecture is similar to A3C’s, but is augmented with an additional, parameterless final layer that outputs the Q-value estimates. Apart from this layer, the authors chose an architecture and parameters identical to A3C’s.
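The parameterless layer can be expressed directly in terms of quantities the network already outputs. The following sketch uses the identity introduced earlier, with numpy standing in for the actual framework and the value of alpha as an assumed setting:

```python
import numpy as np

def q_estimate(logits, value, alpha=0.1):
    """Parameterless Q head: Q~(s, a) = alpha * (log pi(s, a) + H(s)) + V(s).

    logits: (A,) policy logits for state s
    value:  scalar value prediction V(s)
    alpha:  entropy regularization penalty (illustrative setting)
    """
    log_pi = logits - logits.max()
    log_pi -= np.log(np.exp(log_pi).sum())          # numerically stable log-softmax
    entropy = -(np.exp(log_pi) * log_pi).sum()      # H(s)
    return alpha * (log_pi + entropy) + value       # no new parameters needed
```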
Methods compared: PGQL, the Asynchronous Advantage Actor-Critic (A3C) model, and a variant of asynchronous deep Q-learning. For each method, identical hyperparameters were used across all 57 games.
Overall, PGQL performed best in 34 games (60%), A3C performed best in 7 games (12%), and Q-learning was best in 10 games (18%). In the remaining 6 games (10%), two or more methods tied.
In Tables 1 and 2, the authors give the mean and median normalized scores, expressed as a percentage of an expert human score, across all games for each tested algorithm, from random-start and human-start conditions respectively.
As shown below, in both cases PGQL has both the highest mean and the highest median, and its median score exceeds 100%, the human performance threshold.
In Figure 3, we see some games where PGQL was the best performer and had far better data efficiency (reaching high scores in fewer steps) than the other methods.
In Figure 4, we see some games where PGQL under-performed. In these cases, although PGQL had better data efficiency early on, its performance later saturated or collapsed.
The authors hypothesize that this could be due to the policy reaching a local optimum or overfitting to early data, and that PGQL might perform better after hyperparameter tuning.
The authors have successfully combined the A3C architecture, trained on on-policy inputs, with an additional final layer performing Q-learning from stored experience. The resulting method is called PGQL, for policy gradient and Q-learning.
PGQL shows better data efficiency and more stable results on a suite of Atari games when compared to A3C or Q-learning alone.
Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu, Volodymyr Mnih. PGQ: Combining policy gradient and Q-learning. CoRR abs/1611.01626 (2016).