This blog post is the collective work of the participants of the “RL” workshop organized by Aggregate Intellect. This post serves as a proof of work, and covers some of the concepts covered in the workshop in addition to advanced concepts pursued by the participants.


Author: Willy Rempel
Willy Rempel

Reinforcement Learning Introduction

The foundation of all Reinforcement Learning (RL) is of an agent that acts on, and is acted on by its environment.

The agent-environment relationship forms a closed loop:

  1. the environment receives from the agent an action

  2. this action can change the environment. That is, the environment changes state

  3. the agent then receives from the environment both state and reward information

This loop can be formalized as a Markov Decision Process (MDP),

p(s,rs,a)=Pr(St=s,Rt=rSt1=s,At1=a)p(s',r \vert s,a) = Pr(S_t=s', R_t=r \mapsto S_{t-1}=s, A_{t-1}=a)

where the next state ss' and its associated reward rRr \in \mathbb{R}, is only dependant on the previous state ss, and the action taken aa. Every part of this system can be elaborated upon, and from this follows all the rest of the field.

Environments can be discrete or continuous. State changes can be deterministic or not. Interaction loops can be finite in nature (episode, trace) or indefinite. An environment can be described by its states, the allowed actions at each state, and a transition function that returns the next state.

The agent’s goal is to maximize its total reward over time. Thus it needs to choose a particular sequence of actions that will achieve this. Agents choose their actions based on a policy function μ(s)\mu(s) if it is deterministic, or π(s)\pi(s) if it is probabilistic. The policy can be as simple as a static look-up table, or as complex as a large, deep neural network. An ϵgreedy\epsilon-greedy policy is one where the agent chooses a random action with probability ϵ\epsilon, or the maximally rewarding action otherwise. This highlights the trade-off between exploitation and exploration that is a common theme of policies.

When the agent-environment loop is indefinite in duration (non episodic), the rewards cannot be simply summed. A discounting factor 0<γ<10 < \gamma < 1 is used to attenuate future rewards:

Gt=Rt+1+γRt+2+γ2Rt+3+γ3Rt+4+...G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \gamma^{3}R_{t+4} + ...

Agents also have either a value function v(s)v(s) that returns a value for state s, or q(s,a) that values a state-action pair (the value of taking action a while in state s).

Both value functions satisfy a recursive relationship expressed by the Bellman equations

vπ(s)=Eπ[k=0γkRt+k+1St=s], for all sSv_{\pi}(s) = \mathop{\mathbb{E}_{\pi}}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}S_{t}=s\right],\ for \ all \ s \in S
qπ(s,a)=Eπ[k=0γkRt+k+1St=s,At=a], for all sSq_{\pi}(s,a) = \mathop{\mathbb{E}_{\pi}}\left[\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}\big\vert S_{t}=s, A_{t}=a\right],\ for\ all\ s \in S

Almost all reinforcement learning algorithms are General Policy Iteration (GPI) methods:

  • Maintain approximate value and policy functions

  • The policy is iteratively improved with respect to the value function, while the value function is evaluated with respect to the policy

  • This feedback loop converges to optimal policy and value functions

  • the value function is used to structure and constrain the policy search

Dynamic Programming algorithms are at one extreme of RL methods requiring a perfect model of the environment and typically exponential computation cost. They are used for the theoretical underpinnings of reinforcement learning as opposed to practical use. Briefly, dynamic programming involves finding optimal solutions by progressively building from optimal solutions to sub-problems.

At the other extreme Monte Carlo (MC) methods have no model and rely soley on experience from agent-environment interaction. The value of a state s is computed by averaging over the total rewards of several traces starting from s. These methods require completing entire episodes before the value function can be updated.

Temporal Difference (TD) learning is an invaluable approach that combines advantages from both DP and MC methods. As the name implies, valuation updates are done recursively by the difference between time steps. It does not require an environment model like DP, and unlike MC it can update prior to episode completion. Like MC, it learns directly from experience, and like DP it iteratively updates estimates. This is easiest to show by comparing value function update rules:

Monte Carlo

V(St)V(St)+α[GtV(St)]V(S_{t}) \leftarrow V(S_{t}) + \alpha[G_{t} - V(S_{t})]

Dynamic Programming

vπ(s)=Eπ[Rt+1+γGt+1St=s]v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1}\vert S_{t} = s]

Temporal Difference

V(St)V(St)+α[Rt+1+γV(St+1)V(St)]V(S_{t}) \leftarrow V(S_{t}) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})]

where α\alpha is the learning rate, and the term Rt+1+γV(St+1)R_{t+1} + \gamma V(S_{t+1}) is the updated estimate of V(St)V(S_{t}) SARSA is a TD algorithm that is an ‘on-policy’ learning method. On-policy methods evaluate and improve the same policy π\pi that is used to make the action decisions.

In contrast, an off-policy method uses two policies: a behavioural policy bb that is more amenable to explore traces outside of current optimal estimates, and the target policy π\pi to be optimized.

Q-learning is an off-policy TD method. It is defined by:

Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)Q(St,At)]Q(S_{t},A_{t}) \leftarrow Q(S_{t},A_{t}) + \alpha[R_{t+1} + \gamma\max_{a}Q(S_{t+1},a) - Q(S_{t}, A_{t})]

Deep Q-learning (DQN) uses deep neural networks for the policy and value functions. The cost function for DQN is

((DQNnet(St,a)(r+γmaxaDQNnet(St+1,a))))2\big(( DQN_{net}(S_{t},a) - (r + \gamma\max_{a}DQN_{net}(S_{t+1},a))\big))^{2}

An important addition to the architecture is experience replay by the use of a memory D. Agent experiences are stored as tuples et=(st,at,rt,st+1)e_{t} = (s_t,a_t,r_t,s_{t+1}) over many episodes. During training minibatches of samples are taken from D at random for standard deep learning optimization.

Rainbow DQN combines several architectural innovations to vanilla DQN that have proven to be beneficial:

  • Double deep Q-Learning

  • Duelling DQN

  • Action Advantage

  • Noisy Networks

  • Multi-step Learning

  • Prioritized Experience Replay

Rainbow DQNs will not be covered here. Please see for more.

Lastly, Policy Gradient Methods directly optimize a parameterized, differentiable policy function. This approach does not require the use of the value function for action selection. For example, the REINFORCE Monte Carlo Policy Gradient algorithm trains policy π(aSt,θ)\pi(a \vert S_{t},\theta) with parameters θ\theta by the update rule

θt+1=θt+αγtGtπ(AtSt,θt)π(AtSt,θt)\theta_{t+1} = \theta_{t} + \alpha \gamma^{t} G_{t}\frac{\nabla\pi(A_{t}\vert S_{t}, \theta_{t})}{\pi(A_{t}\vert S_{t},\theta_{t})}

where α\alpha is the learning rate, γ\gamma is the discounting factor, and GtG_{t} is the total episodic return. A useful algebraic trick is to reformulate the right-most term above as

π(AtSt,θt)π(AtSt,θt)=lnπ(AtSt,θt)\frac{\nabla\pi(A_{t}\vert S_{t}, \theta_{t})}{\pi(A_{t} \vert S_{t},\theta_{t})} = \nabla ln \pi(A_{t}\vert S_{t},\theta_{t})

In this superblog, firstly we discuss the Policy Gradient and Q-learning (PGQL) technique, which combines the advantages of the two classic reinforcement learning techniques. Secondly, the hierarchical Deep Q-Networks (h-DQN) technique is detailed, which builds upon previous DQN literature in the way that reward value functions are formulated.

We then outline Data Driven Discovery of Models (D3M), which contributes to automated machine learning (AutoML) by leveraging Monte Carlo Tree Search (MCTS), a technique commonly used in reinforcement learning. Finally, we walk through a reinforcement learning use case of financial predictions which incorporates LSTM into the architecture.

Combining policy gradient and Q-learning

Author: Ramya BalasubramaniamAuthor: Yony BreslerAuthor: Jiri StodulkaAuthor: Madhur Kanjolia
Ramya BalasubramaniamRB wrote the outline section
Yony BreslerYB wrote the introduction section
Jiri StodulkaJS wrote the theory section
Madhur KanjoliaMK wrote the results and conclusion sections


For improving the policy of a reinforcement learning setup, it is customary to apply policy gradients on the usual online variants that are on-policy. In order to take advantage of the off-policy data as well, the paper seeks to combine online calculations of policy gradients with off-policy style Q-learning. Drawing inspiration from replay buffer, the proposed technique makes a connection between the estimation of Q-values from the action preferences of the policy and the regularized fixed points of the policy gradients. Known as Policy Gradient and Q-learning (PGQL), this technique has been successfully tested on the full-suite of Atari games and benchmarked against Action Critic (A3C) and Q-learning algorithms. The numerical examples cited in the paper have demonstrated both improved data efficiency and stability in performance.


Prior to this work, most model-free reinforcement learning fell into one of two types of algorithms: 1) action-value fitting, where the Q-values are fitted, and 2) policy gradient, where the policy is represented explicitly and improved by updating it in the direction of the gradient of improved performance.

Q-learning is a model-free method to find the Q-values for the optimal policy without changing the policy that was used to generate that trajectory, where each Q-value represents the “quality” or expected reward of a given state-action pair. This is known as off-policy, since learning can be performed after the fact and independently from the agent used to make those policy decisions. Policy gradient methods are normally implemented as an online policy as they require an estimate of action-value of the current policy.

This paper combines policy gradient and Q-learning, but doing so requires developing a formal connection between the two: first the authors derive a link between the fixed point of the policy gradient algorithm and its Q-values. Then they also show that any regularized actor-critic method (which includes policy gradient) can be seen as an action-value fitting method when the Q-values are described in a certain way.

We begin with some definitions: Given a set of states SS and a set of actions AA per state, a policy π\pi maps a state ss in SS along with an action aa in AA, with the probability of choosing that action. The goal is to maximize the expected discounted return, where the discount factor λ(0,1)\lambda\in(0,1) sets how much the agent will prioritize long-term over short-term rewards. We will also make use of the Bellman operator,

TpiQ(s,a)=Es,r,b(r(s,a)+γQ(s,b))\mathfrak{T}^pi Q\left(s,a\right) = \mathbf{E}_{s',r,b}\left(r\left(s,a\right)+\gamma Q\left(s',b\right)\right)

where the expectation E\mathbf{E} is taken over the next state ss', the reward r(s,a)r\left(s,a\right) and the action bb from the policy πs\pi_{s'}.

The authors show that the error between the traditional Q-values and the Q-values estimated from following a regularized policy gradient are orthogonal to , and thus is a locally optimized estimate of the Q-value. This is beneficial since following a regularized policy gradient will approximate the optimal Q-values.

Separately, the authors show that actor-critic methods can be interpreted as action-value fitting method. Choosing a suitable parametrization, leads to identical updates in both approaches, and for small α\alpha,the Bellman residuals are sufficiently small that the Q-values converge to the fixed point of the Bellman equation, which are the optimal Q-values.


The main contribution of the work is the hybrid policy gradient and Q-learning, called “PGQL” technique. The method combines the two types of reinforcement learning: action-value fitting (Q-Learning) and policy gradient technique. In particular, O’Donoghue et al. demonstrate that PGQL can outperform both the Asynchronous Advantage Actor-Critic (A3C), developed by Google’s DeepMind in 2016, and Q-Learning.

For context, A3C is a deep learning technique that, unlike Q-Learning, uses multiple agents. Each agent asynchronously interacts with their own network parameters and a copy of the environment. Agents are controlled by the global network. When it comes to Actor-Critic, the A3C algorithm combines the best practices of action-value fitting and policy-gradient methods. Simply put, A3C predicts the value function V(s)V(s) (how good is a certain state) and policy function π(s)\pi(s) (probability of output given by actions).


The paper refers to the Bellman equation, which specifies how the agent can update state-value estimation by taking a reward of its current state rtr_{t} together with the state-value of the next state rt+1r_{t+1}.

Bellman Equation

It is also shown that at a fixed point, the Bellman’s residuals will be small for the regularized penalty parameter α\alpha that works the same way as the learning rate in gradient descent algorithm. To achieve the goal, O’Donoghue et al. combine two updates to the policy: regularized policy gradient update and the Q-Learning update. According to the authors, the update can be interpreted as an actor-critic algorithm with optimizing critique and therefore minimizing the Bellman equation’s residuals.


The authors proposed the following implementation scheme. One or more agents interact with the environment, and perform on-policy updates given the states and perceived rewards at those states. The agent(s) use an actor-critic algorithm to update shared parameters. When the agent processes new data from the environment, it stores it in a “shared memory buffer”. The samples from the buffer allow each learner to separately learn the policy parameters (step Q-Learning).

There are two main advantages to this approach:

Firstly, the critic can accumulate Monte Carlo return over time for many time periods. Therefore, the influence of future rewards can be distributed backwards in time. Secondly, the memory buffer can replay the most important experiences and prioritize them. The replay buffer that stores important past experiences from sample prioritization, can work as a regularizer and hence make sure the policy will still satisfy the Bellman equation.

Experiments: Grid World and Atari

Grid World

The work discussed PGQL implementation in two settings, Grid World and Atari.

First, Grid World is a “toy” 4 by 6 grid environment (a). The agent always begins on the square labelled “S”, depicted in the image below. The game ends when the agent reaches “T” where it receives a reward of 1.

Grid World Map

In addition to PGQL, O’Donoghue et al. let the other two agents, Actor-critic (TD-Actor Critique) and Q-Learning, learn the same Grid world (picture above). The agent always started in “S” and tried to learn reach the end point “T”. The plot (b) shows true expected performance of the policy. The algorithms perform as follows: Actor-critic uses value function to generate an estimate of Q-values that should work as the critic. Q-Learning uses data from the buffer of prior experience to update at each step and parametrize Q-values. The experiment resulted in findings that actor-critique agents failed to learn a path to “T” at step-sizes larger than 1. Contrary, both PGQL and Q-learning could learn with an increase in the step-size. O’Donoghue et al. argues that the observation can be explained by the stabilizing effect of using replay. Clearly, PGQL outperforms the two other algorithms.

Performance versus agent steps in grid world

Secondly, the authors used a neural network to parametrize the policy. They performed the testing in the full suite of Atari benchmarks consisting of over 55 different games. The figure below demonstrates how O’Donoghue et al. designed a policy network. PGQL’s architecture is is similar to A3C. However, PGQL is augmented with a parameterless additional layer which outputs Q-value estimate. Except the layer, the authors chose architecture and parameters identical to AC3.

PGQL Network Augmentation Chart


Methods compared: PGQL, Asynchronous advantage actor critic (A3C) model, and variant of asynchronous deep Q learning. For each method, all games (total 57 games) used identical hyperparameters.

Overall, PGQL performed best in 34 games (60%), A3C performed best in 7 games (12%) and Q-learning was best in 10 games (18%). In 6 games (10%), 2 or more methods tied.

In Table 1 and 2, the authors give the mean and median normalized scores as a percentage of an expert human normalized score, across all games for each tested algorithm from random and human-start conditions respectively.

As shown below, in both cases, PGQL has both the highest mean and median, and its median score exceeds 100%, the human performance threshold.

model performance

In Figure 3, we see some games where PGQL was the best performer and had far better data efficiency (taken lesser steps to achieve high score) than other methods.

Atari runs where PGQL ran well

In Figure 4, we see some games where PGQL under-performed. In these cases, though PGQL had better data efficiency early on, performance later saturated or collapsed.

The authors hypothesize that this could be due to policy achieving a local optimum or over fit to early data and might perform better after tuning hyperparameters.

Some Atari runs where PGQL performed poorly


The authors have successfully combined the A3C architecture using on-policy inputs with an additional final layer of Q learning from stored experience. The resulting method has been called PGQL, for policy gradient and Q-learning.

PGQL shows better data efficiency and stable results on a suite of Atari games when compared to A3C or Q-learning alone.


Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu, Volodymyr Mnih:

PGQ: Combining policy gradient and Q-learning. CoRR abs/1611.01626 (2016)

Hierarchical Deep Reinforcement Learning- Integrating Temporal Abstraction and Intrinsic Motivation

Author: Robert Hensley
Robert HensleyRH has solely contributed to this section

Hierarchical RL Summary

The authors of “Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation” call their proposed system a hierarchical DQN (h-DQN). In their paper, they build upon previous DQN works that proposed learning “options” in real time, varying reward functions, and generalized value functions which consider “goals” in addition to states. Motivated by the open challenge of learning goal directed behaviour in environments with sparse rewards, the authors introduce a meta-controller and controller to use with the actor-critic paradigm.

At a high level, the basic system has a meta-controller that predicts the values of a set of intrinsic goals (given a state), a controller that predicts the values of a set of actions (given a state and a goal), and a critic that evaluates whether the intrinsic goal has been reached, and then provides a reward. The meta-controller has an objective function that maximizes the extrinsic reward, which is given by the environment, while the controller’s objective function maximizes the intrinsic reward, which is given by the critic.

A key challenge in this meta-controller’s task, and a open area in the field, is object detection. In order to give the meta-controller goals to choose from, the authors build a custom object detector to identify entities in their environments. In the ATARI game, ‘Montezuma’s Revenge,’ examples of these entities include the key, middle-ladder, top-right-door and so on.

Screenshot from "Montezuma's Revenge"

Once identified, entities are masked by graying out some of the goal image space. These entities become the intrinsic goals, which the meta-controller selects from and provides to the controller. Some entities are also extrinsic goals. In the case of Montezuma’s Revenge, the key rewards +100 points from the environment, and a door rewards +300 points from the environment (for a maximum of 400 points).

The structure of the meta-controller and controller are both 3 layer convolutional neural networks, followed by 1 linear layer, and 1 output layer. Every layer is followed by ReLU nonlinearities, except the output layer, which outputs Q values. The meta-controller takes in four images (of size 84X84) as its input, and outputs Q values for the possible goals, while the controller takes in four images (84X84) and the selected goal, identified by an appended mask, and then outputs Q values for the possible actions. The critic then takes in the action that is output by the controller, and rewards the agent if the intrinsic goal is reached. Then, depending on the outcome, the controller-critic cycle continues, or the meta-controller provides a new goal, until the extrinsic environment goals are reached, or until termination.

Meta-controller struture

While the controller, continuously tries to reach the intrinsic goal, the authors view the critic as existing in <entity1, relation, entity2> space, tasked with awarding the reward rr. As a system the model learns to prefer different intrinsic goals. For example, the figure from their results (see below) shows how the system picks the different intrinsic objects/goals over time, for Montezuma’s Revenge:

success % of different goals over time

As the agent progresses, the transitions made are stored in ‘D2’ meta-controller memory and ‘D1’ controller memory. The meta-controller memory stores “states, goals, extrinsic rewards, and next states” for each set of ‘Transitions.’ Likewise, the controller memory stores values for each of its transitions in the form of, “state_goal_pairs, actions, intrinsic rewards, next states.” While the ‘D1’ memory is updated every action, the ‘D2’ memory is updated once the target goal has been reached, or once the episode has ended.

When an environment has goals that exist in sparse spaces, DQNs have trouble reaching their goals reliably. However, generally, the h-DQN has shown it can consistently reach these extrinsic environment goals, by setting intrinsic goals, much more reliably than DQNs. Below is a graph of the total extrinsic rewards received by a standard DQN vs the h-DQN approach, in Montezuma’s Revenge:

totoal extrinsic rewards received by a standard DQN vs. h-DQN

The h-DQN learns to reliably move toward the maximum extrinsic reward (400 points), whereas the DQN has no success.

In addition to Montezuma’s Revenge, the authors also perform a smaller experiment in a simpler system, referred to as the “discrete stochastic decision process.” The rules for the game in this experiment are simple: they have 6 states, and start at state 2. The agent receives a reward of 1 if the agent visits state 6 before state 1 (state 1 being the terminal state), and the agent receives a reward of .01 if the agent never visited state 6 before arriving at state 1. There are two possible actions, left and right, where right actually goes left 50% of the time, when selected. This creates an impedance toward visiting the furthest rightward state (state 6).

A stochastic decision process chart DAG

In this experiment, the authors show that the base model, standard Q learning with no intrinsic goals, learns a suboptimal policy of going directly to the terminal state (state 1) in order to receive the .01 reward. While their h-DQN, which does set intrinsic goals, learns to visit state 6 before going to the terminal state. In the charts below, we can see that their approach learns to consistently achieve a higher reward over time, by visiting state 6 increasingly more often, where the base Q learning model stayed relatively flat.


In both the “discrete stochastic decision process” and “Montezuma’s Revenge,” experiments, the authors demonstrate how the addition of intrinsic goal selection (via a meta-controller) can coax an agent into learning policies that yield higher rewards in environments with space feedback, where traditional DQNs fail.

Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar

Author: Gunjan LakhlaniAuthor: Andriy KourinnyiAuthor: Alireza Darbehani
Gunjan LakhlaniGL wrote the draft for introduction
Andriy KourinnyiAK provided the explanations of the Results section
Alireza DarbehaniAD gathered the pieces of the draft, wrote method section and keeping the team on track


The goal of automatic machine learning (AutoML) is to automate the end-to-end process of applying a machine learning model to solve a given task. An important contribution to the AutoML world is DARPA’s Data Driven Discovery of Models (D3M) which has an overall goal to automate model discovery. This is done by performing either a grid search on the set of model hyperparameters or by applying some kind of genetic algorithm.

The AlphaD3M paper solves the problem of “pipeline synthesis,” that is, finding the optimal sequence of data transformations in order to obtain an optimal machine learning model. The authors use a reinforcement learning approach, and formulate the problem of pipeline synthesis for model discovery as a single-player game, the inspiration of which comes from AlphaZero. The main idea is that a player iteratively builds a machine learning pipeline by selecting from a fixed set of actions (which in this case happen to be insertion, deletion, and replacement of the pipeline parts). This approach provides the advantage of model interpretability and also leverages modern deep reinforcement learning technique called Monte-Carlo Tree Search (MCTS).

AlphaD3M progresses on the task of pipeline synthesis by using self play with iterative self improvement. It is particularly efficient at finding solutions to search problems in very high dimensional spaces. Pipeline synthesis is a data mining workflow of these primitive steps: pre-processing, feature extraction, feature selection, estimation, and post-processing. The architecture is shown in figure 1.

AlphaD3M architecture

Existing AutoML frameworks use any one of the following methods, individually: differentiable programming, tree search, evolutionary algorithms, and Bayesian optimization to find the best pipeline architecture for a specific dataset and task. Unlike the existing frameworks, AlphaD3M uses Neural Networks along with Monte-Carlo tree search, as mentioned previously. This combination creates an environment in which the neural network learns patterns in building the pipeline using the primitive components, and predicts action probabilities, and the tree search breaks down the problem into components and looks ahead for good solutions.


As mentioned above, AlphaD3M uses neural Networks and MC tree search for solving the Meta Learning problem using the Sequence Modeling method. We will discuss the specifics of the methodology below.

State Representation Algorithm

A pipeline with meta-data and a problem definition together are analogues to a gameboard environment. The Pipeline Encoding Algorithm represents the environment and possible actions in the AlphaD3M AutoML framework.

Pipeline Encoding Algorithm:
Given datasets DD, tasks TT, and a set of possible pipeline sequences S1,...,SnS_1,...,S_n from the available machine learning, and data pre and post processing primitives.

  • For each dataset DiD_i and task TjT_j
    1. Encode dataset DiD_i as meta data features f(Di)f(D_i).
    2. Encode task TjT_j.
    3. Encode the current pipeline at time tt by a vector StS_t.
    4. Encode action fa(St)f_a(S_t), so policy π\pi maps (f(Di),Tj,St)(f(D_i),T_j,S_t) to fa(S1),...,fa(Sn)f_a(S_1),...,f_a(S_n).

The Neural Network Architecture

AlphaD3M uses a LSTM (Long Short Term Memory) recurrent network architecture. The network ff with parameters θ\theta, estimates the probability of possible actions, P(s,a)P(s, a) as well as a value score v(s)v(s), for a given state ss:

fθ(s)=(P(s,a),v(s))f_\theta(s)=(P(s,a), v(s))

The network inputs are tuples (st,πt,et)(s_t,\pi_t,e_t), from the game of self play, where sts_t is the state at time tt, πt\pi_t is the policy estimated by MCTS, and ete_t is the actual pipeline evaluation at the end of the game. The output of the network, v(s)v(s), is the overall estimated evaluation score of the pipeline performance.

Loss Function

The loss function is based on matching the predicted model SS with the real world model RR, and the predicted result of pipeline valuevv with the real world evaluation of ee.

L(θ)=SlogR+(ve)2+αθ2+βS1L(\theta)=S\log R+(v-e)^2+\alpha||\theta||_2+\beta||S||_1

Monte Carlo Tree Search

The tree search takes the network output (P(s,a),v(s))(P(s,a),v(s)) and runs multiple simulations to find the best pipeline sequence RR with the best evaluation results. The search result improves upon the predicted result SS, which is a pipeline sequence, by improving the network policy using the following update rule:


Q(s,a)Q(s,a): the expected reward for action aa from state ss.
N(s,a)N(s,a): the number of times action aa was taken from state ss.
N(s)N(s): the number of time state ss was visited.
P(s,a)P(s,a): the estimate of the neural network for the probability of taking action aa from state ss. cc: is the constant which determines the amount of exploration.

For each step of the simulation, if:

  1. the current action aa and state ss maximize U(s,a)U(s,a) and
  2. this state does not yet exist in the neural network with estimates (P(s,a),v(s))(P(s,a),v(s)),

then we add the state. Otherwise, the model calls the tree search again recursively.

Next, RR is determined by running the generated pipeline against the data to solve the task. The resulting evaluation is ee. Therefore, the real world search has provided (R,e)(R,e), where RR is the real world model, constructed with machine learning primitives, and the real world evaluation, ee, of the model and the pipeline using primitives.


The results were compiled from running the algorithm on 313 tabular data sets, some of which are from OpenML. The data sets include 121 binary classification and 108 multi class classification problems, and 84 univariate regression tasks.

In order to compare AlphaD3M’s performance to that of sklearn’s SGD estimator baseline pipelines, the authors plot the results of applying the respective algorithms to the 180 classification datasets. This plot is shown in Figure 2. The cases where AlphaD3M performed better than sklearn is shown with green circles.

AlphaD3M vs SGD Performance

The results of classification task are significant, and show that AlphaD3M performs better for 75% of the datasets, while worse performance is found in just 7% of the datasets. While the performance of the remaining 18% are equally comparable. Next, evaluating the cross-validation performance of these 180 datasets used in the classification task for Alpha3DM versus the SGD baseline pipeline by the estimator types is plotted in Figure 3 (shown below) to show the performance split. Here again, AlphaD3M has better performance with diverse estimators.

AlphaD3M Performance

Comparing different AutoML methods to AlphaD3M, the authors chose to use Auto Sklearn, TPOT, and Autostacker to serve as a benchmark for AutoML systems on a set of OpenML datasets. The computed performance mean, and standard deviation for repeated evaluations can be seen in Figure 4.

Performance mean and standard deviation values

Alpha D3M once again shows that it is competitive against the other Auto ML methods.

AlphaD3M was implemented in PyTorch and used a GPU when training the neural network and used a CPU for MCTS. This had favourable impacts on running time as shown in Table 1, where AlphaD3M has a computation time that is an order of magnitude faster.

Running time comparison (in seconds and speedup factors)

Dataset/Method TPOT Autostacker AlphaD3M Speedup vs TPOT Speedup vs AS
breast cancer 3366 1883 460 7.3 4
hill valley 17951 8411 556 32.2 15.1
monks 1517 1532 348 4.3 4.3
pima 5305 1940 619 8.5 3.1
spectf 4191 1673 522 8 3.2
vehicle 16795 4010 531 31.6 7.5


AlphaD3M seems to be an efficient framework for automated machine learning. It takes a different approach than other AutoML frameworks such as TPOT, auto-stacker, and auto-sklearn. It leverages deep reinforcement learning to solve a high dimensional space problem of building the best machine learning pipeline for a specific task and dataset in a very efficient manner.

Financial series prediction using Attention LSTM

Author: Vladimir BlagojevicAuthor: Eric Djona FegnemAuthor: Tryambak Kaushik
Vladimir BlagojevicVB developed the framework to test different deep learning algorithms before they were used in reinforcement learning algorithm. VB reviewed the final manuscript.
Eric Djona FegnemEDF and TK together developed the transformer policy
Tryambak KaushikTK wrote the article describing the problem, the goals, general assumptions about the reinforcement learning (RL) stock trading environment and the different RL policies modeled in the current work. TK adapted an existing stock trading bot for three policies (RNN, Attention LSTM and Weighted LSTM), and completed experiments with trend predictions for all policies. EDF and TK together developed the transformer policy.


Stock market fluctuations reflect market sentiment, analysts’ evaluations, and economic activity in general. However, such characteristic behaviour increases investment risks, and under uncertain market conditions it may be difficult for institutions and businesses to raise capital. A decrease in investment and economic activity impacts unemployment and poverty rates, which is certainly problematic. Fund managers use both fundamental and technical analysis to quantify these risks and increase investor confidence. While fundamental analysis generally involves evaluating the financial statements of a listed company (e.g. considering revenue, free cash flow, market capitalization, etc), technical analysis is restricted to understanding stock price movements and patterns as shown on charts.

With the recent advancements in data science and machine learning, there is growing demand to leverage machine computing power to improve technical analysis (like every other business). Reinforcement Learning (RL) is a suite of machine learning algorithms with an ability to learn (or re-inforce) from near-past and/or expected performances. Consequently, RL is believed to have significant potential to improve technical analysis for stock market analysis. In the current work, we compare the performance of two different RL algorithms for predicting price movements of the S&P 500, a stock market index that measures the performance of the 500 largest publicly traded companies in the United States.

RL consists of three components: the environment, state, action and rewards. The environment in this setting is stock fluctuations, or movements of stock prices with time. The state is the current situation within the larger environment where the algorithm has to make a decision (or action) to move to one of several possible future states. The action is to buy or sell, in order to make a profit, or a loss in the stock trading environment. A combination of actions and states results to RL rewards. In order to maximize rewards among several such combinations, RL uses a policy framework.

There are different policy frameworks available for RL policy design. Actor-critic is one such framework. In this framework, a ‘critic’ algorithm estimates a state-value or action value of the environment and uses it to direct the parameters of ‘actor’ algorithm. The ‘actor’ algorithm is the main policy framework which is trained to maximize desired desired RL outcomes. In the current work, both algorithms were implemented with deep neural networks, as on-policy where updates were performed according to the agent’s policy.

The Gym toolkit from OpenAI was used to set up the trading environment and policy. The Gym is modular and offers several predefined wrappers to implement and test different Reinforcement Learning architectures. The detailed information about Gym is available at In our current work, we use Gym to compare the performance of 3 reinforcement learning policies: RNN, Transformer, Attention LSTM and Weighted LSTM architectures.

RNN Architecture

A recurrent neural network (RNN) is a deep neural network that consists of 3 parts:a linear feed forward function, an activation function, and feedback (recurrent) layer from the previous step. RNN is thus a sequential architecture, i.e., data positioned later in the input sequence is processed only after the preceding data. In sequential problems such as time series analysis, this helps to preserve the memory of the time series. A popular type of RNN, called a Gated Neural Network (GNN) is used in the current project. GNNs use update and reset gates to store values for a certain period of time, and retrieve these values as necessary. Consequently, the model learns to decide the amount of information to purge. This prevents the vanishing gradient problem which occurs in standard RNNs. Details of this concept can be found in the following blog post.

In our RNN policy implementation, 4 layers were defined. These four layers consisted of 2 linear layers followed by a convolution layer and GRU (gated recurrent network) layer.

Transformer Architecture

RNNs are known to be difficult to parallelize on GPUs, due to their inherent sequential algorithm. Transformers, on the other hand, with their ability to analyse data in parallel, offer an excellent alternative. A standard Transformer popularly used for NLP problems primarily contains encoder, decoder, and attention layers, and a positional encoder. However in the current work, the decoder component of the Transformer is not implemented. In language translation problems, the encoder transforms the source language sentence to a pseudo-code and the decoder translates this pseudo-code back to the target language for interpretation of the end-user. Butfor this trading project the end-user does not need a translation back from pseudo-code to a target language. Hence, the encoder outputs can be directly used to take actions in RL environment, rendering the decoder redundant.

The Transformer was implemented as 2 layers (2 Transformer blocks) of 2 attention heads each, with a model embedding dimension of 50.

Attention LSTM

LSTMs are also a type of RNN which, similar to GRUs, help mitigate the vanishing gradient problem. An LSTM uses a hidden state and cell state to control the flow of information. The hidden state controls the flow of information over short time periods while the cell state of LSTM controls the information flow for long time periods, which leads to its name: Long Short Term Memory. Although an improvement over a vanilla RNN model, LSTM is not very efficient in capturing the relative importance of short and long time period memories for inputs and targets. LSTM representations are fixed length which can fail to capture all information from the input sequences, especially if the sequences are longer than the training sequences. An alternative solution to this problem is freeing the model from fixed length representations by introducing attention. Attention, as the name suggests, allows the models to be objectively attentive to near and far data models helps to address improve This model builds on LSTM with attention network added. Further details of the architecture are available at

Weighted LSTM

This model is similar to Attention LSTM and differed only in the assignment of weight, ie, AttentionLSTM was modified by multiplying the loss functions with a weight. This weight was the ratio of opening to closing stock price of a given day.

S&P 500 stock prices were divided into training and testing data. The training data (January 1, 2014 to August 31, 2018) was used to train the RL policies, and the testing data (September 1, 2018 to August 31, 2019) was used to test the model performance. The policies were trained for 4000 “episodes” on the training dataset. An “episode” is a collection of all states (in our case, each day is a state, where 3 actions can be taken: buy, sell or do nothing) from the start to the end of the trading dataset. As this work is a proof of concept, the effect of changing the number of episodes was not evaluated in the work. Similarly, the starting capital (assumed as $15000), starting number of shares (assumed as 100), number of strides (i.e., number of days between successive trades was set to 4), lookback period (i.e., number of days from the current day governing the reward of the action, set to 60 days), different algorithm parameters (number of layers, number of attention heads, embedding size, etc.) were not changed for simplicity of discussion and analysis.

Profit, defined as the difference of starting portfolio value from the ending portfolio value on a day, was used as the output metric. An episode will have positive profit, negative profit or zero profit depending on the combination of state and actions within the state. The total sum of the profit of all trades of an episode was used to compare the performance of different policies. Higher profit is a measure of better policy performance. However, this metric for RL policy predictions was found to be difficult to reproduce. For ease of comparing the results from different RL policies, the profits for 100 iterations from each policy was evaluated and normalised to allow for relative comparison. Iterating the predictions 100 times for the final comparison is expected to eliminate sampling bias. Normalised profit was defined as the difference of profit from the target policy to RNN policy for each iteration. The final results are presented in the figure below.


Reinforcement Learning Predicted Normalized Profits - Train Data Results Reinforcement Learning Predicted Normalized Profits - Train Data Results

Reinforcement Learning Predicted Normalized Profits - Test Data Results Reinforcement Learning Predicted Normalized Profits - Test Data Results

As can be seen from the output plots for train and test data, the RL policy performance does not show any significant performance advantage among RNN, Transformer and Attention LSTM. However, Weighted LSTM is generally found to perform better than Attention LSTM for both test and training data. Secondly, the defined profit metric was helpful in relative comparison of the different deep learning algorithms but needs improvement to reflect the inherent randomness in the predictions more representatively.

Although the analysis presented here is very limited in its scope, as it only considers one entity (the S&P 500) over a defined period, yet the results give an insight into the relative performance of different deep learning algorithms. Further evaluation and in-depth analysis of the effects of different RL environment and policy parameters on the financial stock market prediction must be done, especially in the definition of performance metrics.


The above analysis is only for education and research purposes and authors do not claim any liability for its commercial implementation.


Papers on advanced Algorithms:

Applied RL papers: