
Transformations: Modular tensor computation graphs in Julia

By: Tom Breloff

Re-posted from: http://www.breloff.com/transformations/

In this post I’ll try to summarize my design goals for the Transformations package in the JuliaML ecosystem. Transformations should be seen as a modular, higher-level approach to building complex tensor computation graphs, similar to those you might build in TensorFlow or Theano. The major reason for designing this package from the ground up is the flexibility of a pure-Julia implementation for new research paths. If you want to apply convolutional neural nets to identify cats in pictures, this is not the package for you. My focus is on complex, real-time, and incremental algorithms for learning from temporal data. I want the ability to track learning progress in real time, and to build workflows and algorithms that don’t require a GPU server farm.


What is a tensor computation graph?

A computation graph, or data flow graph, is a representation of math equations using a directed graph of nodes and edges. Here’s a simple example using Mike Innes’ cool package DataFlow. I’ve built a recipe for converting DataFlow graphs to PlotRecipes graphplot calls. See the full notebook for complete Julia code.

We’ll compute $f(x) = w * x + b$:

g = @flow f(x) = w * x + b
plot(g)

The computation graph is a graphical representation of the flow of mathematical calculations needed to compute a function: follow the arrows, and apply the operations to the inputs. First we multiply w and x together, then we add the result to b. The result of the addition is the output of the function f.
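
For scalar inputs, following the arrows is just two operations (illustrative values):

w, x, b = 2.0, 3.0, 1.0
wx = w * x      # first node: multiply  -> 6.0
y  = wx + b     # second node: add      -> 7.0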

When x/w/b are numbers, this computation flow is perfectly easy to follow. But when they are tensors, the graph is much more complicated. Here’s the same example for a 1D, 2-element version where w is a 2×2 weight matrix and x and b are 2×1 column vectors:

plot(@flow f(x) = out(w11*x1 + w12*x2 + b1, w21*x1 + w22*x2 + b2))

Already this computational graph is getting out of hand. A tensor computation graph simply re-imagines the vector/matrix computations as the core units, so that the first representation ($f(x) = w * x + b$) is used to represent the tensor math which does a matrix-vector multiply and a vector add.
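
In Julia, the tensor version is a single expression over arrays (illustrative values):

w = [1.0 2.0; 3.0 4.0]   # 2×2 weight matrix
x = [0.5, -1.0]          # 2-element input vector
b = [0.1, 0.2]           # 2-element bias vector
f = w * x + b            # one matrix-vector multiply and one vector add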

Transformation Graphs

Making the jump from a computation graph to a tensor computation graph was a big improvement in complexity and understanding of the underlying operations. This improvement is the core of frameworks like TensorFlow. But we can do better. In the same way a matrix-vector product

$$\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 \\ w_{21} x_1 + w_{22} x_2 \end{bmatrix}$$

can be represented simply as the vector $Wx$, we can treat the tensor transformation $f(x) = wx + b$ as a black-box function which takes an input vector ($x$) and produces an output vector ($f(x)$). Parameter nodes ($w$/$b$) are considered learnable parameters which are internal to the learnable transformation. The new transformation graph looks like:

plot(@flow f(x) = affine(x))

Quite the improvement! We have created a modular, black-box representation of our affine transformation, which takes a vector input, multiplies it by a weight matrix, and adds a bias vector, producing a vector output.
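
As a rough illustration of what such a black box wraps (a minimal sketch, not the Transformations.jl API), an affine transformation can be built as a closure over its internal parameters:

# w and b are internal, learnable parameters; the returned function is the
# black box mapping an input vector to an output vector.
function affine(nin, nout)
    w = randn(nout, nin)
    b = randn(nout)
    x -> w * x + b
end

f = affine(2, 2)
y = f([0.5, -1.0])    # 2-element output vector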

And here’s the comparison for a basic recurrent network:

g = @flow function net(x)
    hidden = relu( Wxh*x + Whh*hidden + bh )
    y = logistic( Why*hidden + Wxy*x + by )
end
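
For comparison, here is a hand-rolled sketch of what the recurrent black box above computes at each step (again, not the Transformations.jl API; Wxh, Whh, bh, Why, Wxy, and by are the learnable parameters referenced in the flow):

relu(z) = max.(z, 0)
logistic(z) = 1 ./ (1 .+ exp.(-z))

# One step of the recurrent net: update the hidden state, then compute the output.
function rnn_step(x, hidden, Wxh, Whh, bh, Why, Wxy, by)
    hidden = relu(Wxh*x .+ Whh*hidden .+ bh)
    y = logistic(Why*hidden .+ Wxy*x .+ by)
    y, hidden
end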


But… TensorFlow

The unfortunate climate of ML/AI research is: “TensorFlow is love, TensorFlow is life”. If it can’t be built into a TF graph, it’s not worth researching. I think we’re in a bit of a deep learning hype bubble at the moment. Lots of time and money are poured into hand-designed networks with (sometimes arbitrary) researcher-chosen hyperparameters and algorithms. Your average high school student can install a framework and train a neural network that produces quite complex and impressive results. But this is not the path to human-level intelligence. You could represent the human brain by a fully connected deep recurrent neural network with a billion neurons and a quintillion connections, but I don’t think NVIDIA has built that GPU yet.

I believe that researchers need a more flexible framework to build, test, and train complex approaches. I want to make it easy to explore spiking neural nets, dynamically changing structure, evolutionary algorithms, and anything else that may get us closer to human intelligence (and beyond). See my first post on efficiency for a little more background on my perspective. JuliaML is my playground for exploring the future of AI research. TensorFlow and competitors are solutions to a very specific problem: gradient-based training of static tensor graphs. We need to break the cycle that research should only focus on solutions that fit (or can be hacked into) this paradigm. Transformations (the Julia package) is a generalization of tensor computation that should be able to support other paradigms and new algorithmic approaches to learning, though the full JuliaML design is one which empowers researchers to approach the problem from any perspective they see fit.

Visualizing Graphs in Julia using Plots and PlotRecipes

By: Tom Breloff

Re-posted from: http://www.breloff.com/Graphs/

In this short post, I hope to introduce you to basic visualization of graphs (nodes connected by edges) using Plots in Julia. The intention is that visualizing a graph is as simple as inputting the connectivity structure, then optionally setting any of a ton of attributes that define the layout, labeling, colors, and more. Nodes are markers, and edges are lines. With this understanding, we can apply common Plots attributes as we see fit.


Setup

First, you’ll want to get a working setup of Plots and PlotRecipes:

# for pkg in ("Plots","PlotRecipes")
#     Pkg.add(pkg)
#     Pkg.checkout(pkg)
# end
using PlotRecipes

# we'll use the PyPlot backend, and set a couple defaults
pyplot(alpha=0.5, size=(800,400))

Type Trees

For our example, we’re going to build a graph of the type hierarchy for a Julia abstract type. We will look at the Integer abstraction, and view it as the center of all supertypes and subtypes of Integer. You can view this demo as a Jupyter notebook.

First, we’ll create a vector of our chosen type (Integer) and all its supertypes (Real, Number, and Any):

T = Integer
sups = [T]
sup = T
while sup != Any
    sup = supertype(sup)
    unshift!(sups,sup)
end
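
For Integer, this walk gives sups == [Any, Number, Real, Integer], ordered from the root of the type tree down to our chosen type.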

Next we will build the graph connectivity and node labels. source and destiny are lists of integers: the i-th edge connects node source[i] (the “source”) to node destiny[i] (the “destination”).

n = length(sups)
nodes, source, destiny = copy(sups), collect(1:n-1), collect(2:n)
function add_subs!(T, supidx)
    for sub in subtypes(T)
        push!(nodes, sub)
        subidx = length(nodes)
        push!(source, supidx)
        push!(destiny, subidx)
        add_subs!(sub, subidx)
    end
end
add_subs!(T, n)
names = map(string, nodes)

Now we will use the connectivity (source and destiny) and the node labels (names) to visualize the graph.

graphplot

The graphplot method is a user recipe defined in PlotRecipes. It accepts many different inputs to describe the graph structure:

  • source and destiny lists, with optional weights for weighted edges
  • adjlist: a vector of int-vectors, describing the connectivity from each node
  • adjmat: an adjacency matrix (see the short conversion sketch after this list)
  • LightGraphs.Graph: if LightGraphs is installed
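
For reference, converting the source/destiny form into an adjacency matrix takes only a few lines (a sketch; we won’t need it below):

nv = max(maximum(source), maximum(destiny))
adjmat = zeros(Int, nv, nv)
for (si, di) in zip(source, destiny)
    adjmat[si, di] = 1       # edge from node si to node di
end
# this matrix is another valid way to describe the same graph structure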

We will use the source/destiny/weights method in this post. Let’s see what it looks like without overriding default settings:

graphplot(source, destiny)


Cool. Now let’s add names to the nodes. Notice that the nodes take on a hexagonal shape and expand to fit the text. There are a few additional attributes you can try: fontsize, nodeshape, and nodesize.

graphplot(source, destiny, names=names)


We can also change the layout of the nodes by:

  • using one of the built-in algorithms (spectral, stress, or tree)
  • extending with NetworkLayout
  • passing an arbitrary layout function to the func keyword
  • overriding the x/y/z coordinates yourself (graphplot(..., x=x, y=y)).

As there’s a clear hierarchy to our graph, let’s use the tree method:

graphplot(source, destiny, names=names, method=:tree)


The tree layout allows the additional setting of the root of the tree. Let’s make the graph flow from left to right:

graphplot(source, destiny, names=names, method=:tree, root=:left)


All too easy. Finally, let’s give it some color. Remember that we’re building a generic Plots visualization, where nodes are markers and edges are line segments. For more info on Plots, please read through the documentation.

weights = linspace(1,2,length(source))
graphplot(source, destiny, weights,
    names = names, method = :tree,
    l = (2, cgrad()),  # apply the default color gradient to the line (line_z values taken from edge weights)
    m = [node==T ? :orange : :steelblue for node in nodes]     # node colors
)

Summary

Visualizing graphs with PlotRecipes is pretty simple, and it’s easy to customize to your heart’s content, thanks to the flexibility of Plots. In a future post, I’ll use this functionality to view and visualize neural networks using my (work in progress) efforts within JuliaML and other projects.

Deep Reinforcement Learning with Online Generalized Advantage Estimation

By: Tom Breloff

Re-posted from: http://www.breloff.com/DeepRL-OnlineGAE/

Deep Reinforcement Learning, or Deep RL, is a really hot field at the moment. If you haven’t heard of it, pay attention. Combining the power of reinforcement learning and deep learning, it is being used to play complex games better than humans, control driverless cars, optimize robotic decisions and limb trajectories, and much more. And we haven’t even gotten started… Deep RL has far-reaching applications in business, finance, health care, and many other fields which could be improved with better decision making. It’s the closest (practical) approach we have to AGI. Seriously… how cool is that? In this post, I’ll rush through the basics and terminology of standard reinforcement learning (RL) problems, then review and extend work in Policy Gradient and Actor-Critic methods to derive an online variant of Generalized Advantage Estimation (GAE) using eligibility traces, which can be used to learn optimal policies for our Deep RL agents.


Background and Terminology


The RL loop: state, action, reward (Image from Sutton & Barto)

The RL framework: An agent senses the current state (s) of its environment. The agent takes an action (a), selected from the action set (A), using policy (π), and receives an immediate reward (r), with the goal of receiving large future return (R).

That’s it. Everything else is an implementation detail. The above paragraph can be summed up with the equations below, where $\sim$ means that we randomly sample a value from a probability distribution, and the current discrete time-step is $t$:

$$a_t \sim \pi(a \mid s_t) \qquad s_{t+1},\, r_t \sim \mathrm{Env}(s_t, a_t) \qquad R_t = \sum_{k=0}^{\infty} r_{t+k}$$

The policy is some function (usually parameterized, sometimes differentiable) which uses the full history of experience of the agent to choose an action after presentation of the new state of the environment. For simplicity, we’ll only consider discrete-time episodic environments, though most of this could be extended or approximated for continuous, infinite-horizon scenarios. We will also only consider the function approximation case, where we approximate policies and value functions using neural networks parameterized by a vector of learnable weights $\theta$.

The difficulty of reinforcement learning, as compared to other sub-fields of machine learning, is the Credit Assignment Problem. This is the problem that there are many many actions which lead to any given reward (and many rewards resulting from a single action), and it’s not easy to pick out the “important actions” which led to the good (or bad) reward. With infinite resources and enough samples, the statistics will reveal the truth, but we’d like to make sure an algorithm converges in our lifetime.


Approaches to RL

The goal with all reinforcement learners is to learn a good policy which will take the best actions in order to generate the highest return. This can be accomplished a few different ways.

Value iteration, temporal difference (TD) learning, and Q-learning approximate the value of states $V(s_t)$ or state-action pairs $Q(s_t, a_t)$, and use the “surprise” from incorrect value estimates to update a policy. These methods can be more flexible for learning off-policy, but can be difficult to apply effectively for continuous action spaces (such as robotics).

An alternative approach, called Policy Gradient methods, models a direct mapping from states to actions. With a little math, one can compute the effect of a parameter $\theta_i$ on the future cumulative return of the policy. Then to improve our policy, we “simply” adjust our parameters in a direction that will increase the return. In truth, what we’re doing is increasing the probability of choosing good actions, and decreasing the probability of choosing bad actions.

I put “simply” in quotes because, although the math is reasonably straightforward, there are a few practical hurdles to overcome to be able to learn policies effectively. Policy gradient methods can be applied to a wide range of problems, with both continuous and discrete action spaces. In the next section we’ll see a convenient theorem that makes the math of computing parameter gradients tractable.

We can combine value iteration and policy gradients using a framework called Actor-Critic. In this, we maintain an actor which typically uses a policy gradient method, and a critic, which typically uses some sort of value iteration. The actor never sees the actual reward. Instead, the critic intercepts the rewards from the trajectory and critiques the chosen actions of the actor. This way, the noisy (and delayed) reward stream can be smoothed and summarized for better policy updates. Below, we will assume an Actor-Critic approach, but we will focus only on how to update the parameters of the actor.


Policy Gradient Theorem

I’d like to briefly review a “trick” which makes policy gradient methods tractable: the Policy Gradient Theorem. If we assume there is some mapping of any state/action pair to a real-valued return:

$$f(x) \in \mathbb{R}$$

then we’d like to compute the gradient, with respect to $\theta$, of the expected total return $E_x[f(x)]$:

$$\nabla_\theta\, E_x[f(x)] = \nabla_\theta \int p(x \mid \theta)\, f(x)\, dx = \int p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)\, f(x)\, dx = E_x\big[f(x)\, \nabla_\theta \log p(x \mid \theta)\big]$$

We bring the gradient inside the integral and divide by the probability to get the gradient of log probability (grad-log-prob) term, along with a well-formed expectation. This trick means we don’t actually need to compute the integral in order to estimate the total gradient. Additionally, it’s much easier to take the gradient of a log probability, as it decomposes into a sum of terms. This is important, because we only need to sample rewards and compute $\nabla_\theta \log p(x \mid \theta)$, which depends solely on our policy and the states/rewards that we see. There’s no need to understand or estimate the transition probability distribution of the underlying environment! (This is called “model-free” reinforcement learning.) Check out John Schulman’s lectures for a great explanation and more detail.
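
As a quick sanity check of the grad-log-prob trick on a toy problem: take $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so $E_x[f(x)] = \theta^2 + 1$ and the true gradient is $2\theta$. Since $\nabla_\theta \log p(x \mid \theta) = x - \theta$, a Monte Carlo estimate is just a sample mean:

θ = 1.5
xs = θ .+ randn(1_000_000)                   # samples from p(x|θ)
ĝ = sum(xs.^2 .* (xs .- θ)) / length(xs)     # ≈ 2θ = 3.0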


Improvements to the policy gradient formula

Now we have this easy-to-calculate formula to estimate the gradient, so we can just plug in the formulas and feed it into a gradient descent optimization, and we’ll magically learn optimal parameters… right? Right?!?

Sadly, due to weak correlation of actions to rewards resulting from the credit assignment problem, our gradient estimates will be extremely weak and noisy (the signal to noise ratio will be small, and variance of the gradient will be high), and convergence will take a while (unless it diverges).

One improvement we can make is to realize that, when deciding which actions to select, we only care about the relative difference in state-action values. Suppose we’re in a really awesome state, and no matter what we do we’re destined to get a reward of at least 100, but there’s a single action which would allow us to get 101. In this case we care only about the difference: $101 - 100 = 1$.

This is the intuition behind the advantage function $A^\pi(s,a)$ for a policy $\pi$. It is defined as the difference between the state-action value function $Q^\pi(s,a)$ and the state value function $V^\pi(s)$:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

Going back to the policy gradient formula, we can subtract a zero-expectation baseline $b_t(s_t)$ from our score function $f(x)$ without changing the expectation. If we choose the score function to be the state-action value function $Q^\pi(s,a)$, and the baseline to be the state value function $V^\pi(s)$, then the policy gradient $g$ is of the form:

$$g = E\Big[\sum_{t=0}^{\infty} \big(Q^\pi(s_t, a_t) - V^\pi(s_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big] = E\Big[\sum_{t=0}^{\infty} A^\pi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

The intuition with this formula is that we wish to increase the probability of better-than-average actions, and decrease the probability of worse-than-average actions. We use a discount factor $\gamma$ to control the impact of our value estimation. With $\gamma$ near 0, we approximate the next-step return. With $\gamma$ near 1, we approximate the sum of all future rewards. If episodes are very long (or infinite), we may need $\gamma < 1$ for tractability/convergence.

This formula is an example of an Actor-Critic algorithm, where the policy $\pi$ (the actor) adjusts its parameters by using the “advice” of a critic. In this case we use an estimate of the advantage to critique our policy choices.


Generalized Advantage Estimator (GAE)

Schulman et al use a discounted sum of TD residuals, where the TD residual of the value function $V$ is defined as:

$$\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$$

and compute an estimator of the k-step discounted advantage:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l\, \delta_{t+l}^V = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$$

Note: it seems that equation 14 from their paper has an incorrect subscript on $\delta$.

They define their generalized advantage estimator (GAE) as an exponentially-weighted average of the k-step advantage estimators above, which reduces to a sum of discounted TD residuals:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = (1 - \lambda)\big(\hat{A}_t^{(1)} + \lambda\, \hat{A}_t^{(2)} + \lambda^2\, \hat{A}_t^{(3)} + \cdots\big) = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V$$

This generalized estimator of the advantage function allows a trade-off of bias vs. variance using the parameter $0 \leq \lambda \leq 1$, similar to TD(λ). For $\lambda = 0$, the estimator reduces to the one-step TD residual $\delta_t^V$, which has low variance but is biased whenever the value estimate $V$ is inexact. As we increase $\lambda$ towards 1, we reduce the bias of our estimator but increase its variance.
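
As a concrete sketch (not the JuliaML API), the GAE advantages for a finished episode can be computed with a single backward pass over the rewards r and value estimates V (with V of length T+1, so V[T+1] is the value of the state after the final step):

function gae(r, V; γ = 0.99, λ = 0.96)
    T = length(r)
    A = zeros(T)
    acc = 0.0
    for t in T:-1:1
        δ = r[t] + γ*V[t+1] - V[t]    # TD residual δ_t^V
        acc = δ + γ*λ*acc             # Â_t = δ_t^V + γλ Â_{t+1}
        A[t] = acc
    end
    A
end

Note that this backward pass needs the whole episode before any update can happen, which is exactly what motivates the online variant below.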


Online GAE

We’ve come a long way. We now have a low(er)-variance approximation to the true policy gradient. In github speak: :tada: But for problems I care about (for example high frequency trading strategies), it’s not very practical to compute forward-looking infinite-horizon returns before performing a gradient update step.

In this section, I’ll derive an online formula which is equivalent to the GAE policy gradient above, but which uses eligibility traces of the gradients of log probabilities to compute a gradient estimate on every reward, as it arrives. Not only will this prove to be more efficient, but there will be massive savings in memory and compute requirements, at the cost of a slightly more complex learning process. (Luckily, I code in Julia)

Notes: See Wawrzyński et al for an alternate derivation. These methods using eligibility traces are closely related to REINFORCE.

Let’s get started. We are trying to reorganize the many terms of the policy gradient formula so that the gradient is of the form $g = E^\pi[\sum_{t=0}^\infty{r_t \psi_t}]$, where $\psi_t$ can depend only on the states, actions, and rewards that occurred before (or immediately after) the arrival of $r_t$. We will solve for an online estimator of the policy gradient $\hat{g}$:

$$\hat{g} = \sum_{t=0}^{\infty} \hat{A}_t^{GAE(\gamma, \lambda)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) = \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V$$

In order to simplify the derivation, I’ll introduce the shorthand $\nabla_t \equiv \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, so that

$$\hat{g} = \sum_{t=0}^{\infty} \nabla_t \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}^V$$

and then expand the sum:

$$\hat{g} = \nabla_0 \big(\delta_0^V + \gamma\lambda\, \delta_1^V + (\gamma\lambda)^2 \delta_2^V + \cdots\big) + \nabla_1 \big(\delta_1^V + \gamma\lambda\, \delta_2^V + \cdots\big) + \nabla_2 \big(\delta_2^V + \cdots\big) + \cdots$$

and collect the $\delta_t^V$ terms:

$$\hat{g} = \delta_0^V\, \nabla_0 + \delta_1^V \big(\gamma\lambda\, \nabla_0 + \nabla_1\big) + \delta_2^V \big((\gamma\lambda)^2 \nabla_0 + \gamma\lambda\, \nabla_1 + \nabla_2\big) + \cdots$$

and summarize:

$$\hat{g} = \sum_{t=0}^{\infty} \delta_t^V \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, \nabla_k$$

If we define our eligibility trace as the inner sum in that equation:

$$\epsilon_t = \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, \nabla_k$$

and convert it to a recursive formula:

$$\epsilon_t = \gamma\lambda\, \epsilon_{t-1} + \nabla_t, \qquad \epsilon_{-1} = 0$$

then we have our online generalized advantage estimator for the policy gradient:

$$\hat{g} = \sum_{t=0}^{\infty} \delta_t^V\, \epsilon_t$$

So at each time-step, we compute the gradient term $\hat{g}_t = \delta_t^V \epsilon_t$ as the product of the TD(0) error from our critic and the accumulated log-prob gradients of our policy. We could update our parameters online at the end of each episode, or at each step, or in batches, or using some sort of smoothing method.
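
To make the bookkeeping concrete, here is a minimal sketch of that accumulation (not the JuliaML implementation): δ holds the per-step TD(0) residuals from the critic, and ∇ holds the policy’s grad-log-prob vectors, both available as each reward arrives.

function online_gae(δ, ∇; γ = 0.99, λ = 0.96)
    ε = zeros(length(∇[1]))      # eligibility trace ε_t
    ĝ = zeros(length(∇[1]))      # accumulated policy gradient estimate
    for t in eachindex(δ)
        ε = γ*λ*ε + ∇[t]         # ε_t = γλ ε_{t-1} + ∇_t
        ĝ += δ[t] * ε            # add ĝ_t = δ_t^V ε_t (or apply it immediately)
    end
    ĝ
end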


Summary

In this post I covered some basic terminology, and summarized the theorems and formulas that comprise Generalized Advantage Estimation. We extended GAE to the online setting in order to give us more flexibility when computing parameter updates. The hyperparameter $\gamma$ allows us to control our trust in the value estimation, while the hyperparameter $\lambda$ allows us to assign more credit to recent actions.

In future posts, I intend to demonstrate how to use the online GAE formulas derived above in practical algorithms to solve problems in robotic simulation and other complex environments. If you need help solving difficult problems with data, or if you see mutual benefit in collaboration, please don’t hesitate to get in touch.