Author Archives: Tom Breloff

Plots Tutorial: Ecosystem and Pipeline

By: Tom Breloff

Re-posted from: http://www.breloff.com/plots-video/

Plots is a complex and powerful piece of software, with features and functionality that many users probably don't realize exist. In this video tutorial, I try to explain where Plots fits into the Julia landscape and how Plots turns a simple command into a beautiful visualization.


If you have questions or want to request video tutorials on other topics, come chat.

JuliaML Transformations: Internal Design

By: Tom Breloff

Re-posted from: http://www.breloff.com/transformations-video-internals/

In this video post, I expand on my introduction to Transformations and show the core idea behind the design: namely, that each transformation has a black-box representation of its input, output, and (optionally) parameters, which are stored as vectors in contiguous memory. Julia's excellent type system and efficient array views allow for very convenient and intuitive structures.
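
As a rough illustration of that idea, here is a toy sketch under my own naming (ToyLinear, theta), not the actual Transformations.jl types: a transformation stores one flat parameter vector and works only through views into it.

# Toy sketch of the "contiguous storage + views" idea, not the real
# Transformations.jl types: all parameters live in one flat vector, and the
# transformation works with reshaped views into that storage.
struct ToyLinear{T,M<:AbstractMatrix{T},V<:AbstractVector{T}}
    theta::Vector{T}   # all parameters, contiguous
    W::M               # matrix-shaped view into theta
    b::V               # bias view into theta
end

function ToyLinear(T::Type, nin::Integer, nout::Integer)
    theta = randn(T, nout * nin + nout)
    W = reshape(view(theta, 1:nout*nin), nout, nin)
    b = view(theta, nout*nin+1:length(theta))
    ToyLinear(theta, W, b)
end

# forward pass operates on the views; an optimizer could update theta directly
(t::ToyLinear)(x::AbstractVector) = t.W * x .+ t.b

t = ToyLinear(Float64, 3, 4)
y = t(rand(3))   # 4-element output
t.theta .= 0.0   # mutating the flat vector is immediately visible through W and b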


For questions, comments, or if you’re interested in collaborating, please join the JuliaML Gitter chat.

Online Layer Normalization: Derivation of Analytical Gradients

By: Tom Breloff

Re-posted from: http://www.breloff.com/layernorm/

Layer Normalization is a technique developed by Ba, Kiros, and Hinton for normalizing neural network layers as a whole (as opposed to Batch Normalization and variants which normalize per-neuron). In this post I’ll show my derivation of analytical gradients for Layer Normalization using an online/incremental weighting of the estimated moments for the layer.


Background and Notation

Training deep neural networks (and likewise recurrent networks, which are deep through time) with gradient descent has been a difficult problem, partially (mostly) due to the issue of vanishing and exploding gradients. One solution is to normalize layer activations, and to learn the shift (b) and scale (g) as part of the learning algorithm. Online layer normalization can be summed up as learning the parameter arrays g and b in the learnable transformation:

$$y = g \odot \frac{a - \mu_t}{\sigma_t} + b$$

The vector a is the input to our LayerNorm layer and the result of a Linear transformation of x. We keep a running mean ($\mu_t$) and standard deviation ($\sigma_t$) of a using a time-varying weighting factor ($\alpha_t$).
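
To make the bookkeeping concrete, here is a minimal sketch of the online forward pass. This is my own illustration, not the Transformations.jl implementation: the names, the choice $\alpha_t = 1/t$, and the exact moment-update rule are all assumptions for the sake of the example.

using Statistics

# Minimal sketch of an online LayerNorm forward pass (not the actual
# Transformations.jl code).  mu and sigma are scalar moments of the layer
# input a, updated with an assumed weighting alpha_t = 1/t.
mutable struct OnlineLayerNorm{T}
    g::Vector{T}   # learnable scale
    b::Vector{T}   # learnable shift
    mu::T          # running mean of a
    sigma::T       # running standard deviation of a
    t::Int         # time step, used to compute alpha_t
end

OnlineLayerNorm(T::Type, n::Integer) =
    OnlineLayerNorm(ones(T, n), zeros(T, n), zero(T), one(T), 0)

function forward!(ln::OnlineLayerNorm{T}, a::AbstractVector) where T
    ln.t += 1
    alpha = one(T) / ln.t                                        # assumed alpha_t
    ln.mu = (1 - alpha) * ln.mu + alpha * mean(a)                # running mean
    ln.sigma = sqrt((1 - alpha) * ln.sigma^2 + alpha * var(a))   # running std
    ln.g .* (a .- ln.mu) ./ ln.sigma .+ ln.b   # y = g .* (a - mu_t) / sigma_t + b
end

ln = OnlineLayerNorm(Float64, 4)
y = forward!(ln, randn(4))   # 4-element normalized, scaled, and shifted output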

Derivation

Due mostly to LaTeX-laziness, I present the derivation in scanned form. A PDF version can be found here.
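
For reference, and just as standard calculus on the transformation above rather than part of the scanned derivation: writing $\hat{a} = (a - \mu_t)/\sigma_t$ for the normalized input and $\delta = \partial L / \partial y$ for the gradient arriving from the next layer (both my notation), the parameter gradients are

$$\frac{\partial L}{\partial g_i} = \delta_i \, \hat{a}_i \qquad \frac{\partial L}{\partial b_i} = \delta_i$$

The interesting part of the derivation is $\partial L / \partial a$, which also has to account for the dependence of $\mu_t$ and $\sigma_t$ on $a$ through the weighting $\alpha_t$.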

Summary

Layer normalization is a nice alternative to batch or weight normalization. With this derivation, we can include it as a standalone learnable transformation in a larger network. In fact, this is already accessible using the nnet convenience constructor in Transformations:

using Transformations
nin, nout = 3, 5              # network input and output dimensions
nhidden = [4, 5, 4]           # sizes of the three hidden layers
# relu activations on the hidden layers, logistic on the output layer,
# with a LayerNorm inserted after every Linear transformation
t = nnet(nin, nout, nhidden, :relu, :logistic, layernorm = true)

Network:

Chain{Float64}(
   Linear{3-->4}
   LayerNorm{n=4, mu=0.0, sigma=1.0}
   relu{4}
   Linear{4-->5}
   LayerNorm{n=5, mu=0.0, sigma=1.0}
   relu{5}
   Linear{5-->4}
   LayerNorm{n=4, mu=0.0, sigma=1.0}
   relu{4}
   Linear{4-->5}
   LayerNorm{n=5, mu=0.0, sigma=1.0}
   logistic{5}
)