By: Tom Breloff
Re-posted from: http://www.breloff.com/layernorm/
Layer Normalization is a technique developed by Ba, Kiros, and Hinton for normalizing the activations of a neural network layer as a whole (as opposed to Batch Normalization and its variants, which normalize each neuron separately using statistics computed across a mini-batch). In this post I’ll show my derivation of analytical gradients for Layer Normalization using an online/incremental weighting of the estimated moments for the layer.
Background and Notation
Training deep neural networks (and likewise recurrent networks, which are deep through time) with gradient descent has been a difficult problem, largely due to vanishing and exploding gradients. One solution is to normalize layer activations and learn the bias (b) and gain/scale (g) as part of the learning algorithm. Online layer normalization can be summed up as learning the parameter arrays g and b in the learnable transformation

$$y = \frac{g}{\sigma_t} \odot (a - \mu_t) + b$$
The vector a is the input to our LayerNorm layer and the result of a Linear transformation of x. We keep a running mean ($\mu_t$) and standard deviation ($\sigma_t$) of a using a time-varying weighting factor ($\alpha_t$).
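For concreteness, one common form of such an incremental update is the exponentially weighted scheme below; treat this as illustrative, since the exact weighting is spelled out in the scanned derivation:

$$\mu_t = (1 - \alpha_t)\,\mu_{t-1} + \alpha_t\,\bar{a}_t, \qquad \sigma_t^2 = (1 - \alpha_t)\,\sigma_{t-1}^2 + \alpha_t\,\operatorname{var}(a_t)$$

where $\bar{a}_t$ and $\operatorname{var}(a_t)$ are the sample mean and variance of the current layer input.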
Derivation
Due mostly to LaTeX-laziness, I present the derivation in scanned form. A PDF version can be found here.
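For reference, here are the standard layer-norm gradients for the simpler, non-online case in which $\mu$ and $\sigma$ are recomputed from the current $a$ and the running-average weighting is ignored. Writing $\hat{a} = (a - \mu)/\sigma$, $y = g \odot \hat{a} + b$, $\delta = \partial L / \partial y$, and $d = g \odot \delta$:

$$\frac{\partial L}{\partial b} = \delta, \qquad \frac{\partial L}{\partial g} = \delta \odot \hat{a}, \qquad \frac{\partial L}{\partial a} = \frac{1}{\sigma} \left( d - \bar{d} - \hat{a}\,\overline{d \odot \hat{a}} \right)$$

where the bars denote means taken over the layer. The scanned derivation extends this to account for the $\alpha_t$-weighted running moments.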
Summary
Layer normalization is a nice alternative to batch or weight normalization. With this derivation, we can include it as a standalone learnable transformation within a larger network. In fact, this is already accessible using the nnet convenience constructor in Transformations:
using Transformations
nin, nout = 3, 5
nhidden = [4,5,4]
t = nnet(nin, nout, nhidden, :relu, :logistic, layernorm = true)
Network:
Chain{Float64}(
   Linear{3-->4}
   LayerNorm{n=4, mu=0.0, sigma=1.0}
   relu{4}
   Linear{4-->5}
   LayerNorm{n=5, mu=0.0, sigma=1.0}
   relu{5}
   Linear{5-->4}
   LayerNorm{n=4, mu=0.0, sigma=1.0}
   relu{4}
   Linear{4-->5}
   LayerNorm{n=5, mu=0.0, sigma=1.0}
   logistic{5}
)
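For readers who want the transformation spelled out without the Transformations.jl machinery, here is a minimal standalone sketch of the forward pass with running moments. The type and function names (OnlineLayerNorm, forward!) are mine, and the moment update uses the simple exponential weighting sketched earlier, which may not match the exact scheme from the derivation or the Transformations.jl internals:

using Statistics   # mean, var

# Minimal illustrative LayerNorm with running moments (not the
# Transformations.jl implementation).
mutable struct OnlineLayerNorm{T}
    g::Vector{T}   # learned gain
    b::Vector{T}   # learned bias
    mu::T          # running mean of the layer input
    sigma::T       # running standard deviation of the layer input
end

OnlineLayerNorm(n::Int) = OnlineLayerNorm(ones(n), zeros(n), 0.0, 1.0)

function forward!(ln::OnlineLayerNorm, a::AbstractVector, alpha::Real)
    # update the running moments with weighting factor alpha (assumed scheme)
    ln.mu    = (1 - alpha) * ln.mu + alpha * mean(a)
    ln.sigma = sqrt((1 - alpha) * ln.sigma^2 + alpha * var(a))
    # normalize the input, then apply the learned gain and bias
    ln.g .* (a .- ln.mu) ./ ln.sigma .+ ln.b
end

ln = OnlineLayerNorm(4)
y = forward!(ln, randn(4), 0.1)   # one normalized layer activation

In Transformations.jl, all of this, plus the gradient computation from the derivation, is handled by the LayerNorm transformations shown in the chain above.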