
LMMs

By: Oisin Fitzgerald

Re-posted from: https://oizin.github.io/posts/linear-mixed-effects/index.html

Linear mixed effect models

Oisín Fitzgerald, December 2021

  1. Introduction
    1. Mixed effect models
    2. Linear mixed effect models
    3. Motivating example
  2. Estimation
    1. Theory
    2. Algorithm
    3. Julia implementation
    4. Example revisited
  3. Conclusion
  4. References

This is an introduction to linear mixed effect models. It is based on Simon Wood's book on generalised additive models and notes and articles by Douglas Bates, listed at the end. Code written in Julia.

A bit rough – comments welcome!

Introduction

Mixed effect models

Multilevel or mixed effect models are useful whenever our data contains repeated samples from the "statistical units" that make up our data. There is no fixed definition of a unit – what matters is that we consider it likely that the data within each unit is correlated. For example:

  • Repeated blood pressure measurements from a group of individuals. Each individual's blood pressure is likely to be highly correlated over time.

  • Assessment of student performance over several schools. Since each school has its own set of teachers, policies, and enrolment area, student performance within a school may be correlated.

Why does this correlation matter? Well, if we have \(N\) units each with \(n\) measurements then although our data contains \(N \times n\) observations we might actually have much closer to \(N\) pieces of independent information. How much closer depends on the strength of the (positive) correlation within a unit. At the extreme, if we had a sample of people with extremely stable blood pressures, and we observe \(\text{person}_1 = (121/80, 120/80,…)\), \(\text{person}_2 = (126/78, 126/78,…)\) and so on, then clearly we really only have \(\approx N\) pieces of independent information. Essentially all the information (variation) in the data is in the differences between units, rather than temporal changes within units (since these are small or nonexistent).

Below is an example of the number of cars per capita in certain countries over time using the gasoline dataset from the R package plm. Some noticeable facts about this dataset are 1) there is a clear difference in the number of cars between countries in the initial year of study (1960), 2) this initial difference is also far larger than the change within any one country over the time course of the study, and 3) each country changes in a steady, quite predictable fashion. The dataset contains other variables (income per capita and gas price) which may explain some of this variation in initial conditions and rate of change.

using CairoMakie, DataFrames, RDatasets, Statistics
df = dataset("plm", "Gasoline")
f = Figure(resolution = (800, 400))
ax = Axis(f[1,1], xlabel = "Year", ylabel = "Cars per capita (log scale)",
    title = "Variation at baseline and over time")
for country in unique(df.Country)
    msk = df.Country .== country
    lines!(ax,df.Year[msk],df.LCarPCap[msk],color = :lightblue)
end 
f
Cars per capita for 18 countries (1960-1980)

What mixed effect models do is divide up the variation that exists in the data into several "buckets". At the highest level there is explained and unexplained variation. Explained variation is variation that is accounted for by your predictor features (covariates). These terms are often called fixed effects. For example, differences in blood pressure may be accounted for by differences in the amount of salt intake or exercise quantity. Note that this can be both between and within units: two people may have different levels of average exercise quantity, and one person may change their exercise quantity over time. Longitudinal data structures are very powerful in allowing us to examine differences in the effect of a variable both between and within units. For instance, if we found that average exercise amount predicted lower blood pressure but an individual increasing their exercise amount did not, we might wonder whether 1) exercise is a proxy for something else or 2) the change takes a long time to appear.

Looking again at the gasoline dataset, we can see that the number of cars per capita is higher in wealthier countries (the between country relationship), and also that as a country increases in wealth the number of cars per capita increases (the within country relationship). Indeed the within country relationship is quite clear and strong. In many cases (e.g. certain physiological signals) this relationship is often harder to discern due to the variation within units being of comparable size to "noise" factors such as measurement error and natural variation.

gdf = groupby(df,:Country)
mdf = combine(gdf, :LCarPCap => mean, :LIncomeP => mean)
df = leftjoin(df,mdf,on=:Country)
df.LIncomeP_change = df.LIncomeP - df.LIncomeP_mean
df.LCarPCap_change = df.LCarPCap - df.LCarPCap_mean
f = Figure(resolution = (800, 400))
ax1 = scatter(f[1, 1],mdf.LCarPCap_mean,mdf.LIncomeP_mean)
ax1.axis.xlabel = "Mean cars per capita (log scale)"
ax1.axis.ylabel = "Mean income per capita (log scale)"
ax1.axis.title = "Variation between"
ax2 = scatter(f[1, 2],df.LCarPCap_change,df.LIncomeP_change)
ax2.axis.xlabel = "Change in cars per capita (log scale)" 
ax2.axis.ylabel = "Change in income per capita (log scale)"
ax2.axis.title = "Variation within"
f
Cars per capita and income per capita for 18 countries (1960-1980)

Unexplained variation is any variation that cannot be explained by values of (or variation in) the covariates. It is here that we really see the usefulness of mixed effect models. This unexplained variation is decomposed into the unexplained variation between units and within units. The between unit variation (the random effects) is the selling point of mixed effect models. Rather than associate a single fixed effect with each term in our model (e.g. the intercept) we might associate a distribution of effects. This distribution might have a small or large degree of variation depending on the extent of the relevant unexplained variation that exists between our units. A notable fact is that we can have between unit variation in any term within our model; for instance the units might differ in their baseline values, suggesting random intercepts. They might also differ in the effect of a particular variable (e.g. time, effect of a drug), giving a random slope. A cartoon version of a random intercept and random slope situation is shown below.
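To make that cartoon concrete, here is a minimal simulation sketch (my own, not the original figure, reusing the CairoMakie setup from above): each unit shares the same average trend but gets its own intercept, and optionally its own time slope.

using CairoMakie, Random
Random.seed!(1)
n_units = 5
t = 0:9
f = Figure(resolution = (800, 400))
ax1 = Axis(f[1,1], xlabel = "Time", ylabel = "y", title = "Random intercepts")
ax2 = Axis(f[1,2], xlabel = "Time", ylabel = "y", title = "Random intercepts and slopes")
for i in 1:n_units
    b0 = randn()          # unit-specific intercept deviation
    b1 = 0.3 * randn()    # unit-specific slope deviation
    lines!(ax1, t, 1.0 .+ b0 .+ 0.5 .* t)          # common slope, unit intercept
    lines!(ax2, t, 1.0 .+ b0 .+ (0.5 + b1) .* t)   # unit intercept and slope
end
f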

A summary of the decomposition of variance view:

  • Explained variation: accounted for by the covariates (the fixed effects), both between and within units.

  • Unexplained variation between units: the random effects (e.g. random intercepts and slopes).

  • Unexplained variation within units: the residual "noise".

The descriptions above suggest you might only have one "level" of units. However, multilevel models can account for many levels of hierarchical clustering, for example measurements within patients within medical practices.

Linear mixed effect models

The main practical issue with mixed effect models is that while we may be able to write down a model that accounts for the variation we believe exists in the data (e.g. following some exploratory data analysis), fitting it turns out to be much harder than for standard linear models. The remainder of this post demonstrates the estimation process for linear mixed effects models. With the notation \(x\) / \(X\) referring to a vector / matrix, and \(x_i\) / \(X_{ij}\) to an element of a vector / matrix, a linear mixed effects model (LMM) can be written as

\[y = X\beta + Z b + \epsilon, b \sim N(0,\Lambda_{\theta}), \epsilon \sim N(0,\Sigma_{\theta})\]

where

  • \(\beta \in \mathcal{R}^p\) are the fixed effects, analogous to the coefficients in a standard linear model.

  • \(X \in \mathcal{R}^{Nn \times p}\) is the model matrix for the fixed effects containing the covariates / features.

  • The random vector \(b\) contains the random effects, with zero expected value and covariance matrix \(\Lambda_{\theta}\)

  • \(Z \in \mathcal{R}^{Nn \times n}\) is the model matrix for the random effects

  • \(\Sigma_{\theta}\) is the residual covariance matrix. It is often assumed that \(\Sigma_{\theta} = \sigma^2 I\)

  • \(\theta\) is the vector of variance-covariance components, i.e. the random effect and residual variance parameters.

Motivating example

Using the gasoline dataset consider modelling car ownership (per capita) as a function of time (year), income (per capita) and gas price (inflation adjusted).

f = Figure(resolution = (800, 600))
ax = Axis(f[1,1:2], xlabel = "Year", ylabel = "Cars per capita (log scale)",
    title = "Variation at baseline and over time")
for country in unique(df.Country)
    msk = df.Country .== country
    lines!(ax,df.Year[msk],df.LCarPCap[msk],color = :lightblue)
end 
ax1 = scatter(f[2, 1],df.LIncomeP,df.LCarPCap)
ax1.axis.ylabel = "Cars per capita (log scale)"
ax1.axis.xlabel = "Income per capita (log scale)"
ax2 = scatter(f[2, 2],df.LRPMG,df.LCarPCap)
ax2.axis.ylabel = "Cars per capita (log scale)"
ax2.axis.xlabel = "Gasoline price (log scale)"
f
Cars per capita compared to several factors for 18 countries (1960-1980)

As mentioned above a commonly used form of the LMM is the random intercept model. In this situation for a single level (country over time) the resulting model for country \(i\) at time \(j\) is

\[y_{ij} = \beta_0 + \beta_1 \text{year}_{ij} + \beta_2 \text{income}_{ij} + \beta_3 \text{gas}_{ij} + b_i + \epsilon_{ij}, b_i \sim N(0,\sigma_b^2), \epsilon_{ij} \sim N(0,\sigma_e^2)\]

I won't go into its construction but it is worth thinking about what \(Z\) would look like in this case (if you run the code below you can print out \(Z\)), and how it would change if we added a time random effect. Doing this will give you a sense of how the size of \(Z\) can grow quite quickly while remaining a largely sparse matrix (filled with zeros).
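As a quick sketch (my own, using the same kron construction as the code later in this post, and assuming the rows of df are ordered by country as in the plm data), the random intercept \(Z\) is just an indicator matrix of country membership:

using LinearAlgebra, SparseArrays
n = length(unique(df.Country))   # 18 countries
N = length(unique(df.Year))      # 19 years
# one column per country, with a 1 in the rows belonging to that country
Z = sparse(kron(sparse(1I, n, n), repeat([1], N)))
size(Z)   # (342, 18)

Adding a random effect for time would add another column per country, so \(Z\) grows with both the number of units and the number of random effect terms.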

Estimation

As mentioned, LMMs are tricky to estimate, largely due to the presence of the unobserved random effects and the additional need to decompose the outcome variance into several variance-covariance parameters. It is worth understanding the estimation process at least superficially as it can aid in debugging (commonly "why is my model taking so long to estimate!?") and in understanding warning messages when using well tested packages written by others.

Theory

Feel free to skim this section. A common approach to estimation of LMMs is maximum likelihood estimation (MLE) or restricted MLE (REML); I'll just cover MLE here. As noted in Wood (2017), estimation of \(\beta\) and \(\theta\) could be based on the marginal distribution \(p(y)\) of the outcome

\[y \sim N(X\beta,Z\Lambda_{\theta}Z^t + \Sigma_{\theta})\]

however this would involve the inversion of the \(Nn \times Nn\) matrix \(Z\Lambda_{\theta}Z^t + \Sigma_{\theta}\). As a result estimation is generally based on the expression

\[p(y) = \int p(y,b) db = \int p(y|b)p(b) db\]

It is worth listing out the log pdfs of the distributions that are going to come up in the derivation of the final expression. The log transform is taken to remove the exponents and convert multiplication into addition. Here and below \(c_x\) denotes a normalising constant for the distribution of \(x\).

  • \(\text{log}p(y,b) = \text{log}p(y|b) + \text{log}p(b)\)

  • \(\text{log}p(y|b) = c_{y|b} - \frac{1}{2}\text{log}|\Sigma_{\theta}| - \frac{1}{2}(y - X\beta - Z b)^t \Sigma_{\theta}^{-1} (y - X\beta - Z b)\)

  • \(\text{log}p(b) = c_{b} - \frac{1}{2}\text{log}|\Lambda_{\theta}| - \frac{1}{2} b^t \Lambda_{\theta}^{-1} b\)

Now we are ready to derive the estimation equations. Let \(\hat{b}\) be the maximiser of \(p(y,b)\) with respect to \(b\). Then utilising a Taylor expansion of \(\text{log}p(y,b)\) about \(\hat{b}\) on the second line below (the first order term vanishes at \(\hat{b}\), and since \(\text{log}p(y,b)\) is quadratic in \(b\) the expansion is exact) we have

\[\begin{aligned}
p(y) &= \int p(y,b) db = \int \text{exp}\{\text{log} p(y,b)\}db \\
&= \int \text{exp}\{\text{log} p(y,\hat{b}) + \frac{1}{2}(b-\hat{b})^t \frac{\partial^2 \text{log} p(y,\hat{b})}{\partial b \partial b^t} (b-\hat{b})\}db \\
&= p(y,\hat{b}) \int \text{exp}\{-(b-\hat{b})^t (Z^t\Sigma_{\theta}^{-1} Z + \Lambda_{\theta}^{-1})(b-\hat{b})/2\}db \\
\end{aligned}\]

The term inside the integral can be recognised as an un-normalised Gaussian pdf with covariance \((Z^t\Sigma_{\theta}^{-1} Z + \Lambda_{\theta}^{-1})^{-1}\). The normalisation constant for this pdf would be \(\sqrt{|(Z^t\Sigma_{\theta}^{-1} Z + \Lambda_{\theta}^{-1})^{-1}| (2\pi)^{n}}\) and so, using the fact that \(|A^{-1}| = |A|^{-1}\), the result of the integral is

\[\begin{aligned}
p(y) &= p(y|\hat{b})p(\hat{b}) |Z^t\Sigma_{\theta}^{-1} Z + \Lambda_{\theta}^{-1}|^{-1/2} c_y \\
\end{aligned}\]

In practice we will work with the deviance (minus two times the log-likelihood). Substituting our expressions for \(\text{log} p(y|b)\) and \(\text{log} p(b)\) from above gives the quantity to be minimised as

\[
d(\beta,\theta) = -2l(\beta,\theta) = (y - X\beta - Zb)^t \Sigma_{\theta}^{-1} (y - X\beta - Zb) + b^t \Lambda_{\theta}^{-1} b +
\text{log}|\Sigma_{\theta}| + \text{log}|\Lambda_{\theta}| + \text{log}|Z^t\Sigma_{\theta}^{-1} Z + \Lambda_{\theta}^{-1}| + c_{y|b} + c_{b} + c_y \\
\]

Computation can be based on the observation that for a fixed \(\theta\) we can get estimates of \(\beta\) and \(b\) using the first two terms

\[
d_1(\beta,\theta) = (y - X\beta - Zb)^t \Sigma_{\theta}^{-1} (y - X\beta - Zb) + b^t \Lambda_{\theta}^{-1} b
\]
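Setting the derivatives of \(d_1\) with respect to \(\beta\) and \(b\) to zero gives a penalised version of the normal equations; with \(D = [X \ Z]\) this is the linear system the Julia code below solves directly (a sketch of the standard result, written in this post's notation):

\[
\left(D^t \Sigma_{\theta}^{-1} D + \begin{bmatrix} 0 & 0 \\ 0 & \Lambda_{\theta}^{-1} \end{bmatrix}\right)
\begin{bmatrix} \hat{\beta} \\ \hat{b} \end{bmatrix} = D^t \Sigma_{\theta}^{-1} y
\]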

Notice that the random effects (e.g. the individual intercept or feature effect) are shrinkage estimates of what we would get if we let every unit have its own intercept or feature effect, hence the term penalised least squares.

Then estimates of \(\theta\) can be based on the profile likelihood (deviance) \(d_p(\theta) = d(\hat{\beta},\theta)\).

Some other observations:

  • Several of the terms can be interpreted as complexity penalties on the random effects or variance-covariance parameters.

  • A nicer approach is to let \(b = \Gamma_{\theta} u\) where \(u\) is a spherical normal variable (uncorrelated equal variance) and \(\Lambda_{\theta} = \Gamma_{\theta}^t\Gamma_{\theta}\), reducing the dimension of \(\theta\) by one (Bates et al, 2014).

Algorithm

How does estimation go in practice? Often a gradient free optimisation algorithm (Nelder-Mead or BOBYQA) is used for \(d_p\).

  1. Inputs \(X\), \(Z\), \(y\), optimisation tolerance(s) \(\tau\)

  2. Initialise \(B^{(0)} = [\beta^{(0)}, b^{(0)}] = 0\), \(\theta^{(0)} = \mathbf{1}\)

  3. While \(\tau\) not met:

    1. \(B^{(k)}\): argmin \(d_1(\beta,\theta)\)

    2. \(\theta^{(k)}\): argmin \(d_p(\theta)\)

This is a high level (but reasonable for understanding) view of how software packages like lme4 or MixedModels.jl perform estimation for LMMs. See Bates et al. (2014) for a detailed overview of the numerical linear algebra considerations in the implementations.

Julia implementation

For a more complete idea of how to code LMMs in practice see the source code for MixedModels.jl. The code below estimates \(\beta\) and the variance components \(\theta\).

## libraries
# linear algebra
using LinearAlgebra, SparseArrays
# optimisation
using Optim
import Statistics

"""
Calculates the log likelihood for a LMM.
Internally calculates the fixed and random effects given estimates of the variance-covariance components,
modifying the first argument βb in place. Designed for `lmm_fit`.

Args
    βb  : vector of estimates of fixed and random effects
    D   : fixed and random effect design matrices
    DtD : D'D
    Dty : D'y
    y   : outcome vector
    logθ: log of variance-covariance components
    dim : tuple of dimensions
"""
function loglik!(βb,D,DtD,Dty,y,logθ,dim)
    σ,σ_b = exp.(logθ)
    # dimensions
    Nn,n,p = dim
    N = Nn/n
    # estimation of \beta and b given theta
    diagf = diagm([repeat([0.0],p);repeat([1/σ_b^2],n)])
    LL = DtD ./ σ^2 + diagf
    βb[:] = LL \ (Dty ./ σ^2)
    # -2 log likelihood (profile likelihood)
    logdetθ = logdet(DtD[(p+1):end,(p+1):end] ./ σ^2 + diagf[(p+1):end,(p+1):end])
    nll = (1/σ^2)*sum((y - D*βb).^2) + (1/σ_b^2)*sum(βb[(p+1):end].^2) + 2*logdetθ + n*log(σ_b^2) + Nn*log(σ^2) + n*log(2*π)
    nll ./ 2
end

"""
Estimate a LMM

Args
    X : Fixed effect design matrix
    Z : Random effect design matrix
    y : outcome
"""
function lmm_fit(X,Z,y)
    # dimensions / data
    Nn = length(y)
    n = size(Z)[2]
    p = size(X)[2]
    dim = (Nn,n,p)
    D = [X Z]
    DtD = D'D
    Dty = D'y
    # optimisation setup
    βb = zeros(n+p)
    θ0 = ones(2)
    # optimise
    opt = optimize(var -> loglik!(βb,D,DtD,Dty,y,var,dim), log.(θ0), NelderMead())
    θ = exp.(Optim.minimizer(opt))
    # output
    out = LMM(βb[1:p],θ,βb[(p+1):end],opt)
    out
end

"""
A struct to store the results of our LMM estimation
"""
struct LMM
    β
    θ
    b
    opt
end

# A small test - the output should be approx [1.0,3.0]
N, n, p = 30, 100, 10
ids = repeat(1:n,inner=N)
X = [repeat([1.0],N*n) randn(N*n,p)]
β = randn(p+1)
θ2 = 3.0
b = sqrt(θ2) .* randn(n)
Z = sparse(kron(sparse(1I, n, n),repeat([1],N)))
y = X * β + Z * b + randn(N*n);
res = lmm_fit(X,Z,y);
println("Variance components: ",round.(res.θ .^ 2,digits=3))
Variance components: [1.06, 3.274]

Clearly it is still worth using MixedModels.jl but the benefit of being able to code it yourself is the freedom you get to make changes in the underlying algorithm and see the effects.

Example revisited

Estimating the car ownership model using lmm_fit gives the following results.

df.Time = df.Year .- 1965
n = length(unique(df.Country))
N = length(unique(df.Year))
X = [repeat([1.0],size(df)[1]) df.Time df.LIncomeP df.LRPMG]
Z = sparse(kron(sparse(1I, n, n),repeat([1],N)))
y = df.LCarPCap
res = lmm_fit(X,Z,y);
println("Variance components: ",round.(res.θ .^ 2,digits=3))
println("Fixed effects: ",round.(res.β,digits=4))
Variance components: [0.031, 1.619]
Fixed effects: [6.6648, -0.0123, 2.5672, -0.1986]

Estimating the car ownership model from above using MixedModels.jl gives the following results.

using MixedModels
m1 = fit(MixedModel, @formula(LCarPCap ~ 1 + Time + LIncomeP + LRPMG + (1|Country)), df)
println(m1)
Linear mixed model fit by maximum likelihood
 LCarPCap ~ 1 + Time + LIncomeP + LRPMG + (1 | Country)
   logLik   -2 logLik     AIC       AICc        BIC    
    57.5230  -115.0460  -103.0460  -102.7952   -80.0371

Variance components:
            Column    Variance  Std.Dev. 
Country  (Intercept)  1.627124  1.275588
Residual              0.028976  0.170223
 Number of obs: 342; levels of grouping factors: 18

  Fixed-effects parameters:
────────────────────────────────────────────────────
                  Coef.  Std. Error      z  Pr(>|z|)
────────────────────────────────────────────────────
(Intercept)   6.69453    0.727785     9.20    <1e-19
Time         -0.0124492  0.00439073  -2.84    0.0046
LIncomeP      2.57169    0.104949    24.50    <1e-99
LRPMG        -0.195353   0.0807133   -2.42    0.0155
────────────────────────────────────────────────────

The results from the two approaches are similar; the minor differences can be attributed to the use of different optimisation routines. Interpreting the results, it looks like income is the most important factor in predicting increased car ownership. Gas price (with a negative effect) and the temporal trend are noisier, secondary factors. Indeed the sign for time is negative, which may be a result of some collinearity due to income and time increasing together. The intercept random effect still has reasonably large variation, although it is clearly smaller than what we would expect if time was the only covariate (see the first figure).

Conclusion

We've covered the background of why you might use mixed effects models, along with the estimation of linear mixed effects models. Some other interesting topics worth exploring are the estimation of generalised linear mixed effects models, and a comparison with taking a Bayesian approach to model estimation. Thanks for reading! :)

References

This document borrows from the following:

  • Wood, S. N. (2017). Generalized additive models: an introduction with R. CRC press.

  • Bates, D., Mächler, M., Bolker, B., & Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823.

  • MixedModels.jl: https://juliastats.org/MixedModels.jl/dev/

Gaussian processes and linear regression

By: Oisin Fitzgerald

Re-posted from: https://oizin.github.io/posts/gp-linear/index.html

Gaussian processes and linear regression

Oisín Fitzgerald, May 2021

A look at section 6.4 of:

Bishop C.M. (2006). Pattern recognition and machine learning. Springer.

https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

Basically this post goes through (Bayesian) linear regression from a Gaussian process (function space) point of view with some example Julia code to make things concrete.

Update (10/11/2021): Deleted "estimating the hyperparameters" section for now as it was too short and had no examples.

Overview

The dominant approach to solving regression problems in machine learning today is finding the parameters \(w\) of a model \(M_w\) that minimise a loss function \(L\) by optimally combining a set of basis vectors. These basis vectors can be the original data \(x_n\) or some transformation \(z_n = \phi(x_n)\) where \((y_n,x_n)\) is the \(n^{th}\) output-input pair \(n \in \{1,…,N\}\) and \(x_n\) is length \(p\) (the number of features). For example:

  • Linear regression: find the best set of weights \(w\) that minimise the mean square error \(\left\Vert Y - X w \right\Vert^2\), giving us predictions \(y_n = w^t x_n\).

  • Deep learning: at the other extreme of complexity we can think of deep learning as learning both the basis vectors and the weights. In a network with \(L\) layers the outputs and weights of the final layer are \(z_L\) and \(w_L\) giving us \(y_n = w_L^t z_L(x_n)\).

With Gaussian processes we are going to switch from thinking in terms of locating which parameters are most likely to have generated the data to considering the data a finite sample from a function that has particular properties. The parameter and function space viewpoints are not conflicting, for example for linear regression:

  1. Parameter space view: \(y\) is a combination of basis functions with the weights being from a multivariate normal distribution.

  2. Function space view: \(y(x_n)\) is a sample from a family of functions where any finite sample of points \(\{y_1,…,y_N\}\) follows a multivariate normal distribution.

From the parameter to function space view

To fully see the connection let's go from the parameter space view to the function space view for linear regression. The model is

\[y(x_n) = w^t x_n\]

In matrix form the above is written as \(Y = X w\), with each row of the \(N \times (p+1)\) matrix \(X\) made up of the \(N\) individual observations \(x^t_n\), each a vector of length \(p+1\), the number of features plus one (to have an intercept term). The prior distribution on our weights \(w\) reflects a lack of knowledge about the process

\[w \sim N(0,\alpha^{-1}I)\]

For example if there is one input we have \(w = (w_0, w_1)^t\) and setting \(\alpha = 1.0\) (arbitrarily) the prior looks like the graph below.

using Plots, Random, Distributions, LinearAlgebra
plotlyjs()
Random.seed!(1)
α = 1.0
d = MvNormal([0,0], (1/α)*I)
W0 = range(-1, 1, length=100)
W1 = range(-1, 1, length=100)
p_w = [pdf(d, [w0,w1]) for w0 in W0, w1 in W1]
contourf(W0, W1, p_w, color=:viridis,xlab="w0",ylab="w1",title="Prior: weight space")

Since we treat input features (the x's) as constants this implies a prior distribution for the output

\[y \sim N(0,\alpha^{-1} X X^t)\]

From the function space view we can randomly sample functions at finite spacings \(\mathfrak{X} = \{x_1,…,x_N\}\) from the prior.

Random.seed!(1)
x1 = range(-1, 1, length=100)
X = [repeat([1],100) x1]
d = MvNormal(repeat([0],100), (1/α)*X*transpose(X) + 1e-10*I)
p = plot(x1,rand(d),legend=false,seriestype=:line,title="Prior: function space",xlabel="x",ylabel="y")
for i in 1:20
    plot!(p,x1,rand(d),legend=false,seriestype=:line)
end

The matrix \(K = \text{cov}(y) = \alpha^{-1} X X^t\) is made up of elements \(K_{nm} = k(x_n,x_m) = \frac{1}{\alpha}x_n^t x_m\) with \(k(x,x')\) the kernel function. Notice that the kernel function \(k(x,x')\) returns the variance for \(x = x'\) and the covariance between \(x\) and \(x'\) otherwise. Also note that we are talking here about the covariance between observations, not features. \(K\) is a \(N \times N\) matrix and so can be quite large. There are many potential kernel functions other than \(k = x^tx\) but that's for another day.
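To make the kernel view concrete, here is a small sketch (my own, reusing \(X\) and \(\alpha\) from the code above) that builds \(K\) element by element and checks it against the matrix form:

# linear kernel k(x, x') = (1/α) xᵗx', as in the text above
kern(xn, xm) = (1/α) * dot(xn, xm)
K = [kern(X[n,:], X[m,:]) for n in 1:size(X,1), m in 1:size(X,1)]
# should match the covariance used when sampling from the prior
maximum(abs.(K .- (1/α)*X*transpose(X)))   # ≈ 0.0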

Modelling data with straight lines

We have a prior on \(y\) and then we observe some data. Let's assume there is noise in the data so we observe

\[t_n = y(x_n) + \epsilon_n\]

with \(\epsilon_n \sim N(0,\beta)\) random noise that is independent between observations and \(t = \{t_1,…,t_N\}\) the observed output values for input features \(x_n\).
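Under this noise model the covariance of the observed vector \(t\) is the kernel matrix plus a diagonal noise term (a quick restatement in this post's notation, with \(\beta\) the noise variance). This is the covariance used to simulate the data below, and it is the matrix \(C\) in the posterior code.

\[\text{cov}(t) = C = K + \beta I, \qquad C_{nm} = k(x_n,x_m) + \beta \delta_{nm}\]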

Random.seed!(1)
n = 10
x1 = range(-1, 1, length=n)
X = [repeat([1],n) x1]
β = 0.01
d = MvNormal(repeat([0],n), (1/α)*X*transpose(X) + β*I)
y = rand(d) 
p = scatter(x1,y,legend=false,title="Observed data",xlabel="x",ylabel="y")

At this point in practice we could estimate the noise parameter \(\beta\), but let's come back to that. For now assume we know that \(\beta = 0.01\). It is worth remembering there are no weights giving us the intercept, slope etc., but we can sample from our distribution of \(y|t\) or \(t_*|t\) given the observed data. Because our interest is in predicting for new observations we'd like to estimate the posterior \(p(t_*|t,x,x_*)\) for any future input \(x_*\). It turns out the posterior for any \(t_*\) is another normal distribution, which is coded below.
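Writing \(k\) for the vector with elements \(k_n = k(x_n, x_*)\) and \(c = k(x_*,x_*) + \beta\), the predictive distribution has the mean and variance computed one prediction at a time in the loop below:

\[p(t_*|t) = N(t_* \,|\, k^t C^{-1} t, \; c - k^t C^{-1} k)\]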

p = scatter(x1,y,legend=false,
            title="Posterior: function space",xlabel="x",ylabel="y")

# new X's over which to predict
xs = range(-1, 1, length=100)
Xs = [repeat([1],100) xs]
ys = zeros(100)

# get ready to construct posterior
σ2 = zeros(100)
C = (1/α)*X*transpose(X) + β*I
Cinv = inv(C)

# one prediction at a time 
for i in 1:100
    k = X * Xs[i,:]
    c = Xs[i,:]' * Xs[i,:] + β
    ys[i] = (k' * Cinv) * y
    σ2[i] = c - (k' * Cinv) * k
end
plot!(p,xs,ys, ribbon=(2*sqrt.(σ2),2*sqrt.(σ2)), lab="estimate")
plot!(p,xs,ys)

# noise free samples from the posterior
# all predictions at once
m = (Xs * X') * Cinv * y
CV = (Xs * Xs') - (Xs * X') * Cinv * (X * Xs')
CV = Symmetric(CV) + 1e-10*I
d = MvNormal(m, Symmetric(CV) + 1e-10*I)
for i in 1:20
    plot!(p,xs,rand(d),legend=false,seriestype=:line)
end

Forward mode autodiff

By: Oisin Fitzgerald

Re-posted from: https://oizin.github.io/posts/autodiff-forward/index.html

Forward mode automatic differentiation

Oisín Fitzgerald, April 2021

A look at the first half (up to section 3.1) of:

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18.

https://www.jmlr.org/papers/volume18/17-468/17-468.pdf

Introduction

Automatic differentiation (autodiff) reminds me of Arthur C. Clarke's quote "any sufficiently advanced technology is indistinguishable from magic". Whereas computer based symbolic and numerical differentiation seem like natural descendants of blackboard based calculus, the first time I learnt about autodiff (through Pytorch) I was amazed. It is not that the ideas underlying autodiff themselves are particularly complex; indeed Baydin et al.'s look at the history of autodiff puts Wengert's 1964 paper entitled "A simple automatic derivative evaluation program" as a key moment in forward mode autodiff history. (By the way, the paper is only 2 pages long and well worth taking a look at.) For me the magic comes from autodiff being this digitally inspired look at something as "ordinary" but so important to scientific computing and AI as differentiation.

Differentiation

If you are unsure of what the terms symbolic or numerical differentiation mean, I'll give a quick overview below but would encourage you to read the paper and its references for a more detailed exposition of their various strengths and weaknesses.

Numeric differentiation – wiggle the input

For a function \(f\) with a 1D input and output, numeric differentiation (also known as the finite difference method) comes quite naturally from the definition of the derivative. The derivative is

\[\frac{df}{dx} = \text{lim}_{h \rightarrow 0}\frac{f(x+h)-f(x)}{h}\]

so we approximate this expression by picking a small enough \(h\) (there are more complex schemes). There are two sources of error here: the first is from approximating the infinitesimally small \(h\) with a plain finitely small \(h\) (truncation error) and the second is from round-off error. Round-off error occurs because not every number is represented in the set of floating point numbers, so for a small \(h\) the difference \(f(x+h)-f(x)\) can be quite unstable. Unfortunately these two sources of error play against each other (see graph: on the left hand side round-off error dominates whereas on the right hand side truncation error dominates).

using Plots
plotlyjs()
h = 10 .^ range(-15, -3, length=1000)
x0 = 0.2
f(x) = (64*x*(1-x)*(1-2*x)^2)*(1-8*x+8*x^2)^2
df = (f.(x0 .+ h) .- f(x0)) ./ h
plot(log10.(h),log10.(abs.((df .- 9.0660864))),
xlabel="log10(h)",ylabel="log10(|Error|)",legend=false)

However, such small errors are actually not all that important in machine learning! The main issue with numeric differentiation for machine learning is that the number of required evaluations of our function \(f\) scales linearly with the number of dimensions of the gradient. In contrast backpropagation (an autodiff method) can calculate the derivatives in "two" evaluations of our function (one forward, one back).
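As a rough illustration of this scaling (my own sketch, not from the paper), a forward-difference approximation to a gradient needs one extra evaluation of \(f\) per input dimension:

# forward differences: p + 1 evaluations of f for a p-dimensional input
function fd_gradient(f, x; h=1e-6)
    fx = f(x)                  # one baseline evaluation
    g = similar(x)
    for i in eachindex(x)      # one further evaluation per dimension
        xh = copy(x)
        xh[i] += h
        g[i] = (f(xh) - fx) / h
    end
    g
end

fd_gradient(x -> sum(x.^2), [1.0, 2.0, 3.0])   # ≈ [2.0, 4.0, 6.0]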

Symbolic – fancy lookup tables

Symbolic differentiation is differentiation as you learnt it in school programmed into software: all the rules \(\frac{d}{dx}\text{cos}(x) = -\text{sin}(x), \frac{d}{dx} x^p = px^{(p-1)}, \frac{d}{dx}f(g(x)) = f'(g(x))g'(x)\) etc. are known and utilised by the software. If you evaluate the derivative of a function f using a symbolic programming language dfdx = derivative(f,x) the object returned dfdx is just whatever function the symbolic program matches as the derivative of f using its internal derivative lookup and application of the rules of differentiation (chain rule etc). It is manipulation of expressions. The main issue with symbolic differentiation for ML (which anyone who has used Mathematica for help with a difficult problem can attest to) is expression swell, where the derivative expression is exponentially longer than the original expression and involves repeated calculations.

Automatic differentiation – alter the program

Autodiff is the augmentation of a computer program to perform standard computations along with the calculation of derivatives; there is no manipulation of expressions. It takes advantage of the fact that derivative expressions can be broken down into elementary operations that can be combined to give the derivative of the overall expression. I'll be more clear about elementary operations soon but you can think of an elementary operation as being any operation you could give to a node on a computational graph of your program.

Forward mode

To be more concrete about autodiff, let's look at forward mode. Consider evaluating \(f(x_1,x_2) = x_1 x_2 + \text{log}(x_1 ^2)\). We break this into a computational graph and associate with each elementary operation an intermediate variable \(v_i\) and its "tangent" \(\dot{v}_i = \frac{\partial v_i}{\partial x}\). The final tangent value, which has been calculated as the function evaluates at the input (3,5), is a derivative at the point (3,5). Which derivative exactly depends on the initial values of \(\dot{x_1}\) and \(\dot{x_2}\).
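Here is one possible evaluation trace at \((x_1,x_2) = (3,5)\) with \(\dot{x}_1 = 1\) and \(\dot{x}_2 = 0\) (my own decomposition into elementary operations; a computational graph would show the same thing):

\[\begin{aligned}
v_1 &= x_1 = 3, & \dot{v}_1 &= \dot{x}_1 = 1 \\
v_2 &= x_2 = 5, & \dot{v}_2 &= \dot{x}_2 = 0 \\
v_3 &= v_1 v_2 = 15, & \dot{v}_3 &= \dot{v}_1 v_2 + v_1 \dot{v}_2 = 5 \\
v_4 &= v_1^2 = 9, & \dot{v}_4 &= 2 v_1 \dot{v}_1 = 6 \\
v_5 &= \text{log}(v_4) \approx 2.197, & \dot{v}_5 &= \dot{v}_4 / v_4 = 2/3 \\
v_6 &= v_3 + v_5 \approx 17.197, & \dot{v}_6 &= \dot{v}_3 + \dot{v}_5 = 5\tfrac{2}{3}
\end{aligned}\]

Setting \(\dot{x}_1 = 0, \dot{x}_2 = 1\) instead would give \(\partial f / \partial x_2 = v_1 = 3\).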

Sketching a forward mode autodiff library

It's surprisingly easy to implement forward mode autodiff in Julia (at least a naive form). Below I create a new type Dual that is a subtype of Number, and then proceed to overload common mathematical functions (e.g. sin and *) to account for this new number type. Each instance of Dual will have a prime and a tangent slot. If we want the derivative with respect to argument x₁ of the function y = f(x₁,x₂) we simply set x₁.t = 1.0 (leaving x₂.t = 0.0) and check the value of y.t. For more see this video from MIT's Alan Edelman.

import Base: +,-,/,*,^,sin,cos,exp,log,convert,promote_rule,println

struct Dual <: Number
  p::Number # prime
  t::Number # tangent
end

+(x::Dual,y::Dual) = Dual(x.p + y.p, x.t + y.t)
-(x::Dual,y::Dual) = Dual(x.p - y.p, x.t - y.t)
/(x::Dual,y::Dual) = Dual(x.p/y.p, (x.t*y.p - x.p*y.t)/y.p^2)
*(x::Dual,y::Dual) = Dual(x.p*y.p, x.t*y.p + x.p*y.t)
sin(x::Dual) = Dual(sin(x.p), cos(x.p) * x.t)
cos(x::Dual) = Dual(cos(x.p), -sin(x.p) * x.t)
exp(x::Dual) = Dual(exp(x.p), exp(x.p) * x.t)
log(x::Dual) = Dual(log(x.p), (1/x.p) * x.t)
^(x::Dual,p::Int) = Dual(x.p^p, p*x.p^(p-1) * x.t)

# We can think of dual numbers analogously to complex numbers
# The epsilon term will be the derivative
println(x::Dual) = println(x.p," + ",x.t,"ϵ")

# deal with conversion, and Dual with non-Dual math
convert(::Type{Dual}, x::Number) = Dual(x, zero(x))
promote_rule(::Type{Dual},::Type{<:Number}) = Dual;

Let's test on our example \(f(x_1,x_2) = x_1 x_2 + \text{log}(x_1 ^2)\); the derivative with respect to \(x_1\) at (3,5) should be \(5 \tfrac{2}{3}\).

x1 = Dual(3.0,1.0)
x2 = Dual(5.0,0.0)
f(x1,x2) = x1*x2 + log(x1^2)
y = f(x1,x2)
# df/dx1
println(y.t)
# direct calculation
println(x2.p + 2/x1.p)
5.666666666666667
5.666666666666667

Conclusion

Autodiff is important in machine learning and scientific computing and (forward mode) surprisingly easy to implement. I'll look at reverse mode autodiff in another post.

Thanks to Oscar Perez Concha who helped with discussions on the content of this post.