Re-posted from: https://estadistika.github.io//bayesian/probability/modeling/inference/julia/2020/03/01/Bayesian-Sequential-Learning.html
In my previous article, I discussed how we can use the Bayesian approach to estimate the parameters of a model. The process revolves around solving the following conditional probability, popularly known as Bayes' Theorem:
$$
\mathbb{P}(\mathbf{w}|\mathbf{y})=\frac{\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})}{\mathbb{P}(\mathbf{y})},
$$
where $\mathbb{P}(\mathbf{w})$ is the a priori (prior distribution) of the objective parameters, $\mathbb{P}(\mathbf{y}|\mathbf{w})$ is the likelihood, and $\mathbb{P}(\mathbf{y})$ is the normalizing constant (the model evidence) with the following form:
$$
\mathbb{P}(\mathbf{y})=\int_{\mathscr{P}}\mathbb{P}(\mathbf{y}|\mathbf{w})\mathbb{P}(\mathbf{w})\,\text{d}\mathbf{w},
$$
where $\mathscr{P}$ is the parameter space.
Posterior Distribution
The details of the derivation of the a posteriori were also provided in the said article, but there were missing pieces which I think are necessary to support our proposition, and thus we have the following result:
Proposition. If
$$y_i\overset{\text{iid}}{\sim}\mathcal{N}(w_0+w_1x_i,\alpha^{-1})$$
and $\mathbf{w}\triangleq[w_0,w_1]^{\text{T}}$ such that $\mathbf{w}\sim\mathcal{N}_2(\mathbf{0},\beta^{-1}\mathbf{I})$, then $\mathbf{w}|\mathbf{y}\sim\mathcal{N}_2(\boldsymbol{\mu},\boldsymbol{\Sigma})$ where
$$
\begin{align}
\boldsymbol{\mu}&=\alpha\boldsymbol{\Sigma}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y},\\
\boldsymbol{\Sigma}&=(\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\mathfrak{A}+\beta\mathbf{I})^{-1}
\end{align}
$$
and $\boldsymbol{\mathfrak{A}}\triangleq\left[\mathbf{x}_i^{\text{T}}\right]_{\forall i}$ is the design matrix whose $i$th row is $\mathbf{x}_i^{\text{T}}\triangleq[1, x_i]$.
Proof. Observe that the model can be written as
$$
\begin{align}
y_i=\hat{y}_i+\varepsilon_i,\quad\varepsilon_i\overset{\text{iid}}{\sim}\mathcal{N}(0,1/\alpha),
\end{align}
$$
where $\hat{y}_i\triangleq w_0+w_1x_i$ and $\varepsilon_i$ is the innovation that the model can't explain, with $\mathbb{Var}(\varepsilon_i)=\alpha^{-1}$ since $\mathbb{Var}(y_i)=\alpha^{-1}$ as given above. Then the likelihood of the model is given by:
$$
\begin{align}
\mathcal{L}(\mathbf{w}|\mathbf{y})\triangleq\mathbb{P}(\mathbf{y}|\mathbf{w})&=\prod_{\forall i}\frac{1}{\sqrt{2\pi\alpha^{-1}}}\exp\left[-\frac{(y_i-\hat{y}_i)^2}{2\alpha^{-1}}\right]\\
&=\frac{\alpha^{n/2}}{(2\pi)^{n/2}}\exp\left[-\alpha\sum_{\forall i}\frac{(y_i-\hat{y}_i)^2}{2}\right],
\end{align}
$$
or in vector form:
$$
\begin{align}
\mathcal{L}(\mathbf{w}|\mathbf{y})\propto\exp\left[-\frac{\alpha(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})}{2}\right],
\end{align}
$$
where $\boldsymbol{\mathfrak{A}}\triangleq[\mathbf{x}_i^{\text{T}}]_{\forall i}$ is the design matrix given above. If the a priori of the parameters is assumed to be a zero-mean bivariate Gaussian with precision $\beta$, i.e. $\mathbf{w}\sim\mathcal{N}_2(\mathbf{0}, \beta^{-1}\mathbf{I})$ (the standard bivariate Gaussian corresponds to $\beta=1$), then
$$
\begin{align}
\mathbb{P}(\mathbf{w}|\mathbf{y})&\propto\mathcal{L}(\mathbf{w}|\mathbf{y})\mathbb{P}(\mathbf{w})\\
&\propto\exp\left[-\frac{\alpha(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})}{2}\right]\exp\left[-\frac{1}{2}\mathbf{w}^{\text{T}}\beta\mathbf{I}\mathbf{w}\right]\\
&\propto\exp\left\{-\frac{1}{2}\left[\alpha(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})+\mathbf{w}^{\text{T}}\beta\mathbf{I}\mathbf{w}\right]\right\}.
\end{align}
$$
Expanding the terms in the exponential function returns the following:
$$
\alpha\mathbf{y}^{\text{T}}\mathbf{y}-2\alpha\mathbf{w}^{\text{T}}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}+\mathbf{w}^{\text{T}}(\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I})\mathbf{w},
$$
thus, collecting the term free of $\mathbf{w}$, namely $\alpha\mathbf{y}^{\text{T}}\mathbf{y}$, into a multiplicative constant $\mathcal{C}$,
\begin{equation}
\mathbb{P}(\mathbf{w}|\mathbf{y})\propto\mathcal{C}\exp\left\{-\frac{1}{2}\left[\mathbf{w}^{\text{T}}(\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I})\mathbf{w}-2\alpha\mathbf{w}^{\text{T}}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}\right]\right\}.
\end{equation}
The inner terms of the exponential function are of the form $ax^2-2bx$. This is a quadratic expression and can therefore be factored by completing the square. To do so, let $\mathbf{D}\triangleq\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I}$ and $\mathbf{b}\triangleq\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}$, then
$$
\begin{align}
\mathbb{P}(\mathbf{w}|\mathbf{y})&\propto\mathcal{C}\exp\left[-\frac{1}{2}\left(\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{w}-2\mathbf{w}^{\text{T}}\mathbf{b}\right)\right]\\
&=\mathcal{C}\exp\left[-\frac{1}{2}\left(\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{w}-\mathbf{w}^{\text{T}}\mathbf{b}-\mathbf{b}^{\text{T}}\mathbf{w}\right)\right].
\end{align}
$$
In order to proceed, the matrix $\mathbf{D}$ must be symmetric and invertible (this can be shown separately). If so, then $\mathbf{D}\mathbf{D}^{-1}=\mathbf{D}^{-1}\mathbf{D}=\mathbf{I}$, so that the terms inside the exponential function above become:
$$
\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{w}-\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{D}^{-1}\mathbf{b}-\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{w}+\underset{\text{constant introduced}}{\underbrace{(\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{b}-\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{b})}}.
$$
Finally, let $\boldsymbol{\Sigma}\triangleq\mathbf{D}^{-1}$ and $\boldsymbol{\mu}\triangleq\mathbf{D}^{-1}\mathbf{b}$, then
$$
\begin{align}
\mathbb{P}(\mathbf{w}|\mathbf{y})&\propto\mathcal{C}\exp\left[-\frac{1}{2}\left(\mathbf{w}^{\text{T}}\boldsymbol{\Sigma}^{-1}\mathbf{w}-\mathbf{w}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}-\boldsymbol{\mu}^{\text{T}}\boldsymbol{\Sigma}^{-1}\mathbf{w}+\boldsymbol{\mu}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right]\\
&=\mathcal{C}\exp\left[-\frac{(\mathbf{w}-\boldsymbol{\mu})^{\text{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\boldsymbol{\mu})}{2}\right],
\end{align}
$$
where the leftover term $-\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{b}=-\boldsymbol{\mu}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ becomes part of $\mathcal{C}$, and that proves the proposition. $\blacksquare$
Simulation Experiment
The above result can be applied to any linear model (cross-sectional or time series), and I'm going to demonstrate how we can use it to model the following simulated data. I will be using Julia in Nextjournal (be sure to head over to this link for reproducibility), which already has an image available for version 1.3.1. That said, some of the libraries are already pre-installed in the said image, for example Plots.jl. Thus, we only have to install the remaining libraries that we will need for this experiment.
Load the libraries as follows:
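Since the code cells live in the Nextjournal notebook, here is a minimal sketch of the setup; the exact package list, plot theme, and seed value below are assumptions rather than the original cells:

```julia
using Pkg
Pkg.add(["Distributions", "LaTeXStrings"])  # Plots.jl is assumed pre-installed in the image

using Plots, Distributions, LinearAlgebra, Random, LaTeXStrings

theme(:default)    # set the theme for all plots below (choice of theme is assumed)
Random.seed!(123)  # seed for reproducibility (value is assumed)
```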
The `theme` call above simply sets the theme of the plots below. Further, for reproducibility purposes, I provided a seed as well. The following function will be used to simulate cross-sectional data with population parameters $w_0\triangleq-.3$ and $w_1\triangleq-.5$ for a sample size of 20.
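A sketch of such a simulator is given below; the noise precision `α` and the uniformly drawn predictors are assumptions:

```julia
α = 25.0  # assumed noise precision, i.e. a noise standard deviation of 0.2

function simulate(n::Int=20; w0::Real=-0.3, w1::Real=-0.5)
    x = rand(Uniform(-1, 1), n)                          # predictors
    y = w0 .+ w1 .* x .+ rand(Normal(0, 1/sqrt(α)), n)   # responses with Gaussian noise
    return x, y
end
```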
From the above results, the parameters of the a posteriori can be implemented as follows:
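A direct transcription of the proposition might look like the following; the function name `posterior_params` and the explicit arguments are my own:

```julia
# Posterior parameters from the proposition: Σ = (αA'A + βI)⁻¹ and μ = αΣA'y,
# where A is the design matrix, α the noise precision, and β the prior precision.
function posterior_params(A::Matrix, y::Vector, α::Real, β::Real)
    Σ = inv(α * A'A + β * I)   # posterior covariance
    μ = α * Σ * A'y            # posterior mean
    return μ, Σ
end
```

Note that `I` here is the lazy identity from the LinearAlgebra standard library.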
One feature of Julia is that it supports unicode, making it easy to relate the code to the math above, i.e. `Σ` and `μ` in the code are obviously $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu}$ above, respectively. The vector operations in Julia are also cleaner compared to those in R and in Python; for example, we can encode Julia's `A'A` above as `t(A) %*% A` in R, and as `A.T.dot(A)` in Python.
Finally, we simulate the data as follows:
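A hypothetical call, assuming the `simulate` function sketched above:

```julia
wtrue = [-0.3, -0.5]                           # true weights, spelled out for emphasis
x, y  = simulate(20; w0=wtrue[1], w1=wtrue[2])
```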
We don't really have to write down the `wtrue` variable above, since those are the default values of the `w0` and `w1` arguments, but we do so just for emphasis.
While the main subject here is Bayesian Statistics, it would be better if we have an idea of what the Frequentist solution would be. As most of you are aware, the solution for the weights above is given by the following normal equation:
$$
\hat{\mathbf{w}}=(\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}})^{-1}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y};
$$
this is a known result, and we can prove this in a separate article. The above equation is implemented as follows:
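A minimal sketch, assuming the design matrix `A` is built by prepending a column of ones to the simulated predictors:

```julia
A = hcat(ones(length(x)), x)  # design matrix 𝔄 with rows [1  xᵢ]
ŵ = inv(A'A) * A'y            # ordinary least squares estimate (equivalently, A \ y)
```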
Therefore, for the current sample data, the estimate we got when treating the weights as fixed and unknown is $\hat{\mathbf{w}}=[-0.32264, -0.59357]^{\text{T}}$. The figure below depicts the corresponding fitted line.
Not bad given that we have a very small dataset. Now let's proceed and see how we can infer the parameters in the Bayesian framework. The prior distribution can be implemented as follows:
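For instance, using Distributions.jl; the hyperparameter values reflect the standard bivariate Gaussian assumed above, i.e. prior precision $\beta=1$:

```julia
β  = 1.0                       # prior precision (assumed equal to 1)
μ₀ = zeros(2)                  # prior mean
Σ₀ = Matrix(1.0I, 2, 2) / β    # prior covariance (identity for β = 1)
wprior = MvNormal(μ₀, Σ₀)
```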
As indicated in the above proposition, the parameters are jointly modelled by a bivariate Normal distribution, as reflected by the dimension of the hyperparameter `μ` above. Indeed, the true parameter we are interested in is the weight vector, $\mathbf{w}$, but because we consider it to be random, the parameters of the model we assign to it are called hyperparameters, in this case the mean vector $\boldsymbol{\mu}$ and the identity variance-covariance matrix $\mathbf{I}$ of the prior distribution.
Moreover, the likelihood of the data can be implemented as follows:
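One possible sketch, defining two `likelihood` methods to take advantage of multiple dispatch; the signatures here are assumptions, not the author's exact code:

```julia
# Sampling distribution of the response at the i-th input, given weights w
likelihood(i::Int, w::Vector, A::Matrix, α::Real) = Normal(A[i, :]'w, 1/sqrt(α))

# Density of an observed response yᵢ under that same distribution
likelihood(i::Int, yᵢ::Real, w::Vector, A::Matrix, α::Real) = pdf(likelihood(i, w, A, α), yᵢ)
```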
The mean of the likelihood is the specified linear model itself, which in vector form is the inner product of the transformed design matrix, $\boldsymbol{\mathfrak{A}}$, and the weight vector, $\mathbf{w}$, i.e. `A[i, :]'w`. This assumption is valid since the fitted line must be at the center of the data points, and the error should be random. One of my favorite features of the Julia language is multiple dispatch. For example, the two `likelihood` functions defined above are not in conflict, since Julia dispatches on the types of the function arguments. The same is true for the posterior distribution implemented below. In R and in Python, by contrast, I usually have to write helper functions instead, e.g. `likelihood_helper`.
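A sketch of that posterior, reusing the `posterior_params` helper above (again, names and signatures are assumptions); symmetrizing `Σ` simply guards against floating-point asymmetry before constructing the `MvNormal`:

```julia
function posterior(A::Matrix, y::Vector, α::Real, β::Real)
    μ, Σ = posterior_params(A, y, α, β)
    return MvNormal(μ, (Σ + Σ') / 2)   # the closed-form posterior 𝒩₂(μ, Σ)
end
```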
Finally, the prediction is done by sampling the weights from the posterior distribution. The center of these weights is of course the mean of the a posteriori.
For example, sampling 30 weights from the posterior distribution using all of the sampled data, and returning the corresponding predictions, is done as follows:
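A sketch of such a `predicted` function; the grid of inputs and the keyword names are assumptions:

```julia
function predicted(x::Vector, y::Vector, α::Real, β::Real; nsample::Int=30)
    A  = hcat(ones(length(x)), x)
    ws = rand(posterior(A, y, α, β), nsample)       # 2 × nsample matrix of sampled weights
    xs = range(-1, 1, length=50)                    # grid of inputs
    ŷs = [w[1] .+ w[2] .* xs for w in eachcol(ws)]  # one fitted line per sampled weight
    return xs, ŷs
end

predicted(x, y, α, β)[2]   # index 2 to get the predicted ys only
```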
The `predicted` function returns both the x and y values, which is why we indexed the above result with 2, to show the predicted ys only. Further, using only the first 10 observations of the data for calculating the $\hat{y}$s is done as follows:
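Continuing the sketch:

```julia
predicted(x[1:10], y[1:10], α, β)[2]   # condition on the first 10 observations only
```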
So you might be wondering, what's the motivation for using only the first 10 and not all of the observations? Well, we want to demonstrate how Bayesian inference learns the weights, or the parameters, of the model sequentially.
Visualization
At this point, we are ready to generate our main visualization. The first function (`dataplot`) plots the generated fitted lines from the a priori or the a posteriori, both of which are themselves plotted using the extended method of `contour` we define below.
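Here is only a rough sketch of those two helpers; the signatures, styling, and axis limits are assumptions:

```julia
import Plots: contour

# Scatter the data and overlay fitted lines drawn from a prior/posterior over the weights.
function dataplot(x, y, d::MvNormal; nlines::Int=30)
    p  = scatter(x, y, legend=false, xlabel="predictor", ylabel="response")
    xs = range(-1, 1, length=50)
    for w in eachcol(rand(d, nlines))
        plot!(p, xs, w[1] .+ w[2] .* xs, alpha=0.3, color=:gray)
    end
    return p
end

# Extend `contour` so a bivariate Gaussian can be plotted directly over weight space.
function contour(d::MvNormal; lims=(-1, 1))
    ws = range(lims..., length=100)
    contour(ws, ws, (w0, w1) -> pdf(d, [w0, w1]), xlabel="w0", ylabel="w1", legend=false)
end
```

The profile-likelihood panels in the first column of the grid can then reuse the stock `contour(ws, ws, f)` form together with the `likelihood` methods above.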
Tying all the code together gives us this beautiful grid plot.
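A hypothetical assembly of the grid, with the panel selection (prior, then 1, 2, and 20 observations) and the layout assumed:

```julia
ws = range(-1, 1, length=100)
panels = [plot(framestyle=:none), contour(wprior), dataplot(x, y, wprior)]
for n in (1, 2, 20)
    An, yn = hcat(ones(n), x[1:n]), y[1:n]
    lik  = contour(ws, ws, (w0, w1) -> likelihood(n, yn[n], [w0, w1], An, α))
    post = posterior(An, yn, α, β)
    push!(panels, lik, contour(post), dataplot(x[1:n], yn, post))
end
plot(panels..., layout=(4, 3), size=(900, 1000))
```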
Since we went with style before comprehension, let me guide you through the axes. All figures are on a unit-square space, with the contour plots having the following axes: $w_0$ (the x-axis) and $w_1$ (the y-axis). The data-space plots, on the other hand, have the predictor on the x-axis and the response on the y-axis.
Discussions
We commenced the article with emphasis on the approach of Bayesian Statistics to modeling, whereby the estimation of the parameters, as mentioned, is based on Bayes' Theorem, which is a conditional probability of the following form:
\begin{equation}
\mathbb{P}(\mathbf{w}|\mathbf{y})=\frac{\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})}{\mathbb{P}(\mathbf{y})}.
\end{equation}
Now we will relate this to the above figure using an analogy, to illustrate how the model sequentially learns the optimal estimate for the parameters of the linear model.
Consider the following: say you forgot where you left your phone, and for some reason you can't ring it up, because it could be dead or can't pick up any signal. Further, suppose you don't want to look for it yourself; rather, you let Mr. Bayes, your staff, do the task. How would he then proceed? Well, let the weight vector, $\mathbf{w}\triangleq[w_0,w_1]^{\text{T}}$, be the location of your phone. In order to find or at least best approximate the exact location, we need to first consider some prior knowledge of the event. In this case, we need to ask the following questions: where were you the last time you had the phone? Were you in the living room? Or in the kitchen? Or in the garage? And so on. In the context of modeling, this prior knowledge about the true location can be described by a probability distribution, and we refer to this as the a priori (or the prior distribution). This set of possible distributions is of course a model itself, with parameters, as mentioned above, referred to as hyperparameters, which we can tweak to describe our prior knowledge of the event. For example, you might consider the kitchen as the most probable place where you left your phone. So we adjust the location parameter of our a priori model to where the kitchen is. Hence Mr. Bayes should already be in the kitchen, assessing the coverage of his search area. Of course, you need to help Mr. Bayes with the extent of the coverage. This coverage or domain can be described by the scale parameter of your a priori. If we relate this to the main plot, we assumed the prior distribution over the weight vector to be a standard bivariate Gaussian distribution, centered at the zero vector with identity variance-covariance matrix. Since the prior knowledge can have a broad domain or coverage of the possible values of the weight vector, the samples we get generate random fitted lines, as we see in the right-most plot of the first row of the figure above.
Once we have the prior knowledge in place, that is, we are already in the kitchen and we know how wide the search area is likely to be, we can start looking for evidence. The evidence is the realization of your true model; relating to the math above, these realizations are the $y_i$s, coming from $y_i=f(x|\mathbf{w})$, where $f(x|\mathbf{w})$ is the link function of the true model, which we attempt to approximate with our hypothesized link function, $h(x|\hat{\mathbf{w}})$, that generates the predicted $\hat{y}_i$s. For example, you may not know where exactly your phone is, but you are sure about your habits. So you inform Mr. Bayes that the last time you were with your phone in the kitchen, you were drinking some coffee. Mr. Bayes will then use this as his first evidence, and assess the likelihood of each suspected location in the kitchen. That is, what is the likelihood that a particular location formed (or realized, generated, or is connected to) the first evidence (taking some coffee)? For example, is it even comfortable to drink coffee at the sink? Obviously not, so that location gets a very low likelihood, but locations at the dining table or close to where the coffee maker is are likely. If we assess all possible locations within our coverage using the first evidence, we get the profile likelihood, which is what we have in the first column of the grid plot above, the profile likelihood for the $i$th evidence. Further, with the first evidence observed, the prior knowledge of Mr. Bayes needs to be updated to obtain the posterior distribution. The new distribution will have updated location and scale parameters. If we relate this to the above figure, we can see the samples of the fitted lines in the data-space plot (third column, second row) starting to make guesses of possible lines given the first evidence observed. Moving on, you inform Mr. Bayes of the second evidence, that you were reading some newspaper while having your coffee. At this point, the prior belief of Mr. Bayes, for the next posterior, will be the posterior from the first evidence, and so the coverage becomes more restrictive and has a new location, which further helps Mr. Bayes in managing the search area. The second evidence, as mentioned, will then return a new posterior. You do this again and again, informing Mr. Bayes of the other pieces of evidence sequentially, until the last one. The final evidence will end up with the final posterior distribution, which we expect to have a new location parameter, closer to the exact location, and a small scale parameter, covering a small circle around the exact solution. The final posterior will then be your best guess describing the exact location of your phone.
This may not be the best analogy, but that is how, in the above figure, the Bayesian framework sequentially learns the optimal estimate for the weight vector.
Bayesian Deep Learning
This section deserves a separate article, but I will briefly give some motivation on how we can generalize the above discussion to more complex models.
The intention of the article is to give the reader a low-level understanding of how Bayes' theorem works, and, without loss of generality, I decided to go with simple linear regression to demonstrate the above subject. However, this can be applied to any model indexed by, or a function of, some parameters or weights $\mathbf{w}$, with the assumption that the solution is random but governed by some probability distribution.
Complex models such as those in Deep Learning are usually based on the assumption that the weights are fixed and unknown, which in Statistics is the Frequentist approach to inference, but without assuming a probability distribution on the error of the model. Therefore, if we are to assume some randomness in the weights, we can then use Bayesian inference to derive, or at least approximate (for models with no closed-form solution), the posterior distribution. Approximate Bayesian inference is done via Markov Chain Monte Carlo (MCMC) or Variational Inference, which we can tackle in a separate post.
Libraries
There are several libraries for doing Bayesian inference; the classic and still one of the most powerful is Stan. For Python, we have PyMC3, Pyro (based on PyTorch), and TensorFlow Probability. For Julia, we have Turing.jl, Mamba.jl, Gen.jl, and Stan.jl. I will have a separate article on these libraries.
Next Steps
The obvious next step for readers to try out is to model the variance as well, since in the above result the variance of the innovation, or the error, is assumed known and equal to $\alpha^{-1}$. Further, one might consider Frequentist sequential learning as well, or proceed with other nonlinear complex models, such as Neural Networks. We can cover these in separate articles.