Author Archives: Tom Breloff

Machine Learning and Visualization in Julia

By: Tom Breloff

Re-posted from: http://www.breloff.com/JuliaML-and-Plots/

In this post, I’ll introduce you to the Julia programming language and a couple long-term projects of mine: Plots for easily building complex data visualizations, and JuliaML for machine learning and AI. After short introductions to each, we’ll quickly throw together some custom code to build and visualize the training of an artificial neural network. Julia is fast, but you’ll see that the real speed comes from developer productivity.

Introduction to Julia

Julia is a fantastic, game-changing language. I’ve been coding for 25 years, using mainstays like C, C++, Python, Java, Matlab, and Mathematica. I’ve dabbled in many others: Go, Erlang, Haskell, VB, C#, Javascript, Lisp, etc. For every great thing about each of these languages, there’s something equally bad to offset it. I could never escape the “two-language problem”: maintaining a multi-language code base to work around the deficiencies of each language. C can be fast to run, but it certainly isn’t fast to code. The lack of high-level interfaces means you’ll need to do most of your analysis work in another language. For me, that was usually Python. Now… Python can be great, but when it’s not good enough… ugh.

Python excels when you want high level and your required functionality already exists. If you want to implement a new algorithm with even minor complexity, you’ll likely need another language. (Yes… Cython is another language.) C is great when you just want to move some bytes around. But as soon as you leave the “sweet spot” of these respective languages, everything becomes prohibitively difficult.

Julia is amazing because you can abstract exactly the right amount. Write pseudocode and watch it run (and usually run fast!). Easily create strongly-typed custom data manipulators. Write a macro to automate generation of your boilerplate code. Use generated functions to produce highly specialized code paths depending on input types. Create your own mini-language for domain-specificity. I often find myself designing solutions to problems that simply should not be attempted in other languages.

using Plots
pyplot()  # use the PyPlot (matplotlib) backend

# rough, subjective scores for each language
labs = split("Julia C/C++ Python Matlab Mathematica Java Go Erlang")
ease = [0.8, 0.1, 0.8, 0.7, 0.6, 0.3, 0.5, 0.5]
power = [0.9, 0.9, 0.3, 0.4, 0.2, 0.8, 0.7, 0.5]

# size each label in proportion to the product of its two scores
txts = map(i->text(labs[i], font(round(Int, 5+15*power[i]*ease[i]))), 1:length(labs))

# draw the labels as series annotations on invisible markers
scatter(ease, power,
    series_annotations=txts, ms=0, leg=false,
    xguide="Productivity", yguide="Power",
    formatter=x->"", grid=false, lims=(0,1)
)
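
As a tiny, contrived illustration of the flexibility I mean (multiple dispatch plus a small macro; nothing here is specific to any package, and the names are my own):

# multiple dispatch: one generic function, with a specialized method per type
depict(x::Number) = "a number: $x"
depict(x::AbstractString) = "a string of length $(length(x))"
depict(x::AbstractVector) = "a vector with $(length(x)) elements"

depict(3.14), depict("hello"), depict([1,2,3])

# a macro that generates boilerplate timing code at parse time
macro time_it(ex)
    quote
        t0 = time()
        result = $(esc(ex))
        println("elapsed: ", time() - t0, " seconds")
        result
    end
end

@time_it sum(rand(10^6))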

I won’t waste time going through Julia basics here. For new users, there are many resources for learning. The takeaway is: if you’re reading this post and you haven’t tried Julia, drop what you’re doing and give it a try. With services like JuliaBox, you really don’t have an excuse.

Introduction to Plots

Plots (and the JuliaPlots ecosystem) provides modular tools and a cohesive interface that let you easily define and manipulate visualizations.

One of its strengths is the variety of supported backends. Choose text-based plotting from a remote server or real-time 3D simulations; fast, interactive, lightweight, or complex, all without changing your plotting code. Massive thanks to the creators and developers of the many backend packages, and especially to Josef Heinen and Simon Danisch for their work in integrating the awesome GR and GLVisualize frameworks.
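
For example, the same plotting call can be rendered by any backend just by switching it first (assuming the corresponding backend packages are installed):

gr();           plot(rand(10))   # fast and lightweight
pyplot();       plot(rand(10))   # matplotlib-based
unicodeplots(); plot(rand(10))   # text output, right in the terminal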

However, more powerful than any individual feature is the concept of recipes. A recipe is, simply put, a conversion with attributes. “User recipes” and “type recipes” can be defined on custom types to enable them to be “plotted” just like anything else. For example, the Game type in my AtariAlgos package will capture the current screen from an Atari game and display it as an image plot with the simple command plot(game).
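
Here is a minimal sketch of a user recipe (the MyTrace type and its fields are hypothetical, but the @recipe pattern is the standard one):

using Plots   # @recipe is re-exported from RecipesBase

# a hypothetical custom type: a time series of measurements
type MyTrace
    t::Vector{Float64}
    y::Vector{Float64}
end

# a "user recipe": teach Plots how to convert a MyTrace into series data
@recipe function f(tr::MyTrace)
    xguide --> "time"      # defaults which the user can still override
    yguide --> "value"
    seriestype := :path    # force this series to be drawn as a line
    tr.t, tr.y             # return the data to be plotted
end

# now a MyTrace plots like any built-in type
plot(MyTrace(collect(0:0.1:10), sin.(0:0.1:10)))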

“Series recipes” allow you to build up complex visualizations in a modular way. For example, a histogram recipe will bin data and return a bar plot, while a bar recipe can in turn be defined as a collection of shapes. This modularity greatly simplifies generic plot design. Using modular recipes, we are able to implement boxplots and violin plots even when a backend only supports simple drawing of lines and shapes.

To see many more examples of recipes in the wild, check out StatPlots, PlotRecipes, and more in the wider ecosystem.

For a more complete introduction to Plots, see my JuliaCon 2016 workshop and read through the documentation.

Introduction to JuliaML

JuliaML (Machine Learning in Julia) is a community organization that was formed to brainstorm and design cohesive alternatives for data science. We believe that Julia has the potential to change the way researchers approach science, enabling algorithm designers to truly “think outside the box”, something that is hard to do in other languages, where non-conventional approaches can be prohibitively difficult to implement. Many of us had independently developed tools for machine learning before contributing. Some of my contributions to the current codebase in JuliaML are copied from, or inspired by, my work in OnlineAI.

The recent initiatives in the Learn ecosystem (LearnBase, Losses, Transformations, Penalties, ObjectiveFunctions, and StochasticOptimization) were spawned during the 2016 JuliaCon hackathon at MIT. Many of us, including Josh Day, Alex Williams, and Christof Stocker (by Skype), stood in front of a giant blackboard and hashed out the general design. Our goal was to provide fast, reliable building blocks for machine learning researchers, and to unify the existing fragmented development efforts.

  • Learn: The “meta” package for JuliaML, which imports and re-exports many of the packages in the JuliaML organization. This is the easiest way to get everything installed and loaded.
  • LearnBase: Lightweight method stubs and abstractions. Most packages import (and re-export) the methods and abstract types from LearnBase.
  • Losses: A collection of types and methods for computing loss functions for supervised learning. Both distance-based losses (typically for regression) and margin-based losses (e.g. Support Vector Machine classification) are supported. Optimized methods for working with array data are provided in both allocating and non-allocating versions; a short sketch of the call pattern follows this list. This package was originally Evizero/LearnBase.jl. Much of the development is by Christof Stocker, with contributions from Alex Williams and myself.
  • Transformations: Tensor operations with attached storage for values and gradients: activations, linear algebra, neural networks, and more. The concept is that each Transformation has both input and output Node for input and output arrays. These nodes implicitly link to storage for the current values and current gradients. Nodes can be “linked” together in order to point to the same underlying storage, which makes it simple to create complex directed graphs of modular computations and perform backpropagation of gradients. A Chain (generalization of a feedforward neural network) is just an ordered list of sub-transformations with appropriately linked nodes. A Learnable is a special type of transformation that also has parameters which can be learned. Utilizing Julia’s awesome array abstractions, we can collect params from many underlying transformations into a single vector, and avoid costly copying and memory allocations. I am the primary developer of Transformations.
  • Penalties: A collection of types and methods for regularization functions (penalties), which are typically part of a total model loss in learning algorithms. Josh Day (creator of the awesome OnlineStats) is the primary developer of Penalties.
  • ObjectiveFunctions: Combine transformations, losses, and penalties into an objective. Much of the interface is shared with Transformations, though this package allows for flexible Empirical Risk Minimization and similar optimization. I am the primary developer on the current implementation.
  • StochasticOptimization: A generic framework for optimization. The initial focus has been on gradient descent, but I have hopes that the framework design will be adopted by other classic optimization frameworks, like Optim. There are many gradient descent methods included: SGD with momentum, Adagrad, Adadelta, Adam, Adamax, and RMSProp. The flexible “Master Learner” framework provides a modular approach to optimization algorithms, allowing developers to add convergence criteria, custom iteration traces, plotting, animation, etc. We’ll see this flexibility in the example below. We have also redesigned data iteration/sampling/splitting, and the new iteration framework is currently housed in StochasticOptimization (though it will eventually live in MLDataUtils). I am the primary developer for this package.
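
As a rough sketch of the Losses call pattern (assuming Learn is installed and loaded as shown later in this post; the exact names follow the conventions of the time and may have changed since):

using Learn   # re-exports the Losses API

y = [1.0, -0.5, 2.0]   # targets
ŷ = [0.8, -0.2, 2.4]   # model outputs

loss = L2DistLoss()    # a distance-based loss: (output - target)^2
value(loss, y, ŷ)      # element-wise loss values
deriv(loss, y, ŷ)      # element-wise derivatives with respect to the outputs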

Learning MNIST

Time to code! I’ll walk you through some code to build, learn, and visualize a fully connected neural network for the MNIST dataset. The steps I’ll cover are:

  • Load and initialize Learn and Plots
  • Build a special wrapper for our trace plotting
  • Load the MNIST dataset
  • Build a neural net and our objective function
  • Create custom traces for our optimizer
  • Build a learner, and learn optimal parameters

Custom visualization for tracking MNIST fit

Disclaimers:

  • I expect you have a basic understanding of gradient descent optimization and machine learning models. I don’t have the time or space to explain those concepts in detail, and there are plenty of other resources for that.
  • Basic knowledge of Julia syntax/concepts would be very helpful.
  • This API is subject to change, and this should be considered pre-alpha software.
  • This assumes you are using Julia 0.5.

Get the software (use Pkg.checkout on a package for the latest features):

# Install Learn, which will install all the JuliaML packages
Pkg.clone("https://github.com/JuliaML/Learn.jl")
Pkg.build("Learn")
Pkg.checkout("MLDataUtils", "tom") # call Pkg.free if/when this branch is merged

# A package to load the data
Pkg.add("MNIST")

# Install Plots and StatPlots
Pkg.add("Plots")
Pkg.add("StatPlots")

# Install GR -- the backend we'll use for Plots
Pkg.add("GR")

Start up Julia, then load the packages:

using Learn
import MNIST
using MLDataUtils
using StatsBase
using StatPlots

# Set up GR for plotting. x11 is uglier, but much faster
ENV["GKS_WSTYPE"] = "x11"
gr(leg=false, linealpha=0.5)

A custom type to simplify the creation of trace plots (which will probably be added to MLPlots):

# the type, parameterized by the indices and plotting backend
type TracePlot{I,T}
    indices::I
    plt::Plot{T}
end

getplt(tp::TracePlot) = tp.plt

# construct a TracePlot for n series.  note we pass through
# any keyword arguments to the `plot` call
function TracePlot(n::Int = 1; maxn::Int = 500, kw...)
    indices = if n > maxn
        # limit to maxn series, randomly sampled
        shuffle(1:n)[1:maxn]
    else
        1:n
    end
    TracePlot(indices, plot(length(indices); kw...))
end

# add a y-vector for value x
function add_data(tp::TracePlot, x::Number, y::AbstractVector)
    for (i,idx) in enumerate(tp.indices)
        push!(tp.plt.series_list[i], x, y[idx])
    end
end

# convenience: if y is a number, wrap it as a vector and call the other method
add_data(tp::TracePlot, x::Number, y::Number) = add_data(tp, x, [y])
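
Just to show how these helpers are meant to be used (a throwaway example with made-up values):

tp = TracePlot(3, title="three traces")   # three empty series
add_data(tp, 1, rand(3))                  # x = 1, one y-value per series
add_data(tp, 2, rand(3))                  # x = 2
display(getplt(tp))                       # show the current state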

Load the MNIST data and preprocess:

# our data:
x_train, y_train = MNIST.traindata()
x_test, y_test = MNIST.testdata()

# normalize the input data given μ/σ for the input training data
μ, σ = rescale!(x_train)
rescale!(x_test, μ, σ)

# convert class vector to "one hot" matrix
y_train, y_test = map(to_one_hot, (y_train, y_test))

train = (x_train, y_train)
test = (x_test, y_test)

Build a neural net with softplus activations for the inner layers and softmax output for classification:

nin, nh, nout = 784, [50,50], 10
t = nnet(nin, nout, nh, :softplus, :softmax)

Note: the nnet method is a very simple convenience constructor for Chain transformations. It’s pretty easy to construct the transformation yourself for more complex models. This is what is constructed on the call to nnet:

Chain{Float64}(
   Affine{784-->50}
   softplus{50}
   Affine{50-->50}
   softplus{50}
   Affine{50-->10}
   softmax{10}
)

Create an objective function to minimize, adding an Elastic (combined L1/L2) penalty/regularization. Note that the cross-entropy loss function is inferred automatically for us since we are using softmax output:

obj = objective(t, ElasticNetPenalty(1e-5))

Build TracePlot objects for our custom visualization:

# parameter plots
pidx = 1:2:length(t)
pvalplts = [TracePlot(length(params(t[i])), title="$(t[i])") for i=pidx]
ylabel!(pvalplts[1].plt, "Param Vals")
pgradplts = [TracePlot(length(params(t[i]))) for i=pidx]
ylabel!(pgradplts[1].plt, "Param Grads")

# nnet plots of values and gradients
valinplts = [TracePlot(input_length(t[i]), title="input", yguide="Layer Value") for i=1:1]
valoutplts = [TracePlot(output_length(t[i]), title="$(t[i])", titlepos=:left) for i=1:length(t)]
gradinplts = [TracePlot(input_length(t[i]), yguide="Layer Grad") for i=1:1]
gradoutplts = [TracePlot(output_length(t[i])) for i=1:length(t)]

# loss/accuracy plots
lossplt = TracePlot(title="Test Loss", ylim=(0,Inf))
accuracyplt = TracePlot(title="Accuracy", ylim=(0.6,1))

Add a method for computing the loss and accuracy on a subsample of test data:

function my_test_loss(obj, testdata, totcount = 500)
    totloss = 0.0
    totcorrect = 0
    for (x,y) in eachobs(rand(eachobs(testdata), totcount))
        totloss += transform!(obj,y,x)

        # logistic version:
        # ŷ = output_value(obj.transformation)[1]
        # correct = (ŷ > 0.5 && y > 0.5) || (ŷ <= 0.5 && y < 0.5)

        # softmax version:
        ŷ = output_value(obj.transformation)
        chosen_idx = indmax(ŷ)
        correct = y[chosen_idx] > 0

        totcorrect += correct
    end
    totloss, totcorrect/totcount
end

Our custom trace method which will be called after each minibatch:

tracer = IterFunction((obj, i) -> begin
    n = 100
    mod1(i,n)==n || return false

    # param trace
    for (j,k) in enumerate(pidx)
        add_data(pvalplts[j], i, params(t[k]))
        add_data(pgradplts[j], i, grad(t[k]))
    end

    # input/output trace
    for j=1:length(t)
        if j==1
            add_data(valinplts[j], i, input_value(t[j]))
            add_data(gradinplts[j], i, input_grad(t[j]))
        end
        add_data(valoutplts[j], i, output_value(t[j]))
        add_data(gradoutplts[j], i, output_grad(t[j]))
    end

    # compute approximate test loss and trace it
    if mod1(i,500)==500
        totloss, accuracy = my_test_loss(obj, test, 500)
        add_data(lossplt, i, totloss)
        add_data(accuracyplt, i, accuracy)
    end

    # build a heatmap of the total outgoing weight from each pixel
    pixel_importance = reshape(sum(t[1].params.views[1],1), 28, 28)
    hmplt = heatmap(pixel_importance, ratio=1)

    # build a nested-grid layout for all the trace plots
    plot(
        map(getplt, vcat(
                pvalplts, pgradplts,
                valinplts, valoutplts,
                gradinplts, gradoutplts,
                lossplt, accuracyplt
            ))...,
        hmplt,
        size = (1400,1000),
        layout=@layout([
            grid(2,length(pvalplts))
            grid(2,length(valoutplts)+1)
            grid(1,3){0.2h}
        ])
    )

    # show the plot
    gui()
end)

# trace once before we start learning to see initial values
tracer.f(obj, 0)

Finally, we build our learner and learn! We’ll use the Adadelta method with a learning rate of 0.05. Notice we just added our custom tracer to the list of parameters… we could have added others if we wanted. The make_learner method is just a convenience to optionally construct a MasterLearner with some common sub-learners. In this case we add a MaxIter(50000) sub-learner to stop the optimization after 50000 iterations.

We will train on randomly-sampled minibatches of 5 observations at a time, and update our parameters using the average gradient:

learner = make_learner(
    GradientDescent(5e-2, Adadelta()),
    tracer,
    maxiter = 50000
)
learn!(obj, learner, infinite_batches(train, size=5))

A snapshot after training for 30000 iterations

After a little while we are able to predict ~97% of the test examples correctly. The heatmap (which represents the “importance” of each pixel according to the outgoing weights of our model) depicts the important curves we have learned to distinguish the digits. The performance can certainly be improved, and I might devote future posts to the many ways we could improve our model; however, model performance was not my focus here. Rather, I wanted to highlight the flexibility in building, learning, and visualizing machine learning models.

Conclusion

There are many approaches and toolsets for data science. In the future, I hope that the ease of development in Julia convinces people to move their considerable resources away from inefficient languages and towards Julia. I’d like to see Learn.jl become the generic interface for all things learning, similar to how Plots is slowly becoming the center of visualization in Julia.

If you have questions, or want to help out, come chat. For those in the reinforcement learning community, I’ll probably focus my next post on Reinforce, AtariAlgos, and OpenAIGym. I’m open to many types of collaboration. In addition, I can consult and/or advise on many topics in finance and data science. If you think I can help you, or you can help me, please don’t hesitate to get in touch.

Neurons are computers: The Computational Power of Dendrites

By: Tom Breloff

Re-posted from: http://www.breloff.com/Neurons-are-computers/

The human brain is composed of billions of computational units, called neurons. Neurons transmit information to each other through electrical pulses, or spikes. A common misconception is that the neuron is the basic computational building block; that spikes are transferred and integrated at the soma, and that axons and dendrites are simply a network of electrical cables connecting the neurons together. Treating the dendrite as cabling is convenient; the mathematics are simpler, and it’s easier to reason about the role of neurons, both individually and in groups. As we’ll see, this simplification is wrong, and it glosses over a critical computational component in our neural architecture.

If you missed it, please check out my first post: Efficiency is Key: Lessons from the Human Brain

Some of the graphics and animations below were created in Julia with Plots.jl. Check out the IJulia notebook.

Figure A: Components of a Neuron [1]

The Basics of Spiking

A neuron’s output is a sequence of electrical bursts, termed action potentials, or spikes. A neuron’s cell body (soma) accumulates post-synaptic (after the synaptic cleft) chemical energy in the form of a membrane potential. At the site of the axon hillock, net input potentials are summed, and a spike is generated. This spike travels quickly through short sections of the axon surrounded by myelin sheath, and the signal is boosted at each node (the gaps between sheaths). This successive boosting allows the current to quickly reach the tips of the axonal branches.

Axons (output) form connections with the dendrites (input) of other neurons; this connection is called a synapse. When a signal reaches the end of an axon at a synapse, it triggers the release of neurotransmitters which then cross the synaptic gap, binding to and opening channels on the dendrite. The resulting transfer of ions (for example sodium and potassium) is what changes the internal charge of the dendrite, and eventually the soma. This charge builds up until it crosses the membrane threshold, triggering a run-away process which results in a spike. (See this short video for a visual explanation of the process.)

Axons terminate at many pre-synaptic (before the synaptic cleft) sites. Synaptic terminals are possible in many configurations. Axons can synapse directly with the soma (axosomatic), leading to direct potential transfer, or they could even synapse with another synaptic terminal (axosynaptic), leading to modulatory effects. However the most common connection is axon-to-dendrite (axodendritic).


Figure B: Synaptic Connection Locations [1]

Location, location, location…

The location of the synaptic terminal has a large effect on the resulting change in membrane potential (which, when high enough, will produce a spike). Those synapses connecting axon to soma (axosomatic) have nearly direct energy transfer; the postsynaptic response (i.e. how the receiving neuron responds) will be somewhat linear when compared with the presynaptic input.

For synapses on the dendritic tree, the postsynaptic potential (PSP) is affected both by passive properties of dendrites [2] and by synaptic inhibition from neighboring synapses [3], both driving a sublinear response. This means that postsynaptic potentials have diminishing marginal response as more synapses on the dendritic segment become active.

In addition, dendritic synapses have active properties. When synapses are active in isolation, the relative response at the soma is generally small. However, when multiple synapses become active with spatio-temporal proximity (meaning nearby in both space and time), they can trigger a supralinear response: a dendritic spike. This active conductance is relatively large compared to the individual postsynaptic responses; however, it quickly saturates, and additional presynaptic input has little effect.


Figure C: Membrane Potential Response Shapes

This means that, depending on the structure and properties of a neuron’s dendritic tree and synaptic connectivity, membrane potential can be affected in highly nonlinear ways. Notice in Figure C how the sublinear line is the square root of the input, while the supralinear line is similar to the sigmoid function. This is intentional, to emphasize parallels to current deep learning methods. For more details on the sublinear and supralinear curves, see [3].

Dendritic Segments as a Key Computational Component

As hinted before, nearby synapses interact in nonlinear ways when they are active at the same time. It turns out that relative distance is not the defining characteristic; synapses are grouped into dendritic segments which are somewhat isolated from each other. Depending on the properties of the segment, as well as the location, the synaptic interaction can be linear, sublinear, or supralinear.

Remember also that dendrites have a branching, tree-like structure. One advantage is that the branch points, where the tree splits, have inhibitory properties. Thus the dendritic forks can integrate or inhibit potentials from more distal dendritic areas.

To summarize: a neuron has dendritic branches protruding from the soma. Those branches split at dendritic forks into more branches. On each branch are some number (possibly zero) of dendritic segments. A segment may contain many axodendritic synapses. Integration, inhibition, and amplification can occur at every level of the hierarchy, creating a rich and deep computational structure within a single neuron. In Julia:

# the building blocks of a single neuron's input structure

abstract NeuralComponent

# an axodendritic connection from another neuron
type Synapse <: NeuralComponent
    presynaptic_neuron
    presynaptic_current
end

# a dendritic segment: a group of synapses which interact nonlinearly
type Segment <: NeuralComponent
    synapses
end

# a dendritic branch: its segments, plus the fork where the branch splits
type Branch <: NeuralComponent
    segments
    fork
end

# the cell body, which integrates its inputs into a membrane potential
type Soma
    inputs::Set{NeuralComponent}
    membrane_potential::Float64
end
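
And a minimal, hypothetical wiring of this hierarchy, just to make the containment concrete (the values are placeholders; a resting potential of roughly -70 mV is typical):

# two synapses on one segment, one segment on a branch, one branch into the soma
seg = Segment([Synapse(nothing, 0.0), Synapse(nothing, 0.0)])
branch = Branch([seg], nothing)   # no further fork below this branch
soma = Soma(Set{NeuralComponent}([branch]), -70.0)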

To be continued… (when I have time)


[1] Blausen.com staff. “Blausen gallery 2014”. Wikiversity Journal of Medicine. doi: 10.15347/wjm/2014.010. ISSN 20018762.

[2] Abrahamsson, T, Cathala, L, Matsui, K, Shigemoto, R, and Digregorio, D. A. (2012). Thin dendrites of cerebellar interneurons confer sublinear synaptic integration and a gradient of short-term plasticity. Neuron. 73, 1159-1172. doi: 10.1016/j.neuron.2012.01.027

[3] Tran-Van-Minh A, Caze RD, Abrahamsson T, Cathala L, Gutkin BS and DiGregorio DA (2015) Contribution of sublinear and supralinear dendritic integration to neuronal computations. Front. Cell. Neurosci. 9:67. doi: 10.3389/fncel.2015.00067

Efficiency is Key: Lessons from the Human Brain.

By: Tom Breloff

Re-posted from: http://www.breloff.com/Efficiency-is-key/

The human brain is intensely complicated.  Memories, motor sequences, emotions, language, and more are all maintained and enacted solely through the temporary and fleeting transfer of energy between neurons: the slow release of neurotransmitters across synapses, dendritic integration, and finally the somatic spike.  A single spike (somatic action potential) will last a small fraction of a second, and yet somehow we are able to swing a baseball bat, compose a symphony, and apply memories from decades in the past.  How can our brain be based on signals of such short duration, and yet work on such abstract concepts stretching vast time scales?

In this blog, I hope to lay out some core foundations and research in computational neuroscience and machine learning which I feel will comprise the core components of an eventual artificially intelligent system.  I’ll argue that rate-based artificial neural networks (ANN) have limited power, partially due to the removal of the important fourth dimension: time.  I also hope to highlight some important areas of research which could help bridge the gap from “useful tool” to “intelligent machine”.  I will not give a complete list of citations, as that would probably take me longer to compile than writing this blog, but I will occasionally mention references which I feel are important contributions or convey a concept well.

These views are my own opinion, formed after studying many related areas in computational neuroscience, deep learning, reservoir computing, neuronal dynamics, computer vision, and more.  This personal study is heavily complemented with my background in statistics and optimization, and 25 years of experience with computer programming, design, and algorithms.  Recently I have contributed MIT-licensed software for the awesome Julia programming language. For those in New York City, we hope to see you at the next meetup!

See my bio for more information.


What is intelligence?

What does it mean to be intelligent?  Are dogs intelligent? Mice?  What about a population of ants, working together toward a common goal?  I won’t give a definitive answer, and this is a topic which easily creates heated disagreement.  However, I will roughly assume that intelligence involves robust predictive extrapolation/generalization into new environments and patterns using historical context.  As an example, an intelligent agent would predict that they would sink through mud slowly, having only experienced the properties of dirt and water independently, while a Weak AI system would likely say “I don’t know… I’ve never seen that substance” or worse: “It’s brown, so I expect it will be the same as dirt”.

Intelligence need not be human-like, though that is the easiest kind to understand.  I foresee intelligent agents sensing traffic patterns throughout a city and controlling stoplight timings, or financial regulatory agents monitoring transaction flow across markets and continents to pinpoint criminal activity.  In my eyes, these are sufficiently similar to a human brain which senses visual and auditory inputs and acts on the world through body mechanics, learning from the environment and experience as necessary.  While the sensorimotor components are obviously very different between these examples and humans, the core underlying (intelligent) algorithms may be surprisingly similar.


Some background: Neural Network Generations

I assume you have some basic understanding about artificial neural networks going forward, though a quick Google search will give you plenty of background for most of the research areas mentioned.

There is no clear consensus on the generational classification of neural networks.  Here I will take the following views:

  • First generation networks employ a thresholding (binary output) activation function.  An example is the classic Perceptron.
  • Second generation networks employ (mostly) continuously differentiable activation functions (sigmoid, tanh, or alternatives such as the Rectified Linear Unit (ReLU) and variants, which have discontinuities in the derivative at zero), following an inner product of weights and inputs to a neuron.  Some have transformational processing steps like the convolutions and pooling layers of CNNs.  Most ANNs today are second generation, including impressively large and powerful Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Deep Learning (DL), Neural Turing Machines (NTM), and much, much more.
  • Third generation networks add a fourth dimension (time), and are likely built using an “energy propagation” mechanism which converts and accumulates inputs over time, propagating a memory of inputs through the network.  Examples include Spiking Neural Networks (SNN), Liquid State Machines (LSM), and Hierarchical Temporal Memory (HTM).

The first two generations of networks are static, in the sense that there is no explicit time component.  Of course, they can be made to represent time through additional structure (such as in RNNs) or transformed inputs (such as appending lagged inputs as in ARIMA-type models).  Network dynamics can be changed through learning, but that structure must be explicitly represented by the network designer.

In the last few years, there have been incredible advances in the expressive power of second generation networks.  Networks have been built which can approach or surpass human ability in object recognition, language translation, pattern recognition, and even creativity.  While this is impressive, most second generation networks have problems of fragility and scalability.  A network with hundreds of millions of parameters (or more) requires tons of computing power and labeled training samples to effectively learn its goal (such as this awesome network from OpenAI’s Andrej Karpathy).  This means that massive labeled data sets and compute power are required to create useful networks (which is why Google, Facebook, Apple, etc. are the companies currently winning this game).

I should note that none of the network types that I’ve listed are “brain-like”.  Most have only abstract similarities to a real network of cortical neurons.  First and second generation networks roughly approximate a “rate-based” model of neural activity, which means the instantaneous mean firing rate of a neuron is the only output, and the specific timings of neuronal firings are ignored.  Research areas like Deep Reinforcement Learning are worthwhile extensions to ANNs, as they get closer to the required brain functionality of an agent which learns through sensorimotor interaction with an environment; however, the current attempts do not come close to the varied dynamics found in real brains.

SNN and LSM networks incorporate specific spike timing as a core piece of their models, however they still lack the functional expressiveness of the brain: dendritic computation, chemical energy propagation, synaptic delays, and more (which I hope to cover in more detail another time). In addition, the added complexity makes interpretation of the network dynamics difficult.  HTM networks get closer to “brain-like” dynamics than many other models, however the choice of binary integration and binary outputs are a questionable trade-off for many real world tasks, and it’s easy to wonder if they will beat finely tuned continuously differentiable networks in practical tasks.


More background: Methods of Learning

There are two core types of learning: supervised and unsupervised.  In supervised learning, there is a “teacher” or “critic”, which gives you input, compares your output to some “correct answer”, and gives you a numerical quantity representing your error.  The classic method of learning in second generation networks is to use the method of backpropagation to project that error backwards through your network, updating individual network parameters based on the contribution of that parameter to the resulting error.  The requirement of a (mostly) continuously differentiable error function and network topology is critical for backpropagation, as it uses a simple calculus trick known as the Chain Rule to update network weights.  This method works amazingly well when you have an accurate teacher with lots of noise-free examples.
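
To make the chain-rule step concrete, here is a single-neuron sketch (squared error through a sigmoid); it is purely illustrative and not any particular library’s implementation:

# y = σ(w*x + b), loss L = (y - t)^2; backpropagation is just the chain rule
σ(z) = 1 / (1 + exp(-z))

function gradients(w, b, x, t)
    z = w*x + b
    y = σ(z)
    dL_dy = 2 * (y - t)        # ∂L/∂y
    dy_dz = y * (1 - y)        # ∂y/∂z, the sigmoid derivative
    dL_dw = dL_dy * dy_dz * x  # ∂L/∂w via the chain rule
    dL_db = dL_dy * dy_dz      # ∂L/∂b
    dL_dw, dL_db
end

# one gradient descent step with learning rate η
w, b = 0.1, 0.0
dw, db = gradients(w, b, 1.5, 1.0)
η = 0.1
w -= η * dw
b -= η * db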

However, with sufficient noise in your data or error, or inadequate training samples, ANNs are prone to overfitting (or worse).  Techniques such as Early Stopping or Dropout go a long way toward avoiding overfitting, but they may also restrict the expressive power of neural nets in the process.  Much research has gone into improving gradient-based learning rules, and advancements like AdaGrad, RMSProp, AdaDelta, Adam, and (my personal favorite) AdaMax have helped considerably in speeding up the learning process.  Finally, the relatively recent technique of Batch Normalization has improved the ability to train very deep networks.

With too few (or zero) “correct answers” to compare to, how does one learn?  How does a network know that a picture of a cat and its mirror image represent the same cat?  In unsupervised learning, we ask our network to compress the input stream to a reduced (and ideally invariant) representation, so as to reduce the dimensionality.  Thus the mirror image of a cat could be represented as “cat + mirror” so as not to duplicate the representation (and without also throwing away important information).  In addition, the transformed input data will likely require a much smaller model to fit properly, as correlated or causal inputs can be reduced to smaller dimensions.  Thus, the reduced dimensionality may require fewer training examples to train an effective model.

For linear models, statisticians and machine learning practitioners will frequently employ Principal Component Analysis (PCA) as a data preprocessing step, in an attempt to reduce the model complexity and available degrees of freedom.  This is an example of simple and naive unsupervised learning, where relationships within the input data are revealed and exploited in order to extract a dataset which is easier to model.  In more advanced models, unsupervised learning may take the form of a Restricted Boltzmann Machine or Sparse Autoencoders.  Convolution and pooling layers in CNNs could be seen as a type of unsupervised learning, as they strive to create partially translation-invariant representations of the input data.  Concepts like Spatial Transformer Networks, Ripple Pond Networks, and Geoff Hinton’s “Capsules” are similar transformative models which promise to be interesting areas of further research.
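
For concreteness, PCA as a preprocessing step can be sketched in a few lines (in the Julia 0.5-era syntax used elsewhere in this archive; a real workflow would more likely use a package such as MultivariateStats):

X = randn(200, 10)     # 200 observations of 10 features
Xc = X .- mean(X, 1)   # center each column
U, S, V = svd(Xc)      # principal directions are the columns of V
k = 3
Z = Xc * V[:, 1:k]     # reduced inputs for a downstream model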

After transforming the inputs, typically a smaller and simpler model can be used to fit the data (possibly a linear or logistic regression).  It has become common practice to combine these steps, for example by using Partial Least Squares (PLS) as an alternative to PCA + Regression.  In ANNs, weight initialization using sparse autoencoders has helped to speed learning and avoid local minima.  In reservoir computing, inputs are accumulated and aggregated over time in the reservoir, which allows for relatively simple readout models on complex time-varying data streams.


Back to the (efficient) Future

With some background out of the way, we can continue to explore why the third generation of neural networks holds more promise than current state of the art: efficiency.  Algorithms of the future will not have the pleasure of sitting in a server farm and crunching calculations through a static image dataset, sorting cats from dogs.  They will be expected to be worn on your wrist in remote jungles monitoring for poisonous species, or guiding an autonomous space probe through a distant asteroid field, or swimming through your blood stream taking vital measurements and administering precise amounts of localized medicine to maintain homeostasis through illness.

Algorithms of the future must perform brain-like feats, extrapolating and generalizing from experience, while consuming minimal power, sporting a minimal hardware footprint, and making complex decisions continuously in real time.  Compared to the general computing power which is “the brain”, current state of the art methods fall far short in generalized performance, and require much more space, time, and energy.  Advances in data and hardware will only improve the situation slightly.  Incremental improvements in algorithms can have a great impact on performance, but we’re unlikely to see the gains in efficiency we need without drastic alterations to our methods.

The human brain is super-efficient for a few reasons:

  • Information transfer is sparse, and carries high information content. Neurons spike only when needed; to maintain a motor command in working memory, transfer identification of a visual clue, or identify an anomalous sound in your auditory pathway.
  • Information is distributed and robustly represented. Memories can be stored and recalled amid neuron deaths and massive noise.  (I recommend the recent works by Jeff Hawkins and Subutai Ahmad to understand the value of a Sparse Distributed Representation)
  • Information transfer is allowed to be slow. Efficient chemical energy is used whenever possible.  The release of neurotransmitters across synaptic channels and the opening of ion channels for membrane potentiation is a much slower but efficient method of energy buildup and transfer.  Inefficient (but fast) electrical spikes are only used when information content is sufficiently high and if the signal must travel quickly across a longer distance.
  • Components are reused within many contexts. A single neuron may be part of hundreds of memories and invariant concepts.  Dendritic branches may identify many different types of patterns.  Whole sections of the neocortex can be re-purposed when unused (for example, in the brain of a blind person, auditory and other cognitive processing may utilize the cortical sections normally dedicated to vision, thus heightening their other senses and allowing for deeper analysis of non-visual inputs)

Side note: I highly recommend the book “Principles of Neural Design” by Peter Sterling and Simon Laughlin, as they reverse-engineer neural design, while keeping the topics easily digestible.


The efficiency of time

Morse Code was developed in the 1800s as a means of communicating over distance using only a single bit of data.  At a given moment, the bit is either ON (1) or OFF (0).  However, when the fourth dimension (time) is added to the equation, that single bit can be used to express arbitrary language and mathematics.  Theoretically, that bit could represent anything in existence (with infinitesimally small intervals and/or an infinite amount of time).

The classic phrase “SOS”, the international code for emergency, is represented in Morse Code by a sequence of “3 dots, 3 dashes, 3 dots”, which could be compactly represented as a binary sequence:

10101 00 11011011 00 10101

Here we see that we can represent “SOS” with 1 bit over 22 time-steps, or equivalently as a static binary sequence with 22 bits.  By moving storage and computation from space into time, we drastically change the scope of our problem.  Single-bit technologies (for example, a ship’s smokestack or a flashlight) can now produce complex language when viewed through time.  For a given length of time T, and time interval dT, a single bit can represent N (= T / dT) bits when viewed as a sequence through time.
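
A few lines of Julia make this encoding concrete (dot = 1, dash = 11, intra-letter gap = 0, inter-letter gap = 00; the helper names are mine):

morse = Dict('S' => "...", 'O' => "---")
encode_symbol(c) = c == '.' ? "1" : "11"
encode_letter(l) = join(map(encode_symbol, collect(morse[l])), "0")
encode_word(w) = join(map(encode_letter, collect(w)), "00")

encode_word("SOS")   # "1010100110110110010101" -- 22 time-steps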

Moving storage and representation from space into time will allow for equivalent representations of data at a fraction of the resources.


Brain vs Computer

Computers (the Von Neumann type) have theoretically infinite computational ability, owing to the idea that they are Turing-Complete.  However, they have major flaws of inefficiency:

  • System memory is moved too coarsely.  Memory works in bytes, words, and pages, often requiring excessive storage or memory transfer (for example, when you only need to read 1 bit from memory)
  • Core data is optimized for worst case requirements.  Numeric operations are generally optimized using 64-bit floating point, which is immensely wasteful, especially when many times the value may be zero. (See Unums for an example effort in reducing this particular inefficiency)
  • All processing is super-fast, even when a result is not needed immediately. Moving fast requires more energy than moving slow.  There is only one speed for a CPU, thus most calculations use more energy than required.
  • Memory can be used for only one value at a time.  Once something “fills a slot” in memory, it is unavailable for the rest of the system.  This is a problem for maintaining a vast long-term memory store, as generally this requires a copy to (or retrieval from) an external archive (hard disk, cloud storage, etc) which is slow to access and limited in capacity.
  • Distributed/parallel processing is hard. There are significant bottlenecks in the memory and processing pipelines, and many current algorithms are only modestly parallelizable.

It quickly becomes clear that we need specialized hardware to approach brain-like efficiency (and thus specialized algorithms that can take advantage of this hardware).  This hardware must be incredibly flexible; allowing for minimal bit usage, variable processing speed, and highly parallel computations.  Memory should be holistically interleaved through the core processing elements, eliminating the need for external memory stores (as well as the bottleneck that is the memory bus).  In short, the computers we need look nothing like the computers we have.


Sounds impossible… I give up

Not so fast!  There is a long way to go until true AGI is developed, however technological progress tends to be exponential in ways impossible to predict.  I feel that we’re close to identifying the core algorithms that drive generalized intelligence.  Once identified, alternatives can be developed which make better use of our (inadequate) hardware.  Once intelligent algorithms (but still inefficient) can be demonstrated to outperform the current swath of “weak AI” tools in a robust and generalized way, specialized hardware will follow.  We live in a Renaissance for AI, where I expect exponential improvements and ground-breaking discoveries in the years to come.


What next?

There are several important areas of research which I feel will contribute to identifying those core algorithms comprising intelligence. I’ll highlight the importance of:

  • time in networks of neural components. See research from Eugene Izhikevich et al, who is currently applying simulations of cortical networks towards robotics.  His research into Polychronous Neuronal Groups (PNG) and applications to memory should open a more rigorous mathematical framework for studying spiking networks.  It also supports the importance of delays in synaptic energy transfer as a core piece of knowledge and memory.
  • non-linear integration within dendritic trees. There is much evidence that the location of a synapse within the dendritic tree of a neuron changes the final impact on somatic membrane potential (which in turn determines if/when a neuron will spike).  Synapses closer to the soma contribute directly, while distal synapses may exhibit supralinear (coincident detection) or sublinear (global agreement) integration.  I believe that understanding the host of algorithms provided by the many types of dendritic structures will expand the generalizing ability of neural networks.  These algorithms are likely the basis of prediction and pattern recognition with historical context.  (See Jeff Hawkins’ On Intelligence for a high-level view of prediction in context, or Jakob Hohwy’s The Predictive Mind for a more statistical take.)
  • diversity and specialization of components. The neocortex is surprisingly uniform given the vast range of abilities, however it is composed of many different types of neurons, chemicals, and structures, all connected through a complex, recurrent, and highly collaborative network of hierarchical layers.  The brain has many different components so that each can specialize to fulfill a specific requirement with the proper efficiency.  A single type of neuron/dendrite/synapse in isolation cannot have the expressivity (with optimal efficiency) as that of a network with highly diverse and specialized components.
  • local unsupervised learning. For the brain to make sense out of the massive amounts of sensory input data, it must be able to compress that data locally in a smart way without the guide of a teacher.  Local learning likely takes the form of adding and removing synaptic connections and some sort of Hebbian learning (such as Spike-Timing Dependent Plasticity).  However, in an artificial network, we have the ability to shape the network in ways which are more difficult for biology.  Imagine neuronal migration which shifts the synaptic transmission delays between neurons, or re-forming of the dendritic tree structure.  These are things which might happen in mammals over generations from evolutionary forces, but which may be powerful learning paradigms for us in a computer on more useful time scales.
  • global reinforcement learning. Clearly at some point a teacher can help us learn.  In humans we produce neurotransmitters, such as dopamine, which adjust the local learning rate on a global scale.  This is why Pavlov had success in his experiments: the reward signal (food) had an impact on the whole dog, thus strengthening the recognition of all patterns which consistently preceded the reward.  The consistent ringing of a bell allowed the neuronal connections between the sensory pattern “heard a bell” and the motor command “salivate” to strengthen.  In this experiment, you may be thinking that this is not a trait we should copy: “The dumb dog gets tricked by the bell!”  However, if the pattern is consistent, one is likely able to identify impending reward (or punishment) both more quickly and more efficiently by integrating alternative predictive patterns.  If a bell always rings, how does one really know for sure what caused the reward?  If the bell is then consistently rung without the presentation of food, the “bell to salivate” connections should subsequently be weakened.  Learning is continuous and, in some sense, probabilistic.  One global reward signal is an efficient way to adjust the learning, and true causal relationships should win out.

Summary

The human brain is complex and powerful, and the neural networks of today are too inefficient to be the model of future intelligent systems.  We must focus energy on new algorithms and network structures, incorporating efficiency through time in novel ways.  I plan to expand on some of these topics in future posts, and begin to discuss the components and connectivity that could compose the next generation of neural networks.