This is a demonstration of using JuliaML and TensorFlow to train an LSTM network.
It is based on Aymeric Damien’s LSTM tutorial in Python.
All the explanations are my own, but the code is generally similar in intent.
There are also some differences in terms of network-shape.
The task is to use LSTM to classify MNIST digits.
That is image recognition.
The normal way to solve such problems is a ConvNet.
This is not a sensible use of an LSTM; after all, it is not a time-series task.
The task is made into a time-series task by having the image arrive one row at a time;
the network is asked to output the class at the end, after seeing the 28th row.
So the LSTM network must remember the 27 rows that came before.
This is a toy problem to demonstrate that it can.
To do this we are going to use a bunch of packages from the JuliaML Org, as well as a few others.
A lot of the packages in JuliaML are evolving fast, so some things here may be out of date.
You can install the packages used in this demo by running
Pkg.add.(["TensorFlow", "Distributions", "ProgressMeter", "MLLabelUtils", "MLDataUtils"])
and Pkg.clone("https://github.com/JuliaML/MLDatasets.jl.git").
MLDatasets.jl is not yet registered, so you need to clone that one.
Also right now (24/01/2017), we are using the dev branch of MLDataUtils.jl,
so you will need to do the git checkout stuff to make that work.
Hopefully that will be merged into master very soon, and then the normal Pkg.add will suffice.
You also need to install TensorFlow itself, as it is not automatically installed by the TensorFlow.jl package.
We will go through each package we use in turn.
In [1]:
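In code, that is just a matter of loading them all:

```julia
# Load every package used in this demo.
using TensorFlow
using Distributions
using ProgressMeter
using MLLabelUtils
using MLDataUtils
using MLDatasets
```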
We will begin by defining some of the parameters for our network as constants.
Our network has 28 inputs – one row of pixels – and each image consists of 28 time steps, so that each row is shown in turn.
The other parameters should be fairly self-explanatory.
In [2]:
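A sketch of those constants; the hidden-layer size, batch size and learning rate here are illustrative choices rather than tuned values:

```julia
# Network parameters. n_steps and n_input are fixed by the row-by-row MNIST
# setup; the remaining values are illustrative choices.
const n_steps       = 28     # each image is presented as 28 time steps
const n_input       = 28     # one row of 28 pixels per step
const n_hidden      = 128    # LSTM hidden-layer size
const n_classes     = 10     # digits 0-9
const batch_size    = 256
const learning_rate = 0.001
```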
We are going to use the MNIST dataset, from MLDatasets.jl.
It is a handy way to get hold of the data.
The first time you call one of its data functions it will download the data.
After that it will load it from disk.
It is a nice implementation, simply done using an isfile(path) || download(url, path) check at the start of the method.
I would like to implement something similar for CorpusLoaders.jl.
We check its shape – the data is a 3D Array, (col, row, item), and the labels are integers.
We also define a quick imshow function to draw ASCII art, so we can check it out.
In [3]:
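Something along these lines, assuming MNIST.traindata() is the accessor (check the MLDatasets.jl docs for the exact name) and using a crude threshold for the ASCII art:

```julia
# Fetch (and on first call, download) the MNIST training set.
# Assumed accessor: MNIST.traindata() returning (images, labels).
traindata_raw, trainlabels_raw = MNIST.traindata()
traindata_raw = Float64.(traindata_raw)  # plain floats, ready for feeding to TensorFlow later

@show size(traindata_raw)    # expect (col, row, item)
@show size(trainlabels_raw)  # one integer label per item

# Crude ASCII-art rendering of one digit, to sanity check the data.
function imshow(img)
    for row in 1:size(img, 2)
        println(join(img[col, row] > 0.5 ? '#' : ' ' for col in 1:size(img, 1)))
    end
end

imshow(traindata_raw[:, :, 1])
println("label: ", trainlabels_raw[1])
```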
We use MLLabelUtils.jl to encode the labels, and MLDataUtils.jl to segment the labels and the data into minibatches. That is how those two packages fit together.
If something applies only to labelled data, e.g. encodings, then it is done with MLLabelUtils.
If it applies to data in general, e.g. partitioning the data, then it is done with MLDataUtils.
They are nice stand-alone packages that can be chained in with other JuliaML packages,
or used in an independent system, which is more like what we are doing here with TensorFlow.jl.
When it comes to encoding the labels, we use convertlabel from MLLabelUtils.
Its signature is convertlabel(output_encoding, input_labels, input_encoding).
We provide both the desired (output) encoding, and the current (input) encoding.
This ensures that the input is interpreted correctly and consistently.
If we do not provide the input encoding, then MLLabelUtils would infer one.
The encoding it would infer (because the input is not strictly positive integers) is that the labels are arbitrary.
It would thus devise a NativeLabels mapping, based on the order in which the labels occur in the input.
That mapping would not be saved anywhere, so when it comes time to encode the test data we would not know which index corresponds to which label.
So we declare the input encoding. (Alternatives would be to infer it using labelenc(labels_raw) and then record the inferred encoding for later; or to add 1 to all the raw labels so they are in the range 1:10, which causes the labels to be inferred as LabelEnc.Indices{Int64,10}().)
To break the data down into minibatches, we use a BatchView from MLDataUtils.
BatchView is an iterator that efficiently returns the data one minibatch at a time.
There are a few requirements on the input,
but a Julia Array meets all of them.
It also nicely lets you specify which dimension the observations are on,
though in our case it is the last, which is the default.
We will use the data in batches later, once we have defined the network graph.
In [4]:
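A sketch of the encoding and batching described above; the positional BatchView(data, size) form is an assumption about the current MLDataUtils API:

```julia
# One-hot encode the labels, declaring the input encoding explicitly so that
# the training and test labels get encoded consistently.
trainlabels_hot = convertlabel(LabelEnc.OneOfK{Float32}, trainlabels_raw,
                               LabelEnc.NativeLabels(collect(0:9)))

# Break both the data and the labels into minibatches along the last
# (observation) dimension, which is BatchView's default.
traindata   = BatchView(traindata_raw, batch_size)
trainlabels = BatchView(trainlabels_hot, batch_size)
```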
Now to define the network graph; this is done using TensorFlow.jl.
TensorFlow is basically a linear-algebra toolkit, featuring automatic differentiation and optimisation methods,
which makes it awesome for implementing neural networks.
It does have some neural-net-specific stuff (a lot of which is in contrib rather than the core src), such as the LSTMCell,
but it is a lot more general than just neural networks.
It is more like Theano than it is like Mocha, Caffe or SKLearn.
This means it is actually flexible enough to be useful for (some) machine learning research, rather than only for applying standard networks.
Which is great, because I get tired of doing the backpropagation calculus by hand on weird network topologies.
But today we are just going to use it on a standard kind of network, this LSTM.
We begin by defining our variables in a fairly standard way.
This is very similar to what you would see in a feedforward net; see the examples from TensorFlow.jl's manual.
For our purposes, TensorFlow has 4 kinds of network elements:
- Placeholders, like X and Y_obs – these are basically input elements. We declare that this symbol is a placeholder for data we are going to feed in when we run the network.
- Variables, like W, B, and what is hidden inside the LSTMCell – these are things that can be adjusted during training.
- Derived Values, like Y_pred, cost, accuracy, Hs and x – these nodes hold the values returned from some operation; they can be your output, or they can be steps in the middle of a chain of such operations.
- Action Nodes, like optimizer. When these nodes are interacted with (e.g. output from run), they do something to the network. optimizer in our case adjusts the Variables to optimise the value of its function input – cost.
The last two terms, Derived Values and Action Nodes, I made up.
That is how I think of them, but you probably won't see them in any kind of official documentation, or in the source code.
So we first declare our inputs as Placeholders.
You will note that they are being sized in terms of the batch size here.
We then define the variables W and B;
note that we use get_variable rather than declaring them directly,
because in general that is the preferred way, and it lets us use the initializer etc.
We use the Normal distribution as an initialiser; this comes from Distributions.jl.
It is set to a higher variance than I would normally use, but it seems to work well enough.
In [6]:
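A sketch of those definitions; passing a Distributions.jl Normal as the initializer inside a variable_scope is how I understand TensorFlow.jl handles it:

```julia
sess = Session(Graph())

# Inputs for one minibatch: X is (steps, pixels-per-row, items),
# Y_obs is (items, classes).
X     = placeholder(Float32, shape=[n_steps, n_input, batch_size])
Y_obs = placeholder(Float32, shape=[batch_size, n_classes])

# Output-layer weights and bias, initialised from a (deliberately wide)
# Normal distribution via Distributions.jl.
variable_scope("model", initializer=Normal(0, 0.5)) do
    global W = get_variable("W", [n_hidden, n_classes], Float32)
    global B = get_variable("B", [n_classes], Float32)
end
```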
We now want to hook the input X into an RNN, made using LSTMCells.
To do this we need the data to be a list of tensors (matrices, since they are rank 2),
where
- each element of the list is a different time step (i.e. a different row of each image),
- going down the second index of the matrix moves within a single input step (i.e. along the same row of the original image),
- and going down the first index puts you on to the next item in the batch.
Initially we have (steps, observations, items); we are going to use x repeatedly as a temporary variable.
We use transpose to reorder the indexes, so that it is (steps, items, observations).
Then reshape to merge/splice/weave the first two indexes into one index, (steps-items, observations).
Then split to cut along the first index, making a list of tensors [(items1, observations1), (items2, observations2), ...].
This feels a bit hacky as a way to do it, but it works.
I note here that transpose feels a little unidiomatic in particular, since it is 0-indexed, needs the cast to Int32 (you'll get an error without that), and since the matching Julia function is called permutedims – I would not be surprised if this changed in future versions of TensorFlow.jl.
In [7]:
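A sketch of that reshaping; the 0-indexed permutation and the argument order of split are the parts most likely to need adjusting for your TensorFlow.jl version:

```julia
# (steps, observations, items) -> (steps, items, observations);
# note the 0-indexed permutation and the Int32 cast.
x = transpose(X, Int32.([0, 2, 1]))
# Weave the first two indexes together: (steps*items, observations).
x = reshape(x, [n_steps * batch_size, n_input])
# Cut along the first index into a list of (items, observations) matrices,
# one per time step.
x = split(1, n_steps, x)
```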
Now we construct an LSTMCell, and we put that cell into an rnn.
The LSTMCell makes up all the LSTM machinery, with forget gates etc.,
and the rnn basically replicates the cell for each time step and hooks each copy up to its element of x.
It returns the output hidden layers Hs and the states.
We don't really care about the states, but Hs is a list of Derived Value kind of tensors.
There is one of them for each of the input steps.
We want to hook up only the last one to our next softmax stage, so we do so with Hs[end].
Finally we hook up the output layer to get Y_pred, using a fairly standard softmax formulation.
Aymeric Damien’s Python code doesn’t seem to use a softmax output.
I tried without a softmax output and I couldn’t get it to work at all.
This may be to do with the rnn and LSTMCell in Julia being a little crippled;
they don’t have the full implementation of the Python API.
In particular, I couldn’t work out a way to initialise the forget_bias to one,
so I am not sure if that is messing things up and making it a bit unstable at times.
Also, right now there is only support for the static rnn rather than the dynamic_rnn which all the cool kids apparently use (see rnn vs. dynamic_rnn in this article).
This will probably come in time.
So if everything is set up correctly, the shape of the output Y_pred should match the shape of the input Y_obs.
In [8]:
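A sketch of that hookup; the module paths (nn.rnn_cell.LSTMCell, nn.rnn) are my reading of the TensorFlow.jl API:

```julia
# One LSTM cell, replicated across the 28 time steps by nn.rnn.
cell = nn.rnn_cell.LSTMCell(n_hidden)
Hs, states = nn.rnn(cell, x)

# Only the final hidden state feeds the softmax output layer.
Y_pred = nn.softmax(Hs[end] * W + B)
```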
Finally we define the last few nodes of our network.
These are the cost, which is used to define the optimizer, and the accuracy.
The cost is defined using the definition of cross-entropy.
Right now we have to put it in manually, because TensorFlow.jl has not yet implemented that as nn.nce_loss (there is just a stub there).
So we use this cross-entropy as the cost function for an AdamOptimizer, to make our optimizer node.
We also make an accuracy node for use during reporting.
This is done by counting the proportion of the outputs Y_pred that match the inputs Y_obs,
using the cast-the-boolean-to-a-float-then-take-its-mean trick.
Here it is worth mentioning that nodes in TensorFlow that are not between the supplied input and the requested output are not evaluated.
This means that if one does run(sess, [optimizer], Dict(X=>xs, Y_obs=>ys)),
then the accuracy node will never be evaluated.
It does not need to be evaluated to get the optimizer node (but cost does).
We will run the network in the next step.
In [9]:
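A sketch of those nodes; the cross-entropy is written out directly, and the argmax comparison in the accuracy is the piece I am least sure of, API-wise:

```julia
# Manual cross-entropy cost over the whole batch.
cost = -reduce_sum(Y_obs .* log(Y_pred))

# Adam-based optimizer node that minimises the cost.
optimizer = train.minimize(train.AdamOptimizer(learning_rate), cost)

# Accuracy: compare the argmax class of prediction and observation along the
# class dimension, cast the booleans to floats, and take their mean.
correct  = indmax(Y_pred, 2) == indmax(Y_obs, 2)
accuracy = reduce_mean(cast(correct, Float32))
```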
Finally we can run our training.
So we go through a zip of traindata and trainlabels we prepared earlier,
run the optimizer on each,
and periodically check the accuracy on the last batch to give status updates.
It is all very nice.
In [10]:
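A sketch of the training loop, using the batched traindata and trainlabels from earlier (the label batches are transposed to line up with Y_obs, and you may need to collect the batch views before feeding them):

```julia
# Initialise all the Variables before training.
run(sess, global_variables_initializer())

for (ii, (xs, ys)) in enumerate(zip(traindata, trainlabels))
    # One optimisation step on this minibatch; ys is (classes, items),
    # so transpose it to match Y_obs's (items, classes) shape.
    run(sess, optimizer, Dict(X => xs, Y_obs => ys'))

    if ii % 50 == 1
        # Periodic status update: accuracy on the current batch only.
        acc = run(sess, accuracy, Dict(X => xs, Y_obs => ys'))
        println("Batch $ii: training-batch accuracy = $acc")
    end
end
```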
Finally we check how we are going on the test data.
However, as all our nodes have been defined in terms of batch_size,
we are going to need to process the test data in minibatches also.
I feel like there should be a cleaner way to do this than that.
This is a chance to show off the awesomeness that is ProgressMeter.jl's @showprogress.
This displays a unicode-art progress bar, marking progress through the iteration.
Very neat.
In [11]:
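A sketch of the evaluation, assuming MNIST.testdata() as the accessor and simply averaging the per-batch accuracies:

```julia
# Load and encode the test set the same way as the training set.
testdata_raw, testlabels_raw = MNIST.testdata()   # assumed accessor name
testlabels_hot = convertlabel(LabelEnc.OneOfK{Float32}, testlabels_raw,
                              LabelEnc.NativeLabels(collect(0:9)))

batch_accuracies = Float64[]
@showprogress for (xs, ys) in zip(BatchView(Float64.(testdata_raw), batch_size),
                                  BatchView(testlabels_hot, batch_size))
    push!(batch_accuracies, run(sess, accuracy, Dict(X => xs, Y_obs => ys')))
end
println("test accuracy ≈ ", mean(batch_accuracies))
```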
90% accuracy – not bad for an unoptimised network, particularly one as unsuited to the task as an LSTM.
I hope this introduction to JuliaML and TensorFlow has been enlightening.
There is lots of information about TensorFlow online, though to understand the Julia wrapper I had to look at the source code more often than at its docs. But that will get better with maturity, and the docs line up with the Python API quite well a lot of the time.
The docs for the new version of MLDataUtils are still being finished off (that is the main blocker on it being merged as I understand it).
Hopefully tutorials like this let you see how these all fit together to do something useful.