Tag Archives: Neural Nets

Intro to Sparse Data and Embeddings

Re-posted from: https://tensorflowjulia.blogspot.com/2018/09/intro-to-sparse-data-and-embeddings.html

This is the final exercise of Google’s Machine Learning Crash Course. We use the ACL 2011 IMDB dataset to train a Neural Network in predicting wether a movie review is favourable or not, based on the words used in the review text.

There are two notable differences from the original exercise:

We do not build a proper input pipeline for the data. This creates a lot of computational overhead – in principle, we need to preprocess the whole dataset before we start training the network. In practise, this if often not feasible. It would be interesting to see how such a pipeline can be implemented for TensorFlow.jl. The Julia package MLLabelUtils.jl might come handy for this task.
When visualizing the embedding layer, our Neural Network builds effectively a 1D-representation of keywords to describe if a movie has a favorable review or not. In the Python version, a real 2D embedding is obtained (see the pictures). The reasons for this difference are unknown.

Julia embedding – effectively a 1D line

Python embedding

The Jupyter notebook can be downloaded here.

This notebook is based on the file Embeddings programming exercise, which is part of Google’s Machine Learning Crash Course.

In [0]:

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Intro to Sparse Data and Embeddings

Learning Objectives:

Convert movie-review string data to a feature vector
Implement a sentiment-analysis linear model using a feature vector
Implement a sentiment-analysis DNN model using an embedding that projects data into two dimensions
Visualize the embedding to see what the model has learned about the relationships between words

In this exercise, we’ll explore sparse data and work with embeddings using text data from movie reviews (from the ACL 2011 IMDB dataset). Open and run TFrecord Extraction.ipynb Colaboratory notebook to extract the data from the original .tfrecord file as Julia variables.

Setup

Let’s import our dependencies and open the training and test data. We have exported the test and training data as hdf5 files in the previous step, so we use the HDF5-package to load the data.

In [1]:

using Plots
using Distributions
gr()
using DataFrames
using TensorFlow
import CSV
import StatsBase
using PyCall
@pyimport sklearn.metrics as sklm
using Images
using Colors
using HDF5

sess=Session(Graph())

Out[1]:

Session(Ptr{Void} @0x0000000128c2ac30)

Open the test and training raw data sets.

In [2]:

c = h5open("train_data.h5", "r") do file
   global train_labels=read(file, "output_labels")
   global train_features=read(file, "output_features")
end
c = h5open("test_data.h5", "r") do file
   global test_labels=read(file, "output_labels")
   global test_features=read(file, "output_features")
end
train_labels=train_labels'
test_labels=test_labels';

2018-09-16 16:21:02.045635: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA

Have a look at one data set:

In [4]:

test_labels[301,:]

Out[4]:

1-element Array{Float32,1}:
 1.0

In [5]:

test_features[301]

Out[5]:

"['\"' 'don' \"'\" 't' 'change' 'your' 'husband' '\"' 'is' 'another' 'soap'\n 'opera' 'comedy' 'from' 'producer' '/' 'director' 'cecil' 'b' '.' 'de'\n 'mille' '.' 'it' 'is' 'notable' 'as' 'the' 'first' 'of' 'several' 'films'\n 'he' 'made' 'starring' 'gloria' 'swanson' '.' 'i' 'guess' 'you' 'could'\n 'also' 'call' 'it' 'a' 'sequel' 'of' 'sorts' 'to' 'his' '\"' 'old' 'wives'\n 'for' 'new' '\"' '(' '####' ')' '.' 'james' '(' 'elliot' 'dexter' ')'\n 'and' 'leila' '(' 'swanson' ')' 'porter' 'are' 'a' 'fortyish' 'couple'\n 'where' 'james' 'has' 'gone' 'to' 'seed' 'and' 'become' 'slovenly' 'and'\n 'lazy' '.' 'he' 'has' 'a' 'penchant' 'for' 'smelly' 'cigars' 'and'\n 'eating' 'raw' 'onions' '.' 'he' 'takes' 'his' 'wife' 'for' 'granted' '.'\n 'leila' 'tries' 'to' 'get' 'him' 'to' 'straighten' 'out' 'to' 'no'\n 'avail' '.' 'one' 'night' 'at' 'a' 'dinner' 'party' 'at' 'the' 'porters'\n ',' 'leila' 'meets' 'the' 'dashing' 'schyler' 'van' 'sutphen' '(' 'now'\n \"there's\" 'a' 'moniker' ')' ',' 'the' 'playboy' 'nephew' 'of' 'socialite'\n 'mrs' '.' 'huckney' '(' 'sylvia' 'ashton' ')' '.' 'she' 'invites' 'leila'\n 'to' 'her' 'home' 'for' 'the' 'weekend' 'to' 'make' 'james' '\"' 'miss'\n 'her' '\"' '.' 'once' 'there' 'schyler' 'begins' 'to' 'put' 'the' 'moves'\n 'on' 'her' ',' 'promising' 'her' 'pleasure' ',' 'wealth' 'and' 'love' ','\n 'if' 'she' 'will' 'leave' 'her' 'husband' 'and' 'go' 'with' 'him' '.'\n 'the' 'sequences' 'involving' \"leila's\" 'imagining' 'this' 'promised'\n 'new' 'life' 'are' 'lavishly' 'staged' 'and' 'forecast' 'de' \"mille's\"\n 'epic' 'costume' 'drams' 'later' 'in' 'his' 'career' '.' 'leila' ','\n 'bored' 'with' 'her' 'marriage' 'and' 'her' 'disinterested' 'husband' ','\n 'divorces' 'james' 'and' 'marries' 'the' 'playboy' '.' 'james'\n 'ultimately' 'realizes' 'that' 'he' 'has' 'lost' 'the' 'only' 'thing'\n 'that' 'mattered' 'to' 'him' 'and' 'begins' 'to' 'mend' 'his' 'ways' '.'\n 'he' 'shaves' 'off' 'his' 'mustache' ',' 'works' 'out' ',' 'shuns'\n 'onions' 'and' 'reacquires' 'some' 'manners' '.' 'meanwhile' ',' 'all'\n 'is' 'not' 'rosy' 'with' \"leila's\" 'new' 'marriage' '.' 'schyler' 'it'\n 'seems' 'likes' 'to' 'gamble' 'and' 'has' 'taken' 'up' 'with' 'the'\n 'gold' 'digging' 'nanette' '(' 'aka' 'tootsie' ',' 'or' 'some' 'such'\n 'name' ')' '(' 'julia' 'faye' ')' '.' 'schyler' 'loses' 'all' 'of' 'his'\n 'money' 'and' 'steals' \"leila's\" 'diamond' 'ring' 'to' 'cover' 'his'\n 'losses' '.' 'one' 'fateful' 'day' ',' 'leila' 'meets' 'the' '\"' 'new'\n '\"' 'james' 'and' 'is' 'taken' 'by' 'the' 'changes' 'in' 'him' '.'\n 'james' 'drives' 'her' 'home' 'and' 'becomes' 'aware' 'of' 'her'\n 'situation' 'and' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.'\n '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.'\n '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.' '.'\n 'this' 'film' 'marked' 'the' 'beginning' 'of' 'gloria' \"swanson's\" 'rise'\n 'to' 'super' 'stardom' 'in' 'a' 'career' 'that' 'would' 'rival' 'that'\n 'of' 'mary' 'pickford' '.' 'barely' '##' 'years' 'of' 'age' ',' 'she'\n 'had' 'begun' 'her' 'career' 'in' 'mack' 'sennett' 'two' 'reel'\n 'comedies' 'as' 'a' 'teen' 'ager' '.' 'elliot' 'dexter' 'was' 'almost'\n '##' 'at' 'this' 'time' 'but' 'he' 'and' 'swanson' 'make' 'a' 'good'\n 'team' ',' 'although' \"it's\" 'hard' 'to' 'imagine' 'anyone' 'tiring' 'of'\n 'the' 'lovely' 'miss' 'swanson' 'as' 'is' 'the' 'case' 'in' 'this' 'film'\n '.' 'dexter' 'and' 'sylvia' 'ashton' 'had' 'appeared' 'in' 'the'\n 'similar' '\"' 'old' 'wives' 'for' 'new' '\"' 'where' 'the' 'wife' 'had'\n 'gone' 'to' 'seed' 'and' 'the' 'husband' 'was' 'wronged' '.' 'also' 'in'\n 'the' 'cast' 'are' 'de' 'mille' 'regulars' 'theodore' 'roberts' 'as' 'a'\n 'bishop' 'and' 'raymond' 'hatton' 'as' 'a' 'gambler' '.']"

Building a Sentiment Analysis Model

Let’s train a sentiment-analysis model on this data that predicts if a review is generally favorable (label of 1) or unfavorable (label of 0).

To do so, we’ll turn our string-value terms into feature vectors by using a vocabulary, a list of each term we expect to see in our data. For the purposes of this exercise, we’ve created a small vocabulary that focuses on a limited set of terms. Most of these terms were found to be strongly indicative of favorable or unfavorable, but some were just added because they’re interesting.

Each term in the vocabulary is mapped to a coordinate in our feature vector. To convert the string-value terms for an example into this vector format, we encode such that each coordinate gets a value of 0 if the vocabulary term does not appear in the example string, and a value of 1 if it does. Terms in an example that don’t appear in the vocabulary are thrown away.

NOTE: We could of course use a larger vocabulary, and there are special tools for creating these. In addition, instead of just dropping terms that are not in the vocabulary, we can introduce a small number of OOV (out-of-vocabulary) buckets to which you can hash the terms not in the vocabulary. We can also use a feature hashing approach that hashes each term, instead of creating an explicit vocabulary. This works well in practice, but loses interpretability, which is useful for this exercise.

Building the Input Pipeline

First, let’s configure the input pipeline to import our data into a TensorFlow model. We can use the following function to parse the training and test data and return an array of the features and the corresponding labels.

In [3]:

function create_batches(features, targets, steps, batch_size=5, num_epochs=0)
  """Create batches.

  Args:
    features: Input features.
    targets: Target column.
    steps: Number of steps.
    batch_size: Batch size.
    num_epochs: Number of epochs, 0 will let TF automatically calculate the correct number
  Returns:
    An extended set of feature and target columns from which batches can be extracted.
  """      
    if(num_epochs==0)
        num_epochs=ceil(batch_size*steps/size(features,1))
    end
    
    features_batches=copy(features)
    target_batches=copy(targets)
    
    for i=1:num_epochs        
        select=shuffle(1:size(features,1))
        if i==1
            features_batches=(features[select,:])
            target_batches=(targets[select,:])
        else
            features_batches=vcat(features_batches, features[select,:])
            target_batches=vcat(target_batches, targets[select,:])
        end
    end
    return features_batches, target_batches 
end

Out[3]:

create_batches (generic function with 3 methods)

In [4]:

function construct_feature_columns(input_features):
  """Construct the TensorFlow Feature Columns.

  Args:
    input_features: The numerical input features to use.
  Returns:
    A set of feature columns
  """ 
  out=convert(Array, input_features[:,:])
  return convert.(Float64,out)
end

Out[4]:

construct_feature_columns (generic function with 1 method)

In [5]:

function next_batch(features_batches, targets_batches, batch_size, iter)
  """Next batch.

  Args:
    features_batches: Features batches from create_batches.
    targets_batches: Target batches from create_batches.
    batch_size: Batch size.
    iter: Number of the current iteration
  Returns:
    A batch of features and targets.
  """ 
    select=mod((iter-1)*batch_size+1, size(features_batches,1)):mod(iter*batch_size, size(features_batches,1));

    ds=features_batches[select,:];
    target=targets_batches[select,:];
    
    return ds, target
end

Out[5]:

next_batch (generic function with 1 method)

In [6]:

function my_input_fn(features_batches, targets_batches, iter, batch_size=5, shuffle_flag=1):
    """Prepares a batch of features and labels for model training.
  
    Args:
      features_batches: Features batches from create_batches.
      targets_batches: Target batches from create_batches.
      iter: Number of the current iteration
      batch_size: Batch size.
      shuffle_flag: Determines wether data is shuffled before being returned
    Returns:
      Tuple of (features, labels) for next data batch
    """  

    # Construct a dataset, and configure batching/repeating.
    ds, target = next_batch(features_batches, targets_batches, batch_size, iter)
    
    # Shuffle the data, if specified.
    if shuffle_flag==1
      select=shuffle(1:size(ds, 1));
        ds = ds[select,:]
        target = target[select, :]
    end
    
    # Return the next batch of data.
    return ds, target
end

Out[6]:

my_input_fn (generic function with 3 methods)

Task 1: Use a Linear Model with Sparse Inputs and an Explicit Vocabulary

For our first model, we’ll build a Linear Classifier model using 50 informative terms; always start simple!

The following code constructs the feature column for our terms.

In [7]:

# 50 informative terms that compose our model vocabulary 
informative_terms = ["bad", "great", "best", "worst", "fun", "beautiful",
                     "excellent", "poor", "boring", "awful", "terrible",
                     "definitely", "perfect", "liked", "worse", "waste",
                     "entertaining", "loved", "unfortunately", "amazing",
                     "enjoyed", "favorite", "horrible", "brilliant", "highly",
                     "simple", "annoying", "today", "hilarious", "enjoyable",
                     "dull", "fantastic", "poorly", "fails", "disappointing",
                     "disappointment", "not", "him", "her", "good", "time",
                     "?", ".", "!", "movie", "film", "action", "comedy",
                     "drama", "family"]

Out[7]:

50-element Array{String,1}:
 "bad"       
 "great"     
 "best"      
 "worst"     
 "fun"       
 "beautiful" 
 "excellent" 
 "poor"      
 "boring"    
 "awful"     
 "terrible"  
 "definitely"
 "perfect"   
 ⋮           
 "her"       
 "good"      
 "time"      
 "?"         
 "."         
 "!"         
 "movie"     
 "film"      
 "action"    
 "comedy"    
 "drama"     
 "family"

The following function takes the input data and vocabulary and converts the data to a one-hot encoded matrix.

In [8]:

# function for creating categorial colum from vocabulary list in one hot encoding
function create_data_columns(data, informative_terms)
   onehotmat=zeros(length(data), length(informative_terms))
   
    for i=1:length(data)
        string=data[i]
        for j=1:length(informative_terms)
            if contains(string, informative_terms[j])
                onehotmat[i,j]=1
            end
        end
    end
    return onehotmat
end

Out[8]:

create_data_columns (generic function with 1 method)

In [9]:

train_feature_mat=create_data_columns(train_features, informative_terms)
test_features_mat=create_data_columns(test_features, informative_terms);

Next, we’ll construct the Linear Classifier model, train it on the training set, and evaluate it on the evaluation set. After you read through the code, run it and see how you do.

In [10]:

function train_linear_classifier_model(learning_rate,
                     steps, 
                     batch_size, 
                     training_examples, 
                     training_targets, 
                     validation_examples, 
                     validation_targets)
  """Trains a linear classifier model.
  
  Args:
    learning_rate: A `float`, the learning rate.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    training_examples, etc: The input data.
  Returns:
    weight: The weights of the linear model.
    bias: The bias of the linear model.
    validation_probabilities: Probabilities for the validation examples.
    p1: Plot of loss function for the different periods
  """
  
  periods = 10
  steps_per_period = steps / periods

  # Create feature columns.
  feature_columns = placeholder(Float32)
  target_columns = placeholder(Float32)
  eps=1E-8
  
  # these two variables need to be initialized as 0, otherwise method gives problems 
  m=Variable(zeros(size(training_examples,2),1).+0.0)
  b=Variable(0.0)

  ytemp=nn.sigmoid(feature_columns*m + b)
  y= clip_by_value(ytemp, 0.0, 1.0)
  loss = -reduce_mean(log(y+eps).*target_columns + log(1-y-eps).*(1-target_columns)) 

  features_batches, targets_batches = create_batches(training_examples, training_targets, steps, batch_size)
    
  # Advanced Adam optimizer decent with gradient clipping
  my_optimizer=(train.AdamOptimizer(learning_rate))
  gvs = train.compute_gradients(my_optimizer, loss)
  capped_gvs = [(clip_by_norm(grad, 5.0), var) for (grad, var) in gvs]
  my_optimizer = train.apply_gradients(my_optimizer,capped_gvs)
    
  run(sess, global_variables_initializer()) #this needs to be run after constructing the optimizer!
    
  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  println("Training model...")
  println("LogLoss (on training data):")
  training_log_losses = []
  validation_log_losses=[]
  for period in 1:periods
    # Train the model, starting from the prior state.
    for i=1:steps_per_period
      features, labels = my_input_fn(features_batches, targets_batches, convert(Int,(period-1)*steps_per_period+i), batch_size)
      run(sess, my_optimizer, Dict(feature_columns=>construct_feature_columns(features), target_columns=>construct_feature_columns(labels)))
    end
    # Take a break and compute predictions.
    training_probabilities = run(sess, y, Dict(feature_columns=> construct_feature_columns(training_examples)));    
    validation_probabilities = run(sess, y, Dict(feature_columns=> construct_feature_columns(validation_examples)));  
        
    # Compute loss.
    training_log_loss=run(sess,loss,Dict(feature_columns=> construct_feature_columns(training_examples), target_columns=>construct_feature_columns(training_targets)))
    validation_log_loss =run(sess,loss,Dict(feature_columns=> construct_feature_columns(validation_examples), target_columns=>construct_feature_columns(validation_targets)))
        
    # Occasionally print the current loss.
    println("  period ", period, ": ", training_log_loss)
    weight = run(sess,m)
    bias = run(sess,b)

    loss_val=run(sess,loss,Dict(feature_columns=> construct_feature_columns(training_examples), target_columns=>construct_feature_columns(training_targets)))
        
    # Add the loss metrics from this period to our list.
    push!(training_log_losses, training_log_loss)
    push!(validation_log_losses, validation_log_loss)
  end

  weight = run(sess,m)
  bias = run(sess,b)
  
  println("Model training finished.")

  # Output a graph of loss metrics over periods.
  p1=plot(training_log_losses, label="training", title="LogLoss vs. Periods", ylabel="LogLoss", xlabel="Periods")
  p1=plot!(validation_log_losses, label="validation")
    
  println("Final LogLoss (on training data): ", training_log_losses[end])

  # calculate additional ouputs
  validation_probabilities = run(sess, y, Dict(feature_columns=> construct_feature_columns(validation_examples)));    
    
  return weight, bias, validation_probabilities, p1  
end

Out[10]:

train_linear_classifier_model (generic function with 1 method)

In [14]:

weight, bias, validation_probabilities,  p1 = train_linear_classifier_model(
    0.0005, #learning rate
    1000, #steps
    50, #batch_size
    train_feature_mat,
    train_labels,
    test_features_mat,
    test_labels)

Training model...
LogLoss (on training data):
  period 1: 0.6716383366395725
  period 2: 0.6520718400586106
  period 3: 0.6347970177101597
  period 4: 0.6191596702328207
  period 5: 0.6047782721033566
  period 6: 0.5922131685262155
  period 7: 0.580839687465644
  period 8: 0.5702423812570189
  period 9: 0.5607007472318791
  period 10: 0.5519328267238274
Model training finished.

Out[14]:

([-0.36884; 0.331766; … ; 0.127656; 0.181674], 0.038435446715259586, [0.590407; 0.390335; … ; 0.662796; 0.529415], Plot{Plots.GRBackend() n=2})

Final LogLoss (on training data): 0.5519328267238274

In [15]:

plot(p1)

Out[15]:

The following function converts the validation probabilites back to 0-1-predictions.

In [13]:

# Function for converting probabilities to 0/1 decision
function castto01(probabilities)
    out=copy(probabilities)
    for i=1:length(probabilities)
        if(probabilities[i]<0.5)
            out[i]=0
        else
            out[i]=1
        end
    end
    return out
end

Out[13]:

castto01 (generic function with 1 method)

Let’s have a look at the accuracy of the model:

In [17]:

evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
    vec(construct_feature_columns(test_labels)), vec(validation_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(validation_probabilities))
validation_predictions=castto01(validation_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");

In [18]:

println("AUC on the validation set: ",  evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])

AUC on the validation set: [0.865503]
Accuracy on the validation set: [0.781591]

In [19]:

plot(p2)

Out[19]:

Task 2: Use a Deep Neural Network (DNN) Model

The above model is a linear model. It works quite well. But can we do better with a DNN model?

Let’s constructa NN classification model. Run the following cells, and see how you do.

In [11]:

function train_nn_classification_model(learning_rate,
                     steps, 
                     batch_size, 
                     hidden_units,
                     is_embedding,
                     keep_probability,
                     training_examples, 
                     training_targets, 
                     validation_examples, 
                     validation_targets)
  """Trains a neural network classification model.
  
  Args:
    learning_rate: A `float`, the learning rate.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    hidden_units: A vector describing the layout of the neural network.
    is_embedding: 'true' or 'false' depending on if the first layer of the NN is an embedding layer.
    keep_probability: A `float`, the probability of keeping a node active during one training step.
  Returns:
    p1: Plot of the loss function for the different periods.
    y: The final layer of the TensorFlow network.
    final_probabilities: Final predicted probabilities on the validation examples.
    weight_export: The weights of the first layer of the NN
    feature_columns: TensorFlow feature columns.
    target_columns: TensorFlow target columns.
  """
  
  periods = 10
  steps_per_period = steps / periods

  # Create feature columns.
  feature_columns = placeholder(Float32, shape=[-1, size(training_examples,2)])
  target_columns = placeholder(Float32, shape=[-1, size(training_targets,2)])
        
  # Network parameters
  push!(hidden_units,size(training_targets,2)) #create an output node that fits to the size of the targets
  activation_functions = Vector{Function}(size(hidden_units,1))
  activation_functions[1:end-1]=z->nn.dropout(nn.relu(z), keep_probability)
  activation_functions[end] = nn.sigmoid #Last function should be idenity as we need the logits  
    
  # create network 
  flag=0
  weight_export=Variable([1])
  Zs = [feature_columns]

  for (ii,(hlsize, actfun)) in enumerate(zip(hidden_units, activation_functions))
        Wii = get_variable("W_$ii"*randstring(4), [get_shape(Zs[end], 2), hlsize], Float32)
        bii = get_variable("b_$ii"*randstring(4), [hlsize], Float32)
        
        if((is_embedding==true) & (flag==0))
            Zii=Zs[end]*Wii 
        else
            Zii = actfun(Zs[end]*Wii + bii)
        end
        push!(Zs, Zii)
        
        if(flag==0)
            weight_export=Wii
            flag=1
        end
  end

  y=Zs[end]
  eps=1e-8
  cross_entropy = -reduce_mean(log(y+eps).*target_columns + log(1-y+eps).*(1-target_columns))
 
  features_batches, targets_batches = create_batches(training_examples, training_targets, steps, batch_size)
  
  # Standard Adam Optimizer
  my_optimizer=train.minimize(train.AdamOptimizer(learning_rate), cross_entropy)
 
  run(sess, global_variables_initializer())

    
  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  println("Training model...")
  println("LogLoss error (on validation data):")
  training_log_losses = []
  validation_log_losses = []
  for period in 1:periods  
    # Train the model, starting from the prior state.
   for i=1:steps_per_period
    features, labels = my_input_fn(features_batches, targets_batches, convert(Int,(period-1)*steps_per_period+i), batch_size)
    run(sess, my_optimizer, Dict(feature_columns=>construct_feature_columns(features), target_columns=>construct_feature_columns(labels)))
   end
    # Take a break and compute log loss.
    training_log_loss = run(sess, cross_entropy, Dict(feature_columns=> construct_feature_columns(training_examples), target_columns=>construct_feature_columns(training_targets)));    
    validation_log_loss = run(sess, cross_entropy, Dict(feature_columns=> construct_feature_columns(validation_examples), target_columns=>construct_feature_columns(validation_targets)));  
         
    # Occasionally print the current loss.
    println("  period ", period, ": ", training_log_loss)
           
    # Add the loss metrics from this period to our list.
    push!(training_log_losses, training_log_loss)
    push!(validation_log_losses, validation_log_loss)
  end      
        
        
  println("Model training finished.")
  
  # Calculate final predictions (not probabilities, as above).
  final_probabilities = run(sess, y, Dict(feature_columns=> validation_examples, target_columns=>validation_targets))
  final_predictions=0.0.*copy(final_probabilities)
  final_predictions=castto01(final_probabilities)
  
  accuracy = sklm.accuracy_score(validation_targets, final_predictions)
  println("Final accuracy (on validation data): ", accuracy)

  # Output a graph of loss metrics over periods.
  p1=plot(training_log_losses, label="training", title="LogLoss vs. Periods", ylabel="LogLoss", xlabel="Periods")
  p1=plot!(validation_log_losses, label="validation")
  
  return p1, y, final_probabilities, weight_export, feature_columns, target_columns
end

Out[11]:

train_nn_classification_model (generic function with 1 method)

In [21]:

sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
    0.003, #learning rate
    1000, #steps
    50, #batch_size
    [20, 20], #hidden_units
    false, #is_embedding
    1.0, # keep probability
    train_feature_mat,
    train_labels,
    test_features_mat,
    test_labels)

Training model...
LogLoss error (on validation data):
  period 1: 0.471395788925613
  period 2: 0.45770820644301585
  period 3: 0.44868498081679836
  period 4: 0.4486270877372173
  period 5: 0.44886359529693076
  period 6: 0.45785272663956894
  period 7: 0.4486596600689161
  period 8: 0.44640892834409557
  period 9: 0.44672501642012086
  period 10: 0.44597568632613904
Model training finished.

Out[21]:

(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.759592; 0.194817; … ; 0.910422; 0.592667], TensorFlow.Variables.Variable{Float32}(<Tensor W_1BDge:1 shape=(50, 20) dtype=Float32>, <Tensor W_1BDge/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 50) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)

Final accuracy (on validation data): 0.7854314172566903

In [22]:

plot(p1)

Out[22]:

In [23]:

evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
    vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ",  evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])

AUC on the validation set: [0.871963]
Accuracy on the validation set: [0.785431]

In [24]:

plot(p2)

Out[24]:

Task 3: Use an Embedding with a DNN Model

In this task, we’ll implement our DNN model using an embedding column. An embedding column takes sparse data as input and returns a lower-dimensional dense vector as output. We’ll add the embedding layer as the first layer in the hidden_units-vector, and set is_embedding to true.

NOTE: In practice, we might project to dimensions higher than 2, like 50 or 100. But for now, 2 dimensions is easy to visualize.

In [14]:

sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
    0.003, #learning rate
    1000, #steps
    50, #batch_size
    [2, 20, 20], #hidden_units
    true,
    1.0, # keep probability
    train_feature_mat,
    train_labels,
    test_features_mat,
    test_labels)

Training model...
LogLoss error (on validation data):
  period 1: 0.5663698194950431
  period 2: 0.45220806683037423
  period 3: 0.45205136369869175
  period 4: 0.44547315482684624
  period 5: 0.4489593760718842
  period 6: 0.4484996714455285
  period 7: 0.44584040864739305
  period 8: 0.4457339047966633
  period 9: 0.4464958611862487
  period 10: 0.4485086547636147
Model training finished.

Out[14]:

(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.736896; 0.157856; … ; 0.895887; 0.54718], TensorFlow.Variables.Variable{Float32}(<Tensor W_1fwgD:1 shape=(50, 2) dtype=Float32>, <Tensor W_1fwgD/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 50) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)

In [15]:

plot(p1)

Out[15]:

Final accuracy (on validation data): 0.7846313852554102

In [16]:

evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
    vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ",  evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])

AUC on the validation set: [0.873001]
Accuracy on the validation set: [0.784631]

In [17]:

plot(p2)

Out[17]:

Task 4: Examine the Embedding

Let’s now take a look at the actual embedding space, and see where the terms end up in it. Do the following:

Run the following code to see the embedding we trained in Task 3. Do things end up where you’d expect?
Re-train the model by rerunning the code in Task 3, and then run the embedding visualization below again. What stays the same? What changes?
Finally, re-train the model again using only 10 steps (which will yield a terrible model). Run the embedding visualization below again. What do you see now, and why?

In [18]:

xy_coord=run(sess, weight_export, Dict(feature_columns=> test_features_mat, target_columns=>test_labels))
p3=plot(title="Embedding Space", xlims=(minimum(xy_coord[:,1])-0.3, maximum(xy_coord[:,1])+0.3),  ylims=(minimum(xy_coord[:,2])-0.1, maximum(xy_coord[:,2]) +0.3)  )
for term_index=1:length(informative_terms)
    p3=annotate!(xy_coord[term_index,1], xy_coord[term_index,1], informative_terms[term_index] )
end
plot(p3)

Out[18]:

Task 5: Try to improve the model’s performance

See if you can refine the model to improve performance. A couple things you may want to try:

Changing hyperparameters, or using a different optimizer than Adam (you may only gain one or two accuracy percentage points following these strategies).
Adding additional terms to informative_terms. There’s a full vocabulary file with all 30,716 terms for this data set that you can use at: https://storage.googleapis.com/mledu-datasets/sparse-data-embedding/terms.txt You can pick out additional terms from this vocabulary file, or use the whole thing.

In the following code, we will import the whole vocabulary file and run the model with it.

In [30]:

vocabulary=Array{String}(0)
open("terms.txt") do file
    for ln in eachline(file)
        push!(vocabulary, ln)
    end
end

In [31]:

vocabulary

Out[31]:

30716-element Array{String,1}:
 "the"      
 "."        
 ","        
 "and"      
 "a"        
 "of"       
 "to"       
 "is"       
 "in"       
 "i"        
 "it"       
 "this"     
 "'"        
 ⋮          
 "soapbox"  
 "softening"
 "user's"   
 "od"       
 "potter's" 
 "renard"   
 "impacting"
 "pong"     
 "nobly"    
 "nicol"    
 "ff"       
 "MISSING"

We will now load the test and training features matrices from disk. Open and run the Conversion of Movie-review data to one-hot encoding-notebook to prepare the IMDB_fullmatrix_datacolumns.jld-file. The notebook can be found here.

In [48]:

using JLD
train_features_full=load("IMDB_fullmatrix_datacolumns.jld", "train_features_full")
test_features_full=load("IMDB_fullmatrix_datacolumns.jld", "test_features_full")

Out[48]:

24999×30716 Array{Float64,2}:
 1.0  1.0  1.0  1.0  1.0  0.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                        ⋮              ⋱       ⋮                        ⋮  
 0.0  1.0  1.0  0.0  1.0  0.0  0.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  1.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0

Now run the session with the full vocabulary file. Again, this will take a long time to finish. It assigns about 50GB of memory.

In [49]:

sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
    # TWEAK THESE VALUES TO SEE HOW MUCH YOU CAN IMPROVE THE RMSE
    0.003, #learning rate
    1000, #steps
    50, #batch_size
    [2, 20, 20], #hidden_units
    true,
    1.0, # keep probability
    train_features_full,
    train_labels,
    test_features_full,
    test_labels)

Training model...
LogLoss error (on validation data):
  period 1: 0.37313920231826037
  period 2: 0.26477444394028044
  period 3: 0.25086412221834486
  period 4: 0.2087771610933073
  period 5: 0.19335710666766756
  period 6: 0.16094674715105722
  period 7: 0.15969771663371266
  period 8: 0.14275753878385306
  period 9: 0.13488925137953908
  period 10: 0.11471432399972212
Model training finished.

Out[49]:

(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.0975639; 0.0719205; … ; 0.993713; 0.678265], TensorFlow.Variables.Variable{Float32}(<Tensor W_1tR5j:1 shape=(30716, 2) dtype=Float32>, <Tensor W_1tR5j/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 30716) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)

Final accuracy (on validation data): 0.8656746269850794

In [4]:

#plot(p1)

In [51]:

evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
    vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ",  evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])

AUC on the validation set: [0.939391]
Accuracy on the validation set: [0.865675]

In [3]:

#plot(p2)

Task 6: Try out sparse matrices

We will now convert the feature matrices from the previous step to sparse matrices and re-run our code. The sparse matrices take about 350MB of memory. The code for the NN will still convert the sparse matrix containing the data for the current batch to a full matrix, which leads to a memory requirement of about 35GB.

In [54]:

train_features_sparse=sparse(train_features_full)
test_features_sparse=sparse(test_features_full)

Out[54]:

24999×30716 SparseMatrixCSC{Float64,Int64} with 10780292 stored entries:
  [1    ,     1]  =  1.0
  [2    ,     1]  =  1.0
  [3    ,     1]  =  1.0
  [4    ,     1]  =  1.0
  [5    ,     1]  =  1.0
  [6    ,     1]  =  1.0
  [7    ,     1]  =  1.0
  [8    ,     1]  =  1.0
  [9    ,     1]  =  1.0
  [10   ,     1]  =  1.0
  ⋮
  [24977, 30715]  =  1.0
  [24978, 30715]  =  1.0
  [24979, 30715]  =  1.0
  [24981, 30715]  =  1.0
  [24982, 30715]  =  1.0
  [24983, 30715]  =  1.0
  [24984, 30715]  =  1.0
  [24989, 30715]  =  1.0
  [24990, 30715]  =  1.0
  [24993, 30715]  =  1.0
  [24996, 30715]  =  1.0

In [56]:

# For saving the data
#save("IMDB_sparsematrix_datacolumns.jld", "train_features_sparse", train_features_sparse, "test_features_sparse", test_features_sparse)

In [55]:

sess=Session(Graph())
p1, y, final_probabilities, weight_export, feature_columns, target_columns = train_nn_classification_model(
    0.003, #learning rate
    1000, #steps
    50, #batch_size
    [2, 20, 20], #hidden_units
    true,
    1.0, # keep probability
    train_features_sparse,
    train_labels,
    test_features_sparse,
    test_labels)

Training model...
LogLoss error (on validation data):
  period 1: 0.4264791202329761
  period 2: 0.27906764597962996
  period 3: 0.24588130824603802
  period 4: 0.24137786867062228
  period 5: 0.1982403807820223
  period 6: 0.195120145753206
  period 7: 0.15590201135616108
  period 8: 0.145190807054994
  period 9: 0.128827185543495
  period 10: 0.12583784552275887
Model training finished.

Out[55]:

(Plot{Plots.GRBackend() n=2}, <Tensor Sigmoid:1 shape=(?, 1) dtype=Float32>, Float32[0.302512; 0.50181; … ; 0.977541; 0.652478], TensorFlow.Variables.Variable{Float32}(<Tensor W_1gvhS:1 shape=(30716, 2) dtype=Float32>, <Tensor W_1gvhS/Assign:1 shape=unknown dtype=Float32>), <Tensor placeholder:1 shape=(?, 30716) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 1) dtype=Float32>)

Final accuracy (on validation data): 0.8696347853914157

In [2]:

#plot(p1)

In [57]:

evaluation_metrics=DataFrame()
false_positive_rate, true_positive_rate, thresholds = sklm.roc_curve(
    vec(construct_feature_columns(test_labels)), vec(final_probabilities))
evaluation_metrics[:auc]=sklm.roc_auc_score(construct_feature_columns(test_labels), vec(final_probabilities))
validation_predictions=castto01(final_probabilities);
evaluation_metrics[:accuracy]=accuracy = sklm.accuracy_score(test_labels, validation_predictions)

p2=plot(false_positive_rate, true_positive_rate, label="our model")
p2=plot!([0, 1], [0, 1], label="random classifier");
println("AUC on the validation set: ",  evaluation_metrics[:auc])
println("Accuracy on the validation set: ", evaluation_metrics[:accuracy])

AUC on the validation set: [0.940761]
Accuracy on the validation set: [0.869635]

In [1]:

#plot(p2)

A Final Word

We may have gotten a DNN solution with an embedding that was better than our original linear model, but the linear model was also pretty good and was quite a bit faster to train. Linear models train more quickly because they do not have nearly as many parameters to update or layers to backprop through.

In some applications, the speed of linear models may be a game changer, or linear models may be perfectly sufficient from a quality standpoint. In other areas, the additional model complexity and capacity provided by DNNs might be more important. When defining your model architecture, remember to explore your problem sufficiently so that you know which space you’re in.

Classifying Handwritten Digits with Neural Networks

By: Sören Dobberschütz

Re-posted from: https://tensorflowjulia.blogspot.com/2018/09/classifying-handwritten-digits-with.html

In this exercise, we look at the famous MNIST handwritten digit classification problem. Using the MNIST.jl package makes it easy to access the image samples from Julia. Similar to the logistic regression exercise, we use PyCall and scikit-learn‘s metrics for easy calculation of neural network accuracy and confusion matrices.

We will also visualize the first layer of the neural network to get an idea of how it “sees” handwritten digits. The part of the code that creates a 10×10 grid of plots is rather handwaving – if someone has an idea about how to properly set up programmatic generation and display of plots in Julia, I would be very interested.

The Jupyter notebook can be downloaded here.

This notebook is based on the file MNIST Digit Classification programming exercise, which is part of Google’s Machine Learning Crash Course.

In [0]:

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Classifying Handwritten Digits with Neural Networks¶

Learning Objectives:

Train both a linear model and a neural network to classify handwritten digits from the classic MNIST data set
Compare the performance of the linear and neural network classification models
Visualize the weights of a neural-network hidden layer

Our goal is to map each input image to the correct numeric digit. We will create a NN with a few hidden layers and a Softmax layer at the top to select the winning class.

Setup

First, let’s load the data set, import TensorFlow and other utilities, and load the data into a DataFrame. Note that this data is a sample of the original MNIST training data.

In [1]:

using Plots
using Distributions
gr()
using DataFrames
using TensorFlow
import CSV
import StatsBase
using PyCall
@pyimport sklearn.metrics as sklm
using Images
using Colors

sess=Session(Graph())

Out[1]:

Session(Ptr{Void} @0x0000000124e67ab0)

We use the MNIST.jl package for accessing the dataset. The functions for loading the data and creating batches follow its documentation.

In [10]:

using MNIST

mutable struct DataLoader
    cur_id::Int
    order::Vector{Int}
end

DataLoader() = DataLoader(1, shuffle(1:60000))
loader=DataLoader()

function next_batch(loader::DataLoader, batch_size)
    features = zeros(Float32, batch_size, 784)
    labels = zeros(Float32, batch_size, 10)
    for i in 1:batch_size
        features[i, :] = trainfeatures(loader.order[loader.cur_id])./255.0
        label = trainlabel(loader.order[loader.cur_id])
        labels[i, Int(label)+1] = 1.0
        loader.cur_id += 1
        if loader.cur_id > 60000
            loader.cur_id = 1
        end
    end
    features, labels
end

function load_test_set(N=10000)
    features = zeros(Float32, N, 784)
    labels = zeros(Float32, N, 10)
    for i in 1:N
        features[i, :] = testfeatures(i)./255.0
        label = testlabel(i)
        labels[i, Int(label)+1] = 1.0
    end
    features,labels
end

Out[10]:

load_test_set (generic function with 2 methods)

labels represents the label that a human rater has assigned for one handwritten digit. The ten digits 0-9 are each represented, with a unique class label for each possible digit. Thus, this is a multi-class classification problem with 10 classes.

The variable features contains the feature values, one per pixel for the 28×28=784 pixel values. The pixel values are on a gray scale in which 0 represents white, 255 represents black, and values between 0 and 255 represent shades of gray. Most of the pixel values are 0; you may want to take a minute to confirm that they aren’t all 0. For example, adjust the following text block to print out the features and labels for dataset 72.

In [4]:

trainfeatures(72)

Out[4]:

784-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮  
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

In [15]:

trainlabel(72)

Out[15]:

7.0

Now, let’s parse out the labels and features and look at a few examples. Show a random example and its corresponding label:

In [3]:

rand_number=rand(1:60000)
rand_example_features = trainfeatures(rand_number)
img=colorview(Gray,1.-reshape(rand_example_features, (28, 28)))
rand_example_label=trainlabel(rand_number)
println("Label: ",rand_example_label)
img

Out[3]:

Label: 1.0

In [4]:

p1=heatmap(flipdim(1.-reshape(rand_example_features, (28, 28)),1), legend=:none, c=:gray, title="Label: $rand_example_label")

Out[4]:

WARNING: gray is found in more than one library: cmocean, colorcet. Choosing cmocean

The following functions normalize the features and convert the targets to a one-hot encoding. For example, if the variable contains ‘1’ in column 5, then a human rater interpreted the handwritten character as the digit ‘6’.

In [5]:

function preprocess_features(data_range)
    examples = zeros(Float32, length(data_range), 784)
    for i in 1:length(data_range)
        examples[i, :] = testfeatures(i)./255.0
    end
    return examples
end


function preprocess_targets(data_range)
    targets = zeros(Float32, length(data_range), 10)
    for i in 1:length(data_range)
        label = testlabel(i)
        targets[i, Int(label)+1] = 1.0
    end
    return targets
end

Out[5]:

preprocess_targets (generic function with 1 method)

Let’s devide the first 10000 datasets into training and validation examples.

In [8]:

training_examples = preprocess_features(1:7500)
training_targets = preprocess_targets(1:7500)

validation_examples=preprocess_features(7501:10000)
validation_targets=preprocess_targets(7501:10000);

The following function converts the predicted labels (in one-hot encoding) back to a numerical label from 0 to 9.

In [6]:

function to1col(targets)
    reduced_targets=zeros(size(targets,1),1)
    for i=1:size(targets,1)
        reduced_targets[i]=sum( collect(0:size(targets,2)-1).*targets[i,:])  
    end
    return reduced_targets
end

Out[6]:

to1col (generic function with 1 method)

Task 1: Build a Linear Model for MNIST

First, let’s create a baseline model to compare against. You’ll notice that in addition to reporting accuracy, and plotting Log Loss over time, we also display a confusion matrix. The confusion matrix shows which classes were misclassified as other classes. Which digits get confused for each other? Also note that we track the model’s error using the log_loss function.

In [11]:

function train_linear_classification_model(
    learning_rate,
    steps,
    batch_size,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets)
  """Trains a linear classification model for the MNIST digits dataset.
  
  In addition to training, this function also prints training progress information,
  a plot of the training and validation loss over time, and a confusion
  matrix.
  
  Args:
    learning_rate: An `int`, the learning rate to use.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    training_examples: An `Array` containing the training features.
    training_targets: An `Array` containing the training labels.
    validation_examples: An `Array` containing the validation features.
    validation_targets: An `Array` containing the validation labels.
      
  Returns:
    p1: Plot of loss metrics
    p2: Plot of confusion matrix
  """

  periods = 10

  steps_per_period = steps / periods  
  
  # Create feature columns
  feature_columns = placeholder(Float32)
  target_columns = placeholder(Float32)
    
  # Create network
  W = Variable(zeros(Float32, 784, 10))
  b = Variable(zeros(Float32, 10)) 
  
  y = nn.softmax(feature_columns*W + b)
  cross_entropy = reduce_mean(-reduce_sum(target_columns .* log(y), axis=[2]))
    
  # Gradient decent with gradient clipping
  my_optimizer=(train.AdamOptimizer(learning_rate))
  gvs = train.compute_gradients(my_optimizer, cross_entropy)
  capped_gvs = [(clip_by_norm(grad, 5.), var) for (grad, var) in gvs]
  my_optimizer = train.apply_gradients(my_optimizer,capped_gvs)
    
  run(sess, global_variables_initializer())

    
  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  println("Training model...")
  println("LogLoss error (on validation data):")
  training_errors = []
  validation_errors = []
  for period in 1:periods
    for i=1:steps_per_period
        
    # Train the model, starting from the prior state.
    features_batches, targets_batches = next_batch(loader, batch_size)
    run(sess, my_optimizer, Dict(feature_columns=>features_batches, target_columns=>targets_batches))    
    end
  
    # Take a break and compute probabilities.
    training_predictions = run(sess, y, Dict(feature_columns=> training_examples, target_columns=>training_targets)) 
    validation_predictions = run(sess, y, Dict(feature_columns=> validation_examples, target_columns=>validation_targets))
    
    # Compute training and validation errors.
    training_log_loss = sklm.log_loss(training_targets, training_predictions)
    validation_log_loss = sklm.log_loss(validation_targets, validation_predictions)
    # Occasionally print the current loss.
    println("  period ", period, ": ",validation_log_loss)
    # Add the loss metrics from this period to our list.
    push!(training_errors, training_log_loss)
    push!(validation_errors, validation_log_loss)
  end      
        
        
  println("Model training finished.")
  
  # Calculate final predictions (not probabilities, as above).
  final_probabilities = run(sess, y, Dict(feature_columns=> validation_examples, target_columns=>validation_targets))
  
  final_predictions=0.0.*copy(final_probabilities)
  for i=1:size(final_predictions,1)
        final_predictions[i,indmax(final_probabilities[i,:])]=1.0
  end
  
  accuracy = sklm.accuracy_score(validation_targets, final_predictions)
  println("Final accuracy (on validation data): ", accuracy)

  # Output a graph of loss metrics over periods.
  p1=plot(training_errors, label="training", title="LogLoss vs. Periods", ylabel="LogLoss", xlabel="Periods")
  p1=plot!(validation_errors, label="validation")
  
  # Output a plot of the confusion matrix.
  cm = sklm.confusion_matrix(to1col(validation_targets), to1col(final_predictions))
  # Normalize the confusion matrix by row (i.e by the number of samples
  # in each class).
  cm_normalized=convert.(Float32,copy(cm))
  for i=1:size(cm,1)
     cm_normalized[i,:]=cm[i,:]./sum(cm[i,:])
  end
  p2 = heatmap(cm_normalized, c=:dense, title="Confusion Matrix", ylabel="True label", xlabel= "Predicted label", xticks=(1:10, 0:9), yticks=(1:10, 0:9))

  return p1, p2
end

Out[11]:

train_linear_classification_model (generic function with 1 method)

Spend 5 minutes seeing how well you can do on accuracy with a linear model of this form. For this exercise, limit yourself to experimenting with the hyperparameters for batch size, learning rate and steps.

In [12]:

p1, p2 = train_linear_classification_model(
    0.02,#learning rate
    100, #steps
    10, #batch_size
    training_examples,
    training_targets,
    validation_examples,
    validation_targets)

Training model...
LogLoss error (on validation data):
  period 1: 1.5077540435312957
  period 2: 1.0670072549042842
  period 3: 0.7679461688013705
  period 4: 0.8279036009749534
  period 5: 0.847797180938959
  period 6: 0.688936055092166
  period 7: 0.7574307274848022
  period 8: 0.7252071137057945
  period 9: 0.7101044422048004
  period 10: 0.6226804575660101
Model training finished.

Out[12]:

(Plot{Plots.GRBackend() n=2}, Plot{Plots.GRBackend() n=1})

Final accuracy (on validation data): 0.8336

In [13]:

plot(p1)

Out[13]:

In [14]:

plot(p2)

Out[14]:

Here is a set of parameters that should attain roughly 0.9 accuracy.

In [15]:

sess=Session(Graph())
p1, p2 = train_linear_classification_model(
    0.003,#learning rate
    1000, #steps
    30, #batch_size
    training_examples,
    training_targets,
    validation_examples,
    validation_targets)

Training model...
LogLoss error (on validation data):
  period 1: 0.6256736705787945
  period 2: 0.5339106926386972
  period 3: 0.47617202772979506
  period 4: 0.4398987464382371
  period 5: 0.42111407697942305
  period 6: 0.41976561313078276
  period 7: 0.41394242923204144
  period 8: 0.3934528665583277
  period 9: 0.3831627080338039

Out[15]:

(Plot{Plots.GRBackend() n=2}, Plot{Plots.GRBackend() n=1})

  period 10: 0.3910091915086631
Model training finished.
Final accuracy (on validation data): 0.8836

In [16]:

plot(p1)

Out[16]:

In [17]:

plot(p2)

Out[17]:

Task 2: Replace the Linear Classifier with a Neural Network

Replace the LinearClassifier above with a Neural Network and find a parameter combination that gives 0.95 or better accuracy.

You may wish to experiment with additional regularization methods, such as dropout.

The code below is almost identical to the original LinearClassifer training code, with the exception of the NN-specific configuration, such as the hyperparameter for hidden units.

In [18]:

function train_nn_classification_model(learning_rate,
                     steps, 
                     batch_size, 
                     hidden_units,
                     keep_probability,
                     training_examples, 
                     training_targets, 
                     validation_examples, 
                     validation_targets)
  """Trains a NN classification model for the MNIST digits dataset.
  
  In addition to training, this function also prints training progress information,
  a plot of the training and validation loss over time, and a confusion
  matrix.
  
  Args:
    learning_rate: An `int`, the learning rate to use.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    hidden_units: A vector describing the layout of the neural network.
    keep_probability: A `float`, the probability of keeping a node active during one training step.
    training_examples: An `Array` containing the training features.
    training_targets: An `Array` containing the training labels.
    validation_examples: An `Array` containing the validation features.
    validation_targets: An `Array` containing the validation labels.
      
  Returns:
    p1: Plot of loss metrics
    p2: Plot of confusion matrix
    y: Prediction layer of the NN.
    feature_columns: Feature column tensor of the NN.
    target_columns: Target column tensor of the NN.
    weight_export: Weights of the first layer of the NN.
  """
  
  periods = 10
  steps_per_period = steps / periods

  # Create feature columns.
  feature_columns = placeholder(Float32, shape=[-1, size(training_examples,2)])
  target_columns = placeholder(Float32, shape=[-1, size(training_targets,2)])
    
  # Network parameters
  push!(hidden_units,size(training_targets,2)) #create an output node that fits to the size of the targets
  activation_functions = Vector{Function}(size(hidden_units,1))
  activation_functions[1:end-1]=z->nn.dropout(nn.relu(z), keep_probability)
  activation_functions[end] = nn.softmax #Last function should be idenity as we need the logits  
    
  # create network
  flag=0
  weight_export=Variable([1])
  Zs = [feature_columns]
  for (ii,(hlsize, actfun)) in enumerate(zip(hidden_units, activation_functions))
        Wii = get_variable("W_$ii"*randstring(4), [get_shape(Zs[end], 2), hlsize], Float32)
        bii = get_variable("b_$ii"*randstring(4), [hlsize], Float32)
        Zii = actfun(Zs[end]*Wii + bii)
        push!(Zs, Zii)
        
        if(flag==0)
            weight_export=Wii
            flag=1
        end
  end

  y=Zs[end]
  cross_entropy = reduce_mean(-reduce_sum(target_columns .* log(y), axis=[2]))
 
  # Standard Adam Optimizer
  my_optimizer=train.minimize(train.AdamOptimizer(learning_rate), cross_entropy)

  run(sess, global_variables_initializer())

  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  println("Training model...")
  println("LogLoss error (on validation data):")
  training_errors = []
  validation_errors = []
  for period in 1:periods
    for i=1:steps_per_period
        
    # Train the model, starting from the prior state.
    features_batches, targets_batches = next_batch(loader, batch_size)
    run(sess, my_optimizer, Dict(feature_columns=>features_batches, target_columns=>targets_batches))    
    end
  
    # Take a break and compute probabilities.
    training_predictions = run(sess, y, Dict(feature_columns=> training_examples, target_columns=>training_targets)) 
    validation_predictions = run(sess, y, Dict(feature_columns=> validation_examples, target_columns=>validation_targets))
    
    # Compute training and validation errors.
    training_log_loss = sklm.log_loss(training_targets, training_predictions)
    validation_log_loss = sklm.log_loss(validation_targets, validation_predictions)
    # Occasionally print the current loss.
    println("  period ", period, ": ",validation_log_loss)
        # Add the loss metrics from this period to our list.
    push!(training_errors, training_log_loss)
    push!(validation_errors, validation_log_loss)
  end      
        
        
  println("Model training finished.")
  
  # Calculate final predictions (not probabilities, as above).
  final_probabilities = run(sess, y, Dict(feature_columns=> validation_examples, target_columns=>validation_targets))
  
  final_predictions=0.0.*copy(final_probabilities)
  for i=1:size(final_predictions,1)
        final_predictions[i,indmax(final_probabilities[i,:])]=1.0
  end

  accuracy = sklm.accuracy_score(validation_targets, final_predictions)
  println("Final accuracy (on validation data): ", accuracy)

  # Output a graph of loss metrics over periods.
  p1=plot(training_errors, label="training", title="LogLoss vs. Periods", ylabel="LogLoss", xlabel="Periods")
  p1=plot!(validation_errors, label="validation")
  
  # Output a plot of the confusion matrix.
  cm = sklm.confusion_matrix(to1col(validation_targets), to1col(final_predictions))
  # Normalize the confusion matrix by row (i.e by the number of samples
  # in each class).
  cm_normalized=convert.(Float32,copy(cm))
  for i=1:size(cm,1)
     cm_normalized[i,:]=cm[i,:]./sum(cm[i,:])
  end
    
  p2 = heatmap(cm_normalized, c=:dense, title="Confusion Matrix", ylabel="True label", xlabel= "Predicted label", xticks=(1:10, 0:9), yticks=(1:10, 0:9))

  return p1, p2, y, feature_columns, target_columns, weight_export  
end

Out[18]:

train_nn_classification_model (generic function with 1 method)

In [19]:

sess=Session(Graph())
p1, p2, y, feature_columns, target_columns, weight_export = train_nn_classification_model(
    # TWEAK THESE VALUES TO SEE HOW MUCH YOU CAN IMPROVE THE RMSE
    0.003, #learning rate
    1000, #steps
    30, #batch_size
    [100, 100], #hidden_units
    1.0, # keep probability
    training_examples,
    training_targets,
    validation_examples,
    validation_targets)

Training model...
LogLoss error (on validation data):
  period 1: 0.7570505327303662
  period 2: 0.6063774084079545
  period 3: 0.5113792795403802
  period 4: 0.396053814079678
  period 5: 0.3602445739727594
  period 6: 0.2950864450414929
  period 7: 0.2876376859507727
  period 8: 0.274247879869066
  period 9: 0.2485885503372391

Out[19]:

(Plot{Plots.GRBackend() n=2}, Plot{Plots.GRBackend() n=1}, <Tensor Softmax:1 shape=(?, 10) dtype=Float32>, <Tensor placeholder:1 shape=(?, 784) dtype=Float32>, <Tensor placeholder_2:1 shape=(?, 10) dtype=Float32>, TensorFlow.Variables.Variable{Float32}(<Tensor W_1Al09:1 shape=(784, 100) dtype=Float32>, <Tensor W_1Al09/Assign:1 shape=unknown dtype=Float32>))

  period 10: 0.2477123617914185
Model training finished.
Final accuracy (on validation data): 0.9232

In [20]:

plot(p1)

Out[20]:

In [21]:

plot(p2)

Out[21]:

Next, we verify the accuracy on a test set.

In [22]:

test_examples = preprocess_features(10001:13000)
test_targets = preprocess_targets(10001:13000);

In [23]:

test_probabilities = run(sess, y, Dict(feature_columns=> test_examples, target_columns=>test_targets))
  
test_predictions=0.0.*copy(test_probabilities)
for i=1:size(test_predictions,1)
    test_predictions[i,indmax(test_probabilities[i,:])]=1.0
end
  
accuracy = sklm.accuracy_score(test_targets, test_predictions)
println("Accuracy on test data: ", accuracy)

Accuracy on test data: 0.923

Task 3: Visualize the weights of the first hidden layer.

Let’s take a few minutes to dig into our neural network and see what it has learned by accessing the weights_export attribute of our model.

The input layer of our model has 784 weights corresponding to the 28×28 pixel input images. The first hidden layer will have 784×N weights where N is the number of nodes in that layer. We can turn those weights back into 28×28 images by reshaping each of the N 1×784 arrays of weights into N arrays of size 28×28.

Run the following cell to plot the weights. We construct a function that allows us to use a string as a variable name. This allows us to automatically name all plots. We then put together a string to display everything when evaluated.

In [28]:

function string_as_varname_function(s::AbstractString, v::Any)
   s = Symbol(s)
   @eval (($s) = ($v))
end

weights0 = run(sess, weight_export)

num_nodes=size(weights0,2)
num_row=convert(Int,ceil(num_nodes/10))
for i=1:num_nodes
    str_name=string("Heat",i)
    string_as_varname_function(str_name,   heatmap(reshape(weights0[:,i], (28,28)), c=:heat, legend=false, yticks=[], xticks=[] ) )
end

out_string="plot(Heat1"
for i=2:num_nodes-1
    out_string=string(out_string, ", Heat", i)
end
    out_string=string(out_string, ", Heat", num_nodes, ", layout=(num_row, 10), legend=false )")

eval(parse(out_string))

Out[28]:

Use the following line to have a closer look at individual plots.

In [26]:

plot(Heat98)

Out[26]:

The first hidden layer of the neural network should be modeling some pretty low level features, so visualizing the weights will probably just show some fuzzy blobs or possibly a few parts of digits. You may also see some neurons that are essentially noise — these are either unconverged or they are being ignored by higher layers.

It can be interesting to stop training at different numbers of iterations and see the effect.

Train the classifier for 10, 100 and respectively 1000 steps. Then run this visualization again.

What differences do you see visually for the different levels of convergence?

Improving Neural Net Performance

By: Sören Dobberschütz

Re-posted from: https://tensorflowjulia.blogspot.com/2018/09/improving-neural-net-performance.html

This is the last exercise that uses the California housing dataset. We investigate several possibilities of optimizing neural nets:

Different loss minimization algorithms
Linear scaling of features
Logarithmic scaling of features
Clipping of features
Z-score normalization
Thresholding of data

The Jupyter notebook can be downloaded here.

This notebook is based on the file Improving Neural Net Performance programming exercise, which is part of Google’s Machine Learning Crash Course.

In [0]:

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Improving Neural Net Performance

Learning Objective: Improve the performance of a neural network by normalizing features and applying various optimization algorithms

NOTE: The optimization methods described in this exercise are not specific to neural networks; they are effective means to improve most types of models.

Setup

First, we’ll load the data.

In [1]:

using Plots
using StatPlots
using Distributions
gr()
using DataFrames
using TensorFlow
import CSV
import StatsBase
using PyCall

sess=Session(Graph())
california_housing_dataframe = CSV.read("california_housing_train.csv", delim=",");
california_housing_dataframe = california_housing_dataframe[shuffle(1:size(california_housing_dataframe, 1)),:];

2018-09-03 17:02:50.066566: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA

In [2]:

function preprocess_features(california_housing_dataframe)
  """Prepares input features from California housing data set.

  Args:
    california_housing_dataframe: A DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
  """
  selected_features = california_housing_dataframe[
    [:latitude,
     :longitude,
     :housing_median_age,
     :total_rooms,
     :total_bedrooms,
     :population,
     :households,
     :median_income]]
  processed_features = selected_features
  # Create a synthetic feature.
  processed_features[:rooms_per_person] = (
    california_housing_dataframe[:total_rooms] ./
    california_housing_dataframe[:population])
  return processed_features
end
    
function preprocess_targets(california_housing_dataframe)
  """Prepares target features (i.e., labels) from California housing data set.

  Args:
    california_housing_dataframe: A DataFrame expected to contain data
      from the California housing data set.
  Returns:
    A DataFrame that contains the target feature.
  """
  output_targets = DataFrame()
  # Scale the target to be in units of thousands of dollars.
  output_targets[:median_house_value] = (
    california_housing_dataframe[:median_house_value] ./ 1000.0)
  return output_targets
end

Out[2]:

preprocess_targets (generic function with 1 method)

In [3]:

# Choose the first 12000 (out of 17000) examples for training.
training_examples = preprocess_features(head(california_housing_dataframe,12000))
training_targets = preprocess_targets(head(california_housing_dataframe,12000))

# Choose the last 5000 (out of 17000) examples for validation.
validation_examples = preprocess_features(tail(california_housing_dataframe,5000))
validation_targets = preprocess_targets(tail(california_housing_dataframe,5000))

# Double-check that we've done the right thing.
println("Training examples summary:")
describe(training_examples)
println("Validation examples summary:")
describe(validation_examples)

println("Training targets summary:")
describe(training_targets)
println("Validation targets summary:")
describe(validation_targets)

Training examples summary:

Out[3]:

	variable	mean	min	median	max	nunique	nmissing	eltype
1	median_house_value	210.168	25.0	182.35	500.001			Float64

Validation examples summary:
Training targets summary:
Validation targets summary:

Train the Neural Network

Next, we’ll set up the neural network similar to the previous exercise.

In [10]:

function construct_columns(input_features):
  """Construct the TensorFlow Feature Columns.

  Args:
    input_features: DataFrame of the numerical input features to use.
  Returns:
    A set of feature columns
  """ 
  out=convert(Array, input_features[:,:])
  return convert.(Float64,out) 
end

Out[10]:

construct_columns (generic function with 1 method)

In [4]:

function create_batches(features, targets, steps, batch_size=5, num_epochs=0)
  """Create batches.

  Args:
    features: Input features.
    targets: Target column.
    steps: Number of steps.
    batch_size: Batch size.
    num_epochs: Number of epochs, 0 will let TF automatically calculate the correct number
  Returns:
    An extended set of feature and target columns from which batches can be extracted.
  """  
    
    if(num_epochs==0)
        num_epochs=ceil(batch_size*steps/size(features,1))
    end
    
    names_features=names(features);
    names_targets=names(targets);
    
    features_batches=copy(features)
    target_batches=copy(targets)
    

    for i=1:num_epochs
        
        select=shuffle(1:size(features,1))
     
        if i==1
            features_batches=(features[select,:])
            target_batches=(targets[select,:])
        else
            
            append!(features_batches, features[select,:])
            append!(target_batches, targets[select,:])
        end
    end
    
    return features_batches, target_batches 
end


function next_batch(features_batches, targets_batches, batch_size, iter)
  """Next batch.

  Args:
    features_batches: Features batches from create_batches.
    targets_batches: Target batches from create_batches.
    batch_size: Batch size.
    iter: Number of the current iteration
  Returns:
    An extended set of feature and target columns from which batches can be extracted.
  """ 
    select=mod((iter-1)*batch_size+1, size(features_batches,1)):mod(iter*batch_size, size(features_batches,1));

    ds=features_batches[select,:];
    target=targets_batches[select,:];
    
    return ds, target
end

Out[4]:

next_batch (generic function with 1 method)

In [6]:

function my_input_fn(features_batches, targets_batches, iter, batch_size=5, shuffle_flag=1):
    """Prepares a batch of features and labels for model training.
  
    Args:
      features_batches: Features batches from create_batches.
      targets_batches: Target batches from create_batches.
      iter: Number of the current iteration
      batch_size: Batch size.
      shuffle_flag: Determines wether data is shuffled before being returned
    Returns:
      Tuple of (features, labels) for next data batch
    """  
                          
    # Construct a dataset, and configure batching/repeating.
    ds, target = next_batch(features_batches, targets_batches, batch_size, iter)
    
    # Shuffle the data, if specified.
    if shuffle_flag==1
      select=shuffle(1:size(ds, 1));
        ds = ds[select,:]
        target = target[select, :]
    end
    
    # Return the next batch of data.
    return ds, target
end

Out[6]:

my_input_fn (generic function with 3 methods)

Now we can set up the neural network itself.

In [14]:

function train_nn_regression_model(my_optimizer,
                     steps, 
                     batch_size, 
                     hidden_units,
                     keep_probability,
                     training_examples, 
                     training_targets, 
                     validation_examples, 
                     validation_targets)
  """Trains a neural network model of one feature.
  
  Args:
    my_optimizer: Optimizer function for the training step
    learning_rate: A `float`, the learning rate.
    steps: A non-zero `int`, the total number of training steps. A training step
      consists of a forward and backward pass using a single batch.
    batch_size: A non-zero `int`, the batch size.
    hidden_units: A vector describing the layout of the neural network
    keep_probability: A `float`, the probability of keeping a node active during one training step.
  Returns:
    p1: Plot of RMSE for the different periods
    training_rmse: Training RMSE values for the different periods
    validation_rmse: Validation RMSE values for the different periods
    
  """
  
  periods = 10
  steps_per_period = steps / periods

  # Create feature columns.
  feature_columns = placeholder(Float32, shape=[-1, size(construct_columns(training_examples),2)])
  target_columns = placeholder(Float32, shape=[-1, size(construct_columns(training_targets),2)])
  
  # Network parameters
  push!(hidden_units,size(training_targets,2)) #create an output node that fits to the size of the targets
  activation_functions = Vector{Function}(size(hidden_units,1))
  activation_functions[1:end-1]=z->nn.dropout(nn.relu(z), keep_probability)
  activation_functions[end] = identity #Last function should be idenity as we need the logits  
    
  # create network - professional template
  Zs = [feature_columns]
  for (ii,(hlsize, actfun)) in enumerate(zip(hidden_units, activation_functions))
        Wii = get_variable("W_$ii"*randstring(4), [get_shape(Zs[end], 2), hlsize], Float32)
        bii = get_variable("b_$ii"*randstring(4), [hlsize], Float32)
        Zii = actfun(Zs[end]*Wii + bii)
        push!(Zs, Zii)
  end
   
  y=Zs[end]
  loss=reduce_sum((target_columns - y).^2)
 
  features_batches, targets_batches = create_batches(training_examples, training_targets, steps, batch_size)
    
  # Optimizer setup with gradient clipping
  gvs = train.compute_gradients(my_optimizer, loss)
  capped_gvs = [(clip_by_norm(grad, 5.), var) for (grad, var) in gvs]
  my_optimizer = train.apply_gradients(my_optimizer,capped_gvs)
    
  run(sess, global_variables_initializer())
    
  # Train the model, but do so inside a loop so that we can periodically assess
  # loss metrics.
  println("Training model...")
  println("RMSE (on training data):")
  training_rmse = []
  validation_rmse=[]
  
  for period in 1:periods
    # Train the model, starting from the prior state.
   for i=1:steps_per_period
    features, labels = my_input_fn(features_batches, targets_batches, convert(Int,(period-1)*steps_per_period+i), batch_size)
    run(sess, my_optimizer, Dict(feature_columns=>construct_columns(features), target_columns=>construct_columns(labels)))
   end
    # Take a break and compute predictions.
    training_predictions = run(sess, y, Dict(feature_columns=> construct_columns(training_examples)));    
    validation_predictions = run(sess, y, Dict(feature_columns=> construct_columns(validation_examples)));  
                                   
    # Compute loss.
     training_mean_squared_error = mean((training_predictions- construct_columns(training_targets)).^2)
     training_root_mean_squared_error = sqrt(training_mean_squared_error)
     validation_mean_squared_error = mean((validation_predictions- construct_columns(validation_targets)).^2)
     validation_root_mean_squared_error = sqrt(validation_mean_squared_error)
    # Occasionally print the current loss.
    println("  period ", period, ": ", training_root_mean_squared_error)
    # Add the loss metrics from this period to our list.
    push!(training_rmse, training_root_mean_squared_error)
    push!(validation_rmse, validation_root_mean_squared_error)
 end
    
  println("Model training finished.")

  # Output a graph of loss metrics over periods.
  p1=plot(training_rmse, label="training", title="Root Mean Squared Error vs. Periods", ylabel="RMSE", xlabel="Periods")
  p1=plot!(validation_rmse, label="validation")
    
  #
  println("Final RMSE (on training data): ", training_rmse[end])
  println("Final RMSE (on validation data): ", validation_rmse[end])
    
  return  p1, training_rmse, validation_rmse
end

Out[14]:

train_nn_regression_model (generic function with 1 method)

Train the model with a Gradient Descent Optimizer and a learning rate of 0.0007.

In [11]:

p1, training_rmse, validation_rmse = train_nn_regression_model(
    train.GradientDescentOptimizer(0.0007), #optimizer & learning rate
    5000, #steps
    70, #batch_size
    [10, 10], #hidden_units
    1.0, # keep probability
    training_examples,
    training_targets,
    validation_examples,
    validation_targets)

Training model...
RMSE (on training data):
  period 1: 163.180295637483
  period 2: 161.26135156851018
  period 3: 152.5080762133199
  period 4: 131.01682893731694
  period 5: 104.81629292310197
  period 6: 101.90063143465281
  period 7: 103.65539145744539
  period 8: 99.97967678136483
  period 9: 99.5169919104292
  period 10: 99.85829500231807
Model training finished.

Out[11]:

(Plot{Plots.GRBackend() n=2}, Any[163.18, 161.261, 152.508, 131.017, 104.816, 101.901, 103.655, 99.9797, 99.517, 99.8583], Any[164.89, 162.075, 153.699, 132.176, 105.743, 102.463, 104.437, 100.265, 100.328, 100.597])

Final RMSE (on training data): 99.85829500231807
Final RMSE (on validation data): 100.59742834395213

In [12]:

plot(p1)

Out[12]:

Linear Scaling

It can be a good standard practice to normalize the inputs to fall within the range -1, 1. This helps SGD not get stuck taking steps that are too large in one dimension, or too small in another. Fans of numerical optimization may note that there’s a connection to the idea of using a preconditioner here.

In [13]:

function linear_scale(series)
  min_val = minimum(series)
  max_val = maximum(series)
  scale = (max_val - min_val) / 2.0
  return (series .- min_val) ./ scale .- 1.0
end

Out[13]:

linear_scale (generic function with 1 method)

Task 1: Normalize the Features Using Linear Scaling

Normalize the inputs to the scale -1, 1.

As a rule of thumb, NN’s train best when the input features are roughly on the same scale.

Sanity check your normalized data. (What would happen if you forgot to normalize one feature?)

Since normalization uses min and max, we have to ensure it’s done on the entire dataset at once.

We can do that here because all our data is in a single DataFrame. If we had multiple data sets, a good practice would be to derive the normalization parameters from the training set and apply those identically to the test set.

In [15]:

function normalize_linear_scale(examples_dataframe):
  """Returns a version of the input `DataFrame` that has all its features normalized linearly."""
  processed_features = DataFrame()
  processed_features[:latitude] = linear_scale(examples_dataframe[:latitude])
  processed_features[:longitude] = linear_scale(examples_dataframe[:longitude])
  processed_features[:housing_median_age] = linear_scale(examples_dataframe[:housing_median_age])
  processed_features[:total_rooms] = linear_scale(examples_dataframe[:total_rooms])
  processed_features[:total_bedrooms] = linear_scale(examples_dataframe[:total_bedrooms])
  processed_features[:population] = linear_scale(examples_dataframe[:population])
  processed_features[:households] = linear_scale(examples_dataframe[:households])
  processed_features[:median_income] = linear_scale(examples_dataframe[:median_income])
  processed_features[:rooms_per_person] = linear_scale(examples_dataframe[:rooms_per_person])
  return processed_features
end

normalized_dataframe = normalize_linear_scale(preprocess_features(california_housing_dataframe))
normalized_training_examples = head(normalized_dataframe, 12000)
normalized_validation_examples = tail(normalized_dataframe, 5000)

p1, graddescent_training_rmse, graddescent_validation_rmse = train_nn_regression_model(
    train.GradientDescentOptimizer(0.005),
    2000,
    50,
    [10, 10],
    1.0,
    normalized_training_examples,
    training_targets,
    normalized_validation_examples,
    validation_targets)

Training model...
RMSE (on training data):
  period 1: 116.09077765307714
  period 2: 106.39510919357569
  period 3: 92.2020458478069
  period 4: 78.05842296357487
  period 5: 75.76520735272948
  period 6: 74.19271740734389
  period 7: 72.9324235474891
  period 8: 72.26513417353931
  period 9: 71.69664884683169
  period 10: 71.22432996656671
Model training finished.

Out[15]:

(Plot{Plots.GRBackend() n=2}, Any[116.091, 106.395, 92.202, 78.0584, 75.7652, 74.1927, 72.9324, 72.2651, 71.6966, 71.2243], Any[117.94, 108.035, 93.02, 77.7788, 75.1039, 73.3773, 71.9785, 71.1964, 70.5865, 70.0878])

Final RMSE (on training data): 71.22432996656671
Final RMSE (on validation data): 70.08780674123477

In [16]:

describe(normalized_dataframe)

Out[16]:

	variable	mean	min	median	max	eltype
1	latitude	-0.344267	-1.0	-0.636557	1.0	Float64
2	longitude	-0.0462367	-1.0	0.167331	1.0	Float64
3	housing_median_age	0.0819354	-1.0	0.0980392	1.0	Float64
4	total_rooms	-0.860727	-1.0	-0.887966	1.0	Float64
5	total_bedrooms	-0.832895	-1.0	-0.865611	1.0	Float64
6	population	-0.920033	-1.0	-0.934752	1.0	Float64
7	households	-0.83548	-1.0	-0.865812	1.0	Float64
8	median_income	-0.533292	-1.0	-0.580047	1.0	Float64
9	rooms_per_person	-0.928886	-1.0	-0.930325	1.0	Float64

In [17]:

plot(p1)

Out[17]:

Task 2: Try a Different Optimizer

Use the Momentum and Adam optimizers and compare performance.

The Momentum optimizer is one alternative. The key insight of Momentum is that a gradient descent can oscillate heavily in case the sensitivity of the model to parameter changes is very different for different model parameters. So instead of just updating the weights and biases in the direction of reducing the loss for the current step, the optimizer combines it with the direction from the previous step. You can use Momentum by specifying MomentumOptimizer instead of GradientDescentOptimizer. Note that you need to give two parameters – a learning rate and a “momentum” – with Momentum.

For non-convex optimization problems, Adam is sometimes an efficient optimizer. To use Adam, invoke the train.AdamOptimizer method. This method takes several optional hyperparameters as arguments, but our solution only specifies one of these (learning_rate). In a production setting, you should specify and tune the optional hyperparameters carefully.

First, let’s try Momentum Optimizer.

In [42]:

p1, momentum_training_rmse, momentum_validation_rmse = train_nn_regression_model(
    train.MomentumOptimizer(0.005, 0.05),
    2000,
    50,
    [10, 10],
    1.0,
    normalized_training_examples,
    training_targets,
    normalized_validation_examples,
    validation_targets)

Training model...
RMSE (on training data):
  period 1: 112.6311447590545
  period 2: 108.05888663813701
  period 3: 100.13551755861181
  period 4: 85.68693847431287
  period 5: 82.32114201488704
  period 6: 78.33198134267947
  period 7: 76.201679958578
  period 8: 75.14959736130605
  period 9: 76.6816266464294
  period

Out[42]:

(Plot{Plots.GRBackend() n=2}, Any[112.631, 108.059, 100.136, 85.6869, 82.3211, 78.332, 76.2017, 75.1496, 76.6816, 74.2158], Any[114.764, 109.533, 101.738, 85.7742, 81.3485, 77.036, 74.8827, 73.7446, 75.1419, 72.7901])

10: 74.21582562782943
Model training finished.
Final RMSE (on training data): 74.21582562782943
Final RMSE (on validation data): 72.79005775397246

In [43]:

plot(p1)

Out[43]:

Now let’s try Adam.

In [52]:

p1, adam_training_rmse, adam_validation_rmse = train_nn_regression_model(
    train.AdamOptimizer(0.2),
    2000,
    50,
    [10, 10],
    1.0,
    normalized_training_examples,
    training_targets,
    normalized_validation_examples,
    validation_targets)

Training model...
RMSE (on training data):
  period 1: 72.64160867170764
  period 2: 71.12902983578199
  period 3: 77.11712739613068
  period 4: 68.69780346576317
  period 5: 76.85117566160234
  period 6: 74.97801908512282
  period 7: 74.08747095626799
  period 8: 89.26232409952414
  period 9: 67.50005522623385
  period

Out[52]:

(Plot{Plots.GRBackend() n=2}, Any[72.6416, 71.129, 77.1171, 68.6978, 76.8512, 74.978, 74.0875, 89.2623, 67.5001, 69.3121], Any[71.2033, 69.9634, 76.0729, 66.8816, 75.8678, 74.0505, 73.0449, 89.2644, 66.1359, 67.6034])

10: 69.3121128893884
Model training finished.
Final RMSE (on training data): 69.3121128893884
Final RMSE (on validation data): 67.60344861121533

In [53]:

plot(p1)

Out[53]:

Let’s print a graph of loss metrics side by side.

In [54]:

p2=plot(graddescent_training_rmse, label="Gradient descent training", ylabel="RMSE", xlabel="Periods", title="Root Mean Squared Error vs. Periods")
p2=plot!(graddescent_validation_rmse, label="Gradient descent validation")
p2=plot!(adam_training_rmse, label="Adam training")
p2=plot!(adam_validation_rmse, label="Adam validation")
p2=plot!(momentum_training_rmse, label="Momentum training")
p2=plot!(momentum_validation_rmse, label="Momentum validation")

Out[54]:

Task 3: Explore Alternate Normalization Methods

Try alternate normalizations for various features to further improve performance.

If you look closely at summary stats for your transformed data, you may notice that linear scaling some features leaves them clumped close to -1.

For example, many features have a median of -0.8 or so, rather than 0.0.

In [22]:

# I'd like a better solution to automate this, but all ideas for eval
# on quoted expressions failed :-()
hist1=histogram(normalized_training_examples[:latitude], bins=20,  title="latitude"  )
hist2=histogram(normalized_training_examples[:longitude], bins=20,  title="longitude"  )
hist3=histogram(normalized_training_examples[:housing_median_age], bins=20,  title="housing_median_age"  )
hist4=histogram(normalized_training_examples[:total_rooms], bins=20,  title="total_rooms"  )
hist5=histogram(normalized_training_examples[:total_bedrooms], bins=20,  title="total_bedrooms"  )
hist6=histogram(normalized_training_examples[:population], bins=20,  title="population"  )
hist7=histogram(normalized_training_examples[:households], bins=20,  title="households"  )
hist8=histogram(normalized_training_examples[:median_income], bins=20,  title="median_income"  )
hist9=histogram(normalized_training_examples[:rooms_per_person], bins=20,  title="rooms_per_person"  )

plot(hist1, hist2, hist3, hist4, hist5, hist6, hist7, hist8, hist9, layout=9, legend=false)

Out[22]:

We might be able to do better by choosing additional ways to transform these features.

For example, a log scaling might help some features. Or clipping extreme values may make the remainder of the scale more informative.

In [23]:

function log_normalize(series)
  return log.(series.+1.0)
end

function clip(series, clip_to_min, clip_to_max)
  return min.(max.(series, clip_to_min), clip_to_max)
end

function z_score_normalize(series)
  mean_val = mean(series)
  std_dv = std(series, mean=mean_val)
  return (series .- mean) ./ std_dv
end

function binary_threshold(series, threshold)
  return map(x->(x > treshold ? 1 : 0), series)
end

Out[23]:

binary_threshold (generic function with 1 method)

The block above contains a few additional possible normalization functions.

Note that if you normalize the target, you’ll need to un-normalize the predictions for loss metrics to be comparable.

These are only a few ways in which we could think about the data. Other transformations may work even better!

households, median_income and total_bedrooms all appear normally-distributed in a log space.

In [24]:

hist10=histogram(log_normalize(california_housing_dataframe[:households]), title="households")
hist11=histogram(log_normalize(california_housing_dataframe[:total_rooms]), title="total_rooms")
hist12=histogram(log_normalize(training_examples[:rooms_per_person]), title="rooms_per_person")
plot(hist10, hist11, hist12, layout=3, legend=false)

Out[24]:

latitude, longitude and housing_median_age would probably be better off just scaled linearly, as before.

population, total_rooms and rooms_per_person have a few extreme outliers. They seem too extreme for log normalization to help. So let’s clip them instead.

In [46]:

function normalize_df(examples_dataframe)
  """Returns a version of the input `DataFrame` that has all its features normalized."""
  processed_features = DataFrame()

  processed_features[:households] = log_normalize(examples_dataframe[:households])
  processed_features[:median_income] = log_normalize(examples_dataframe[:median_income])
  processed_features[:total_bedrooms] = log_normalize(examples_dataframe[:total_bedrooms])
  
  processed_features[:latitude] = linear_scale(examples_dataframe[:latitude])
  processed_features[:longitude] = linear_scale(examples_dataframe[:longitude])
  processed_features[:housing_median_age] = linear_scale(examples_dataframe[:housing_median_age])

  processed_features[:population] = linear_scale(clip(examples_dataframe[:population], 0, 5000))
  processed_features[:rooms_per_person] = linear_scale(clip(examples_dataframe[:rooms_per_person], 0, 5))
  processed_features[:total_rooms] = linear_scale(clip(examples_dataframe[:total_rooms], 0, 10000))

  return processed_features
end
    
normalized_dataframe = normalize_df(preprocess_features(california_housing_dataframe))
normalized_training_examples = head(normalized_dataframe,12000)
normalized_validation_examples = tail(normalized_dataframe,5000)

p1, adam_training_rmse, adam_validation_rmse = train_nn_regression_model(
    train.AdamOptimizer(0.15),
    2000,
    50,
    [10, 10],
    1.0,
    normalized_training_examples,
    training_targets,
    normalized_validation_examples,
    validation_targets)

Training model...
RMSE (on training data):
  period 1: 74.72096056179495
  period 2: 71.41889262056681
  period 3: 70.60752044614021
  period 4: 68.9509575179693
  period 5: 72.95804802579956
  period 6: 66.77946206351353
  period 7: 69.60194185199468
  period 8: 68.58383648972531
  period 9: 66.68706380224602
  period

Out[46]:

(Plot{Plots.GRBackend() n=2}, Any[74.721, 71.4189, 70.6075, 68.951, 72.958, 66.7795, 69.6019, 68.5838, 66.6871, 69.4884], Any[72.9878, 70.3855, 69.3093, 67.5559, 71.9222, 65.411, 68.4134, 67.2482, 65.4761, 68.2675])

10: 69.48840964307104
Model training finished.
Final RMSE (on training data): 69.48840964307104
Final RMSE (on validation data): 68.26751863022265

In [47]:

plot(p1)

Out[47]:

Optional Challenge: Use only Latitude and Longitude Features

Train a NN model that uses only latitude and longitude as features.

Real estate people are fond of saying that location is the only important feature in housing price. Let’s see if we can confirm this by training a model that uses only latitude and longitude as features.

This will only work well if our NN can learn complex nonlinearities from latitude and longitude.

NOTE: We may need a network structure that has more layers than were useful earlier in the exercise.

It’s a good idea to keep latitude and longitude normalized:

In [35]:

function location_location_location(examples_dataframe)
  """Returns a version of the input `DataFrame` that keeps only the latitude and longitude."""
  processed_features = DataFrame()
  processed_features[:latitude] = linear_scale(examples_dataframe[:latitude])
  processed_features[:longitude] = linear_scale(examples_dataframe[:longitude])
  return processed_features
end

lll_dataframe = location_location_location(preprocess_features(california_housing_dataframe))
lll_training_examples = head(lll_dataframe,12000)
lll_validation_examples = tail(lll_dataframe,5000)

p1, lll_training_rmse, lll_validation_rmse = train_nn_regression_model(
    train.AdamOptimizer(0.15),
    500,
    100,
    [10, 10, 5, 5],
    1.0,
    lll_training_examples,
    training_targets,
    lll_validation_examples,
    validation_targets)

Training model...
RMSE (on training data):
  period 1: 114.70454963731467
  period 2: 103.98212569567914
  period 3: 105.269708371533
  period 4: 99.07570050503281
  period 5: 109.85984129891541
  period 6: 99.30679344927408
  period 7: 98.08193175407696
  period 8: 98.14540308728282
  period 9: 107.40972986461607
  period

Out[35]:

(Plot{Plots.GRBackend() n=2}, Any[114.705, 103.982, 105.27, 99.0757, 109.86, 99.3068, 98.0819, 98.1454, 107.41, 103.183], Any[117.767, 106.149, 107.667, 100.831, 110.271, 101.503, 99.7394, 99.7085, 108.069, 105.764])

10: 103.18311789130752
Model training finished.
Final RMSE (on training data): 103.18311789130752
Final RMSE (on validation data): 105.76414082474946

In [36]:

plot(p1)

Out[36]:

This isn’t too bad for just two features. Of course, property values can still vary significantly within short distances.

In [ ]:

#EOF

juliabloggers.com

A Julia Language Blog Aggregator

Tag Archives: Neural Nets

Intro to Sparse Data and Embeddings

Intro to Sparse Data and Embeddings

Setup

Building a Sentiment Analysis Model

Building the Input Pipeline

Task 1: Use a Linear Model with Sparse Inputs and an Explicit Vocabulary

Task 2: Use a Deep Neural Network (DNN) Model

Task 3: Use an Embedding with a DNN Model

Task 4: Examine the Embedding

Task 5: Try to improve the model’s performance

Task 6: Try out sparse matrices

A Final Word

Classifying Handwritten Digits with Neural Networks

Classifying Handwritten Digits with Neural Networks¶

Setup

Task 1: Build a Linear Model for MNIST

Task 2: Replace the Linear Classifier with a Neural Network

Task 3: Visualize the weights of the first hidden layer.

Improving Neural Net Performance

Improving Neural Net Performance

Setup

Train the Neural Network

Linear Scaling

Task 1: Normalize the Features Using Linear Scaling

Task 2: Try a Different Optimizer

Task 3: Explore Alternate Normalization Methods

Optional Challenge: Use only Latitude and Longitude Features