In this tutorial you will learn how to implement a simple feed-forward neural network in Ruby to classify handwritten digits. We will focus on the implementation and skip over the theory. If you need a better foundational understanding, I suggest you go through the neural network videos of the Stanford Machine Learning course first.
We will use our model to compete in the Kaggle digit recognizer competition. The goal of the competition is to classify images of single handwritten digits from the famous MNIST dataset.
The code created in this tutorial will train on the train.csv file and be evaluated against the test.csv file, both provided by Kaggle.
Below is a visualization of the training data. The script that generated this collage of randomly selected digits from train.csv is available at the ruby-lab repo.
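If you want to produce something similar yourself, here is a minimal sketch of rendering a single 28 x 28 digit to a PNG with chunky_png. This is not the repo's collage script, just an illustration of the idea; pixels is assumed to be an array of 784 grayscale values taken from one row of train.csv.
require 'chunky_png'

# render one 28 x 28 digit (784 grayscale values, 0-255) to a PNG file
def save_digit_png(pixels, path)
  image = ChunkyPNG::Image.new(28, 28, ChunkyPNG::Color::WHITE)
  pixels.each_with_index do |value, i|
    x, y = i % 28, i / 28
    # invert so the digit is drawn dark on a white background
    image[x, y] = ChunkyPNG::Color.grayscale(255 - value)
  end
  image.save(path)
end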
Quick Start
If you are impatient like me and want to run the code first and then figure out how it works, you can get the code along with the data by cloning the tutorial’s GitHub repo.
git clone https://github.com/Henry-Chinner/ruby-lab-code.git
Or download the zip.
Then install the following gems…
nmatrix: a linear algebra library which we will use for matrix multiplication.
fastest-csv: used to load the training data from csv files.
chunky_png: optional, needed only if you want to visualize the data.
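If you are installing them directly with RubyGems rather than through a Gemfile, something like the following should work; nmatrix is a native extension, so it may take a moment to compile.
gem install nmatrix
gem install fastest-csv
gem install chunky_png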
Unzip the 2 files in the /data folder.
cd into the /src folder of the code that you just downloaded.
cd ruby-lab-code
cd simple_neural_network
cd src
And start training the network with…
ruby neural_net.rb
You should see training information being printed out on your console. You can kill the process at any time, or wait for it to finish learning. Once completed, the trained network weights will be saved to the /data folder.
Under the Hood
The rest of the tutorial will cover the internals of neural_net.rb.
Network Architecture
neural_net.rb implements a simple 3-layer feed-forward architecture. Since the dataset consists of images sized 28 x 28 pixels, the input vector’s size is 785: 784 dimensions for the pixels and 1 for the bias unit. The network has 300 units in the hidden layer (adjustable), and the output layer contains 10 units to represent the probability distribution over the label predictions.
The default activation functions for the hidden and output layers are tanh and softmax, respectively. The input layer is scaled down by dividing each element by the maximum pixel value, 255.
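To make the two activation functions concrete, here is a plain-Ruby sketch of tanh and softmax operating on an ordinary array of sums. The repo's activation_function method works on NMatrix objects, so the details there will differ.
# element-wise tanh over a vector of sums
def tanh_activation(values)
  values.map { |v| Math.tanh(v) }
end

# softmax turns the output sums into a probability distribution
def softmax_activation(values)
  max = values.max
  exps = values.map { |v| Math.exp(v - max) } # subtract the max for numerical stability
  total = exps.reduce(:+)
  exps.map { |e| e / total }
end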
The method of training will be backpropagation via stochastic gradient descent.
Classes
neural_net.rb handles data from the csv files through data_loader.rb.
data_loader.rb contains 2 classes, DataTable and Observation.
When DataTable is initialized, it loads a csv file into an array of Observation objects. The content of each line of the csv file is transformed into an Observation object which is aware of what the label and features are.
The idea of DataTable is to sample random observations from the dataset as follows.
require_relative 'data_loader'
# the first column in the csv file contains the label, which translates to a label_index of 0
table = DataTable.new({:file => '../data/train.csv' , :label_index => 0})
observation = table.sample
features = observation.features
label = observation.label
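For reference, an Observation does not need to be anything fancy. A minimal sketch could look like the following, assuming it simply splits a parsed csv row into a label and a features array; the repo's data_loader.rb may differ in detail.
class Observation
  attr_reader :label, :features

  def initialize(row, label_index)
    @label = row[label_index].to_i
    # every column except the label column is treated as a pixel feature
    @features = row.each_with_index.reject { |_, i| i == label_index }.map { |v, _| v.to_i }
  end
end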
NeuralNet is initialized in one of two modes, ‘train’ or ‘eval’. This option is passed as an argument when launching neural_net.rb from the terminal with
ruby neural_net.rb --mode 'train'
or
ruby neural_net.rb --mode 'eval'
Initializing with train loads the csv (or a cache of the DataTable object) and invokes the train method. eval invokes the create_test_submission method, which runs through test.csv and makes label predictions based on the model saved to disk earlier by the train method.
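The mode switch itself is ordinary command line parsing. Here is a rough sketch of how it could be wired up with Ruby's standard OptionParser; the option names come from this tutorial, but the repo's actual parsing code and the NeuralNet.new call shown here are assumptions.
require 'optparse'

options = { :mode => 'train', :hidden_nodes => 300 }
OptionParser.new do |opts|
  opts.on('--mode MODE', 'train or eval') { |m| options[:mode] = m }
  opts.on('--hidden_nodes N', Integer, 'number of hidden units') { |n| options[:hidden_nodes] = n }
end.parse!

net = NeuralNet.new(options)
options[:mode] == 'train' ? net.train : net.create_test_submission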
Main Methods
The NeuralNet object runs on 3 main methods: train, forward and backprop. Next up we will look at each of these methods.
train
train initializes fresh weights every time it is called. It then enters an infinite loop, getting random samples from the training dataset and applying a forward pass and a backward pass to each one to get the weight updates.
sample -> forward -> backprop -> repeat
After the training conditions have been met, the train method breaks out of the loop and the trained weights are saved to disk.
def train
  puts "Entered Training"
  i = 0
  start_time = Time.now
  initialize_new_weights
  loop do
    # forward pass in the network with a random observation from @dt.sample.
    # eval_or_train is passed as 'train' because the forward method has to pass
    # its results to the backprop method. The backprop method will update the weights
    forward(@dt.sample,{:eval_or_train => 'train'})
    ave_error_history = running_average(1000,@error_history)
    ave_error_history_5000 = running_average(5000,@error_history)
    ave_classification_history = running_average(1000,@classification_history)
    ave_classification_history_5000 = running_average(5000,@classification_history)
    ratio = (ave_classification_history / ave_classification_history_5000)
    puts "Running Average Error (1000) => #{ave_error_history}"
    puts "Running Average Error (5000) => #{ave_error_history_5000}"
    puts "Running Average Classification (1000) => #{ave_classification_history}"
    puts "Running Average Classification (5000) => #{ave_classification_history_5000}"
    puts "Classification Running Average Ratio => #{ratio}"
    puts "Iteration = #{i}"
    puts "---"
    if ratio < 1.0 and i > 60000
      finish_time = Time.now
      File.open('../data/w1.txt','w'){|f| f << Marshal.dump(@w1.to_a)}
      File.open('../data/w2.txt','w'){|f| f << Marshal.dump(@w2.to_a)}
      puts "Total training time was: #{(finish_time - start_time).round(0)} sec"
      break
    end
    i += 1
  end
end
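The running_average helper used above is not shown in this tutorial, but all it needs to do is average the last n entries of a history array. A minimal sketch (the repo's version may differ):
def running_average(window, history)
  recent = history.last(window)
  return 0.0 if recent.empty?
  recent.reduce(:+) / recent.size.to_f
end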
forward
The forward method consists of a series of matrix operations. The goal of the forward pass is to propagate the input vector forward to end up with a probability distribution over the ten digit classes.
def forward(observation,opts = {})
  # convert the features array into an NMatrix matrix and divide every element by 255.
  # the division scales down the input. The input vector is initialized with size
  # 1 bigger than @input_size. This is to accommodate the bias term
  a1 = observation.features.flatten.to_nm([1,@input_size + 1]) / 255.0
  # set the bias term equal to 1
  # NMatrix 2-dimensional matrices can be accessed via [row,column]
  a1[0,@input_size] = 1.0
  # pass the product of the input values and the arc weights forward
  # and sum the products up at each node
  z2 = a1.dot(@w1)
  # apply the activation function to the sum vector element-wise
  a2 = activation_function(z2,@hidden_func)
  # resize the hidden layer to add the bias unit
  a2_with_bias = NMatrix.zeroes([1,@hidden_nodes+1])
  a2_with_bias[0,0..@hidden_nodes] = a2
  a2_with_bias[0,@hidden_nodes] = 1.0
  # z3 = a2 x @w2, propagating the hidden layer forward to get the sums in the output layer
  z3 = a2_with_bias.dot(@w2)
  # softmax activation function in the output layer
  a3 = activation_function(z3,@output_func)
  # if in training mode, pass the values of the layers to backprop.
  # otherwise return the prediction of the output layer
  if opts[:eval_or_train] == 'train'
    backprop(a1,a2_with_bias,z2,z3,a3,observation.label)
  elsif opts[:eval_or_train] == 'eval'
    return a3.each_with_index.max[1]
  end
end
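The eval path mentioned earlier uses forward in exactly this way: create_test_submission loops over test.csv, asks forward for a prediction, and writes a submission file. A rough sketch follows, using Ruby's standard CSV library rather than fastest-csv for brevity; the file names and the TestRow helper are assumptions, not the repo's exact code.
require 'csv'

# a tiny stand-in for Observation, since test rows have no label column
TestRow = Struct.new(:features, :label)

def create_test_submission
  CSV.open('../data/submission.csv', 'w') do |out|
    out << ['ImageId', 'Label'] # the header Kaggle expects
    CSV.foreach('../data/test.csv').with_index do |row, i|
      next if i.zero? # skip the csv header line
      # test rows contain only the 784 pixel values
      observation = TestRow.new(row.map(&:to_i), nil)
      prediction = forward(observation, :eval_or_train => 'eval')
      out << [i, prediction] # ImageId is 1-based in the submission file
    end
  end
end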
backprop
backprop is the final piece in the puzzle. It receives all its parameters from forward and then makes the weight adjustments that will result in a better classification rate. How the weight update works exactly is outside the scope of this tutorial.
def backprop(a1,a2_with_bias,z2,z3,a3,label)
  # initialize the output vector with zeroes
  y = NMatrix.zeroes([1,10])
  # set the element for the label from the data to 1.
  # only 1 element can be 1 at a time as the classes
  # are mutually exclusive
  y[0,label] = 1.0
  # derivative of the loss function: the difference between the predicted
  # values and the true value
  d3 = -(y - a3)
  # the derivative d3 is a good enough measure to
  # see if the cost is decreasing, so we append it to
  # the error history
  @error_history << d3.transpose.abs.sum[0]
  # add 1 to the classification history if the prediction
  # is correct, otherwise zero
  @classification_history << (a3.each_with_index.max[1] == label ? 1.0 : 0.0)
  # derivative with the same size as the hidden layer. The range [] operator
  # excludes the bias node. No error is passed to the bias node.
  d2 = @w2.dot( d3.transpose )[0..(@hidden_nodes-1)] * derivative(z2.transpose,@hidden_func)
  # matrix with dimensions equal to @w1's dimensions.
  # each element contains the gradient of the weight
  # with respect to the cost function. If the weights
  # are reduced by a small fraction of this value the cost function
  # will go down
  grad1 = d2.dot(a1)
  # same for @w2
  grad2 = d3.transpose.dot(a2_with_bias)
  # update the weight matrices. The first layer is updated
  # by a factor of 10 less than the second layer, for numerical
  # stability. Big weight changes -> big weights -> big sums -> saturated neurons
  @w1 = @w1 - grad1.transpose * @alpha * 0.1
  @w2 = @w2 - grad2.transpose * @alpha
end
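The derivative helper used above is also not shown here. For the tanh hidden layer it is the standard tanh derivative, 1 - tanh(z)^2, applied element-wise. A plain-Ruby sketch over an array of sums (the repo's version works on NMatrix objects):
# derivative of tanh, applied element-wise to the pre-activation sums
def tanh_derivative(values)
  values.map { |z| 1.0 - Math.tanh(z)**2 }
end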
Conclusion
Now that you have a better idea of the internals, go back to the Quick Start section and train a network yourself. I went ahead and made a submission to Kaggle based on this tutorial’s implementation and achieved a classification rate of 95%, which is not bad for a standard network.
You can create your own submission by running…
ruby neural_net.rb --mode 'train'
ruby neural_net.rb --mode 'eval'
The submission csv file will be in your data folder.
You can play around with different hidden node sizes when training by passing the following options…
ruby neural_net.rb --mode train --hidden_nodes x
Fewer weights make for faster training, so if you get frustrated with the speed, stop and start again with fewer hidden units.
In follow-up tutorials we will try to get a better classification rate and also speed up the execution of the program.