Simple Neural Network implementation in Ruby

via the nmatrix gem.

Posted by Henry Chinner on March 18, 2015

In this tutorial you will learn how to implement a simple feed-forward neural network in Ruby to classify handwritten digits. We will focus on the implementation and skip over the theory. If you need a better foundational understanding, I suggest you go through the neural network videos of the Stanford Machine Learning course first.

We will use our model to compete in the Kaggle digit recognizer competition. The goal of the competition is to classify images of single handwritten digits from the famous MNIST dataset.

The code created in this tutorial will train on the train.csv file and be evaluated against the test.csv file, both provided by Kaggle.

Below is a visualization of the training data. The script that generated this collage of randomly selected digits from train.csv is available at the ruby-lab repo.

MNIST Dataset random sample

Quick Start

If you are impatient like me, you will want to run the code first and then figure out how it works. You can get the code along with the data by cloning the tutorial’s GitHub repo.

git clone

Or download the zip.

Then install the following gems:

  • nmatrix: a linear algebra library, which we will use for matrix multiplication.
  • fastest-csv: loads the training data from csv files.
  • chunky_png: optional, only needed if you want to visualize the data.

Unzip the 2 files in the /data folder.

cd into the /src folder of the code that you just downloaded.

cd ruby-lab-code 
cd simple_neural_network 
cd src

And start training the network with…

ruby neural_net.rb

You should see training information being printed out on your console. You can kill the process at any time, or wait for it to finish learning. Once completed, the trained network weights will be saved to the /data folder.
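The train method shown later in this tutorial writes the weights with Ruby's Marshal after converting each matrix to nested arrays. As a rough sketch of that round trip, the snippet below saves and reloads a small array. The helper names save_weights and load_weights are made up for this example and are not part of neural_net.rb.

```ruby
require 'tmpdir'

# Hypothetical helpers, not taken from neural_net.rb.
# Serialize a nested array of weights to a file in binary mode.
def save_weights(path, weights)
  File.open(path, 'wb') { |f| f << Marshal.dump(weights) }
end

# Read the file back and rebuild the nested array.
def load_weights(path)
  File.open(path, 'rb') { |f| Marshal.load(f.read) }
end

# A tiny stand-in for a weight matrix that has already been converted with to_a.
weights = [[0.1, -0.2], [0.3, 0.4]]
restored = nil

# Use a temporary directory so the example leaves nothing behind.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'w1.txt')
  save_weights(path, weights)
  restored = load_weights(path)
end

puts "round trip ok: #{restored == weights}"
```

Marshal output is binary, so the sketch opens the file with 'wb' and 'rb' even though the file name ends in .txt.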

Under the Hood

The rest of the tutorial covers the internals of neural_net.rb.

Network Architecture

neural_net.rb implements a simple 3-layer feed-forward architecture. Since the dataset consists of images sized 28 x 28 pixels, the input vector has 785 elements: 784 for the pixels and 1 for the bias unit. The network has 300 units in the hidden layer (adjustable), and the output layer contains 10 units that represent the probability distribution over the label predictions.
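To make these sizes concrete, here is a small sketch that works out the shapes of the two weight matrices and the total number of weights. The variable names are illustrative, and the row/column orientation is an assumption based on how the forward method multiplies a row vector by each weight matrix.

```ruby
# Layer sizes as described in the text; the variable names are illustrative.
input_size   = 28 * 28   # 784 pixels per image
hidden_nodes = 300       # adjustable
output_size  = 10        # one unit per digit

# Each weight matrix gets one extra row for the bias unit of the layer feeding it.
# Orientation (rows = source units, columns = target units) is an assumption.
w1_shape = [input_size + 1, hidden_nodes]    # 785 x 300
w2_shape = [hidden_nodes + 1, output_size]   # 301 x 10

total_weights = w1_shape.reduce(:*) + w2_shape.reduce(:*)

puts "w1: #{w1_shape.join(' x ')}"
puts "w2: #{w2_shape.join(' x ')}"
puts "total weights: #{total_weights}"
```

Almost all of the weights sit between the input and hidden layers, which is why changing the number of hidden units has such a large effect on training speed.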

Digit Recognizer Neural Network Architecture

The default activation functions for the hidden and output layers are tanh and softmax, respectively. The input layer is scaled down by dividing each element by the maximum pixel value, 255.
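For readers who have not met these functions before, here is a minimal plain-Ruby sketch of the input scaling, tanh and softmax. It uses ordinary arrays rather than NMatrix, and the function names are chosen for this example only; the tutorial's own activation_function method may be written differently.

```ruby
# Scale raw pixel intensities (0..255) into the range 0..1.
def scale_pixels(pixels)
  pixels.map { |p| p / 255.0 }
end

# tanh applied element-wise, as used in the hidden layer.
def tanh_activation(z)
  z.map { |v| Math.tanh(v) }
end

# Softmax turns a vector of sums into a probability distribution.
# Subtracting the maximum first avoids overflow in Math.exp.
def softmax(z)
  max = z.max
  exps = z.map { |v| Math.exp(v - max) }
  total = exps.sum
  exps.map { |e| e / total }
end

probs = softmax([1.0, 2.0, 3.0])
puts "probabilities: #{probs.inspect}"
puts "sum: #{probs.sum}"
puts "most likely index: #{probs.each_with_index.max[1]}"
```

The largest input always receives the largest probability, so taking the index of the maximum gives the predicted digit.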

The method of training will be backpropagation via stochastic gradient descent.


neural_net.rb handles data from the csv files through data_loader.rb.

data_loader.rb contains 2 classes: DataTable and Observation.

When DataTable is initialized, it loads a csv file into an array of Observation objects. Each line of the csv file is transformed into an Observation object, which knows which value is the label and which values are the features.

The idea of DataTable is to sample random observations from the dataset as follows.

require_relative 'data_loader'
#the first column in the csv file contains the label, which translates to a label_index of 0
table = DataTable.new({:file => '../data/train.csv', :label_index => 0})

observation = table.sample 

features = observation.features 
label = observation.label
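If you are curious what such a loader could look like, below is a stripped-down sketch. SketchDataTable and SketchObservation are hypothetical stand-ins, not the classes from data_loader.rb, and to keep the example self-contained they read rows from an array of strings instead of a csv file.

```ruby
# Hypothetical stand-ins for the classes in data_loader.rb;
# the real implementations may differ.
class SketchObservation
  attr_reader :label, :features

  def initialize(values, label_index)
    values = values.dup
    @label = values.delete_at(label_index)  # pull the label out of the row
    @features = values                      # everything else is a feature
  end
end

class SketchDataTable
  attr_reader :observations

  # lines: csv rows as strings, with the header row already removed
  # (an assumption made for this sketch).
  def initialize(lines, label_index)
    @observations = lines.map do |line|
      SketchObservation.new(line.strip.split(',').map(&:to_i), label_index)
    end
  end

  # Return one observation chosen uniformly at random.
  def sample
    @observations.sample
  end
end

table = SketchDataTable.new(["5,0,128,255", "3,12,0,7"], 0)
observation = table.sample
puts "label=#{observation.label} features=#{observation.features.inspect}"
```

Sampling with replacement like this is what makes the training loop stochastic: every iteration sees one randomly chosen digit.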

NeuralNet is initialized in one of two modes, ‘train’ or ‘eval’. This option is passed as an argument when launching neural_net.rb from the terminal with

ruby neural_net.rb --mode 'train'


ruby neural_net.rb --mode 'eval'

Initializing with train loads the csv file, or a cached copy of the DataTable object, and invokes the train method. Initializing with eval invokes the create_test_submission method, which runs through test.csv and makes label predictions based on the model saved to disk earlier by the train method.
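One way to read flags like these is Ruby's built-in OptionParser. The sketch below is only an illustration of the idea and may not match how neural_net.rb parses its arguments. The parse_options name and the default values are assumptions, chosen to match the defaults described in this tutorial.

```ruby
require 'optparse'

# Parse the command-line flags mentioned in this tutorial.
# The defaults are assumptions for illustration.
def parse_options(argv)
  opts = { mode: 'train', hidden_nodes: 300 }
  OptionParser.new do |parser|
    # Only 'train' and 'eval' are accepted as modes.
    parser.on('--mode MODE', %w[train eval], 'train or eval') { |m| opts[:mode] = m }
    # Integer makes OptionParser convert and validate the value.
    parser.on('--hidden_nodes N', Integer, 'units in the hidden layer') { |n| opts[:hidden_nodes] = n }
  end.parse!(argv.dup)
  opts
end

opts = parse_options(%w[--mode eval --hidden_nodes 100])
puts opts.inspect
```

In a real script you would pass ARGV instead of a literal array.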

Main Methods

The NeuralNet object runs on 3 main methods: train, forward and backprop. Next we will look at each of these methods.


train initializes fresh weights every time it is called. It then enters a loop, drawing random samples from the training dataset and applying a forward pass and a backward pass to each one to get the weight updates.

sample -> forward -> backprop -> repeat

Once the stopping condition has been met, the loop breaks and the trained weights are saved to disk.

def train
  puts "Entered Training"
  i = 0
  start_time = Time.now
  loop do
    # forward pass in the network with a random observation from @dt.sample.
    # eval_or_train is passed as train because the forward method has to pass
    # its results to the backprop method. The backprop method will update the weights
    forward(@dt.sample,{:eval_or_train => 'train'})

    ave_error_history = running_average(1000,@error_history)
    ave_error_history_5000 = running_average(5000,@error_history)
    ave_classification_history = running_average(1000,@classification_history)
    ave_classification_history_5000 = running_average(5000,@classification_history)
    ratio = (ave_classification_history  / ave_classification_history_5000)

    puts "Running Average Error (1000) => #{ave_error_history}"
    puts "Running Average Error (5000) => #{ave_error_history_5000}"
    puts "Running Average Classification (1000) => #{ave_classification_history} "
    puts "Running Average Classification (5000) => #{ave_classification_history_5000}"
    puts "Classification Running Average Ratio => #{ratio}"
    puts "Iteration = #{i}"
    puts "---"

    # stop once the recent classification rate drops below the longer-term rate
    # and at least 60000 iterations have run, then save the weights to disk
    if ratio < 1.0 and i > 60000
      finish_time = Time.now
      File.open('../data/w1.txt','w'){|f| f << Marshal.dump(@w1.to_a)}
      File.open('../data/w2.txt','w'){|f| f << Marshal.dump(@w2.to_a)}
      puts "Total training time was: #{(finish_time - start_time).round(0)} sec"
      break
    end

    i += 1
  end
end
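The listing above relies on a running_average helper that is not shown in this tutorial. A plausible version, together with the stopping rule pulled out into its own method, might look like the sketch below. Both the helper body and the stop_training? name are guesses for illustration, not code from neural_net.rb.

```ruby
# Hypothetical version of the running_average helper used in train:
# the mean of the last n entries of history (or of all entries if fewer exist).
def running_average(n, history)
  window = history.last(n)
  return 0.0 if window.empty?
  window.sum / window.size.to_f
end

# Stop once the short-window classification rate falls below the long-window
# rate and a minimum number of iterations has passed.
# The guard against a zero long-window average is an addition of this sketch.
def stop_training?(history, iteration, min_iterations = 60_000)
  short = running_average(1000, history)
  long  = running_average(5000, history)
  return false if long.zero?
  (short / long) < 1.0 && iteration > min_iterations
end

# 4000 correct predictions followed by 1000 wrong ones: the recent window
# looks worse than the long window, so training would stop after 60000 iterations.
history = [1.0] * 4000 + [0.0] * 1000
puts stop_training?(history, 60_001)
puts stop_training?(history, 100)
```

The idea behind comparing the two windows is that while the network is still improving, the most recent 1000 samples should be classified at least as well as the most recent 5000.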


The forward method consists of a series of matrix operations. The goal of the forward pass is to propagate the input vector forward to end up with a probability distribution over the digit set.

def forward(observation,opts ={})
  #convert the features array into an NMatrix matrix and divide every element by 255.
  #the division scales down the input. The input vector is initialized with size
  #1 bigger than the @input_size. This is to accommodate the bias term
  a1 = observation.features.flatten.to_nm([1,@input_size + 1]) / 255.0

  #Set the bias term equal to 1
  #NMatrix 2 dimensional matrices can be accessed via [row,column]
  a1[0, @input_size ] = 1.0

  #z2 = a1 x @w1, pass the product of the input values and the arc weights forward
  #and sum the products up at each node
  z2 = a1.dot(@w1)

  #apply the activation function to the sum vector element wise
  a2 = activation_function(z2,@hidden_func)

  #resize the hidden layer to add the bias unit
  a2_with_bias = NMatrix.zeroes([1,@hidden_nodes+1])
  a2_with_bias[0,0..(@hidden_nodes-1)] = a2
  a2_with_bias[0,@hidden_nodes] = 1.0

  #z3 = a2 x @w2, propagating the hidden layer forward to get the sums in the output layer
  z3 = a2_with_bias.dot(@w2)
  #Softmax activation function in the output layer
  a3 = activation_function(z3,@output_func)
  #if in training mode, pass the values of the layers to backprop.
  #otherwise return the prediction of the output layer
  if opts[:eval_or_train] == 'train'
    backprop(a1,a2_with_bias,z2,z3,a3,observation.label)
  elsif opts[:eval_or_train] == 'eval'
    return a3.each_with_index.max[1]
  end
end
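To see the same sequence of steps without NMatrix, here is a toy forward pass in plain Ruby for a network with 2 inputs, 2 hidden units and 3 outputs. The helper names (vec_mat, forward_sketch) and the weight values are invented for this example; only the order of operations mirrors the forward method.

```ruby
# Multiply a row vector by a matrix given as an array of rows.
def vec_mat(vec, mat)
  mat.transpose.map { |col| col.zip(vec).sum { |w, x| w * x } }
end

# Softmax with the usual max-subtraction for numerical safety.
def softmax(z)
  max = z.max
  exps = z.map { |v| Math.exp(v - max) }
  total = exps.sum
  exps.map { |e| e / total }
end

# Forward pass for a toy network: scale, add bias, multiply, activate, repeat.
def forward_sketch(pixels, w1, w2)
  a1 = pixels.map { |p| p / 255.0 } + [1.0]   # scale, then append the bias unit
  z2 = vec_mat(a1, w1)                        # sums arriving at the hidden layer
  a2 = z2.map { |v| Math.tanh(v) } + [1.0]    # tanh, then append the bias unit
  z3 = vec_mat(a2, w2)                        # sums arriving at the output layer
  softmax(z3)                                 # probability for each class
end

# Made-up weights; the last row of each matrix belongs to the bias unit.
w1 = [[0.5, -0.5], [0.25, 0.75], [0.0, 0.1]]               # 3 x 2
w2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5], [0.1, 0.1, 0.1]]  # 3 x 3

probs = forward_sketch([255, 0], w1, w2)
puts "probabilities: #{probs.inspect}"
puts "predicted class: #{probs.each_with_index.max[1]}"
```

The real network does exactly this with a 785-element input and much larger matrices, which is where a linear algebra library earns its keep.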


backprop is the final piece of the puzzle. It receives all its parameters from forward and then makes the weight adjustments that will result in a better classification rate. How the weight update works exactly is outside the scope of this tutorial.

def backprop(a1,a2_with_bias,z2,z3,a3,label)
  #initiates the output vector of zeroes
  y = NMatrix.zeroes([1,10])

  #set the label from the data to 1
  #only 1 element can be 1 at a time as classes
  #are mutually exclusive
  y[0,label] = 1.0
  #derivative of the loss function. Difference between predicted
  #values and the true value
  d3 = -(y - a3)

  #using the derivative d3 is a good enough measure to
  #see if the cost is decreasing so we append it to
  #the error history
  @error_history <<  d3.transpose.abs.sum[0]
  #add 1 to the classification history if the prediction
  #is correct, otherwise zero
  @classification_history <<  (a3.each_with_index.max[1] == label ? 1.0 : 0.0)
  #d2 is the derivative at the hidden layer and has the same size as the
  #hidden layer. The range [] operator excludes the bias node.
  #No error is passed to the bias node.
  d2 = (@w2.dot(d3.transpose))[0..(@hidden_nodes-1)] * derivative(z2.transpose,@hidden_func)
  #matrix with dimensions equal to @w1's dimensions (transposed).
  #each element contains the gradient of the cost function
  #with respect to that weight. If the weights
  #are reduced by a small fraction of this value the cost function
  #will go down
  grad1 = d2.dot(a1)
  #same for @w2
  grad2 = d3.transpose.dot(a2_with_bias)

  #updating the weight matrices. The first layer is updated
  #by a factor of 10 less than the second layer for numerical
  #stability. Big weight changes -> big weights -> big sums -> saturated neurons
  @w1 = @w1 - grad1.transpose * @alpha  * 0.1
  @w2 = @w2 - grad2.transpose * @alpha
end
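Although the derivation is out of scope, a small experiment can show the effect of the update. The sketch below performs one gradient step on the output-layer weights only, in plain Ruby, and prints the probability assigned to the correct label before and after. The function name output_layer_step, the learning rate and all the numbers are made up for illustration.

```ruby
# Multiply a row vector by a matrix given as an array of rows.
def vec_mat(vec, mat)
  mat.transpose.map { |col| col.zip(vec).sum { |w, x| w * x } }
end

def softmax(z)
  max = z.max
  exps = z.map { |v| Math.exp(v - max) }
  total = exps.sum
  exps.map { |e| e / total }
end

# One stochastic gradient descent step on the output-layer weights only.
# a2_with_bias: hidden activations plus bias; w2: (hidden + 1) x classes.
def output_layer_step(a2_with_bias, w2, label, alpha)
  a3 = softmax(vec_mat(a2_with_bias, w2))
  y  = Array.new(a3.size) { |k| k == label ? 1.0 : 0.0 }  # one-hot target
  d3 = a3.zip(y).map { |p, t| p - t }                      # same as -(y - a3)
  # The gradient for a weight is the activation of its source unit
  # multiplied by the d3 entry of its target unit.
  w2.each_with_index.map do |row, i|
    row.each_with_index.map { |w, k| w - alpha * a2_with_bias[i] * d3[k] }
  end
end

a2 = [0.4, -0.3, 1.0]                                      # last entry is the bias unit
w2 = [[0.1, 0.0, -0.1], [0.0, 0.2, 0.1], [0.0, 0.0, 0.0]]
label = 2

before = softmax(vec_mat(a2, w2))[label]
w2_new = output_layer_step(a2, w2, label, 0.1)
after  = softmax(vec_mat(a2, w2_new))[label]
puts "p(correct label) before: #{before}"
puts "p(correct label) after:  #{after}"
```

For this single sample the probability of the correct label goes up after the step, which is the behaviour backprop aims for, one observation at a time, across both layers.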


Now that you have a better idea of the internals, go back to the Quick Start section and train a network yourself. I went ahead and made a submission to Kaggle based on this tutorial’s implementation and achieved a classification rate of 95%, which is not bad for a standard network.

Kaggle MNIST Leaderboard

You can create your own submission by running…

ruby neural_net.rb --mode 'train'
ruby neural_net.rb --mode 'eval'

The submission csv file will be in your data folder.
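If you want to build a submission file by hand, the sketch below shows the general shape. It assumes the ImageId,Label layout of Kaggle's sample submission for this competition, with ImageId counting from 1. The submission_csv name is hypothetical, and this is not the code of create_test_submission.

```ruby
# Build the text of a submission file from an array of predicted digits.
# Assumes the ImageId,Label layout, with ImageId counting from 1.
def submission_csv(predictions)
  rows = predictions.each_with_index.map { |label, i| "#{i + 1},#{label}" }
  (["ImageId,Label"] + rows).join("\n") + "\n"
end

puts submission_csv([2, 0, 9])
```

The predictions must be in the same order as the rows of test.csv, since the row position is the only thing linking a prediction to its image.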

You can play around with different hidden node sizes when training by passing the following options…

ruby neural_net.rb --mode train --hidden_nodes x

Lower weight counts make for faster training, so if you get frustrated with the speed, stop and start again with fewer hidden units.

In follow-up tutorials we will try to get a better classification rate and also speed up the execution of the program.