Introduction
In the previous blog post we looked at how a recurrent neural network differs from a feedforward neural network, and why they are better at sequence processing tasks.
In this blog post, we will prepare our training data and start writing some code.
Our training data is just a list of lowercase names. The letters are all lowercase for simplicity, but obviously the network would work if you chose to train it on names with title case. I’ve put a link to the file at the bottom of this blog post for those following at home.
Sequential Data and Time Steps
In the last post we talked a lot about sequential data and “time steps”, but I did not discuss what exactly that will look like in our network. To be concrete about it, let’s look at one piece of training data.
Consider the name “sam”.
The RNN I will be building will be a “character-level” RNN. This means that each element in the sequence will be a single character.
Even though we refer to them as “time steps” there need not be any temporal interpretation of the data. As long as one thing sensibly comes after another, the data is sequential. In this case our time steps are characters.
For language processing there are other ways you might break the sequence down. If you’re processing larger passages of text you might break it down into words or sub-word tokens. We’ll stick with letters.
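To make that concrete, a name like “sam” simply splits into one character per time step:

# Each character is one time step in the sequence
print(list('sam'))   # ['s', 'a', 'm']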
Preparing the Data
First things first, we need to load our data:
names = open('names.txt', 'r').read().splitlines()
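As a quick sanity check (the exact count and names depend on the names.txt file you use), we can peek at what we just loaded:

# How many names we have, and a look at the first few
print(len(names))
print(names[:3])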
To actually process these sequences in the network, we need to deal with numbers, not characters. The first step in that is to create a mapping of letters to numbers, and vice versa:
# The unique characters across all names, in sorted order
chars = sorted(set(''.join(names)))

# Mappings from characters to integers, and back again
ch_to_i = {ch: i for i, ch in enumerate(chars)}
index_to_char = {i: ch for i, ch in enumerate(chars)}
Where the sequence ends is a piece of information we’d like our network to learn. If our network never learned when to end a sequence it would just be spitting out characters forever!
The simplest way to encode this is to add a special “end of sequence” character. During training, we’ll tell the network that it should be trying to produce this character as output whenever it reaches the end of one of our example names.
import numpy as np
# Add special ending character
ch_to_i['<E>'] = len(chars)
index_to_char[len(chars)] = '<E>'
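Assuming names.txt contains only the lowercase letters ‘a’ through ‘z’, each mapping now has 27 entries. A quick check:

# 26 lowercase letters plus the '<E>' end-of-sequence marker
print(len(ch_to_i))               # 27, assuming an a-z alphabet
print(ch_to_i['a'])               # 0, assuming 'a' is the first character in sorted order
print(index_to_char[len(chars)])  # '<E>'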
Now we have a design decision to make: how do we actually feed our characters to the neural net?
The simplest thing to do would be to feed the integer value associated with each character directly into the neural net. For example, ‘a’ maps to 0 in our ch_to_i dictionary, so we could just feed in a 0.
This is inadvisable for several reasons:
- Because some numbers are higher or lower than others, that implies that there is some order to the characters. Is ‘b’ “greater” than ‘a’? Not in any meaningful way.
- It also implies “distances” between characters that aren’t necessarily meaningful. Is ‘z’ much “further away” from ‘a’ than ‘y’ is? Again, not in any meaningful way.
- It complicates how we interpret the output of our network.
For these reasons we will choose a different approach that is still very simple: one-hot encoded vectors.
Each one-hot encoded vector is a vector with as many entries as there are characters we could choose from (lower-case letters and the special EOS character). It will be all zeros except for the entry which corresponds to the character it represents.
For example, ‘a’ is the first letter in our ch_to_i mapping, so the one-hot encoded vector for it will look like this:
[1, 0, 0, 0, …, 0]
‘b’ will look like this:
[0, 1, 0, 0, …, 0]
And so on.
This solves all of the problems we would face if we were to use a simple integer representation:
- There is no longer an ordinal relationship between the characters. The magnitudes of the one-hot vectors are all the same.
- All one-hot vectors are the same distance from each other (a Euclidean distance of √2 between any two distinct vectors) — see the quick check just after this list.
- The output layer of our network can now simply be the same size as our one-hot encoded vectors. The output neuron with the highest activation will be interpreted as the predicted class. For example, if the first entry has the highest value, our model guessed that the next character in the sequence is ‘a’.
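Here’s that quick check of the last two properties, assuming our 27-character vocabulary (‘a’–‘z’ plus ‘<E>’):

import numpy as np

# Build three one-hot vectors by hand
a = np.zeros(27); a[0] = 1    # 'a'
b = np.zeros(27); b[1] = 1    # 'b'
z = np.zeros(27); z[25] = 1   # 'z'

# Every vector has the same magnitude...
print(np.linalg.norm(a), np.linalg.norm(b), np.linalg.norm(z))  # 1.0 1.0 1.0
# ...and every pair of distinct vectors is the same distance apart: sqrt(2)
print(np.linalg.norm(a - b), np.linalg.norm(a - z))             # 1.414... 1.414...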
We could have chosen to represent our characters a different way. Especially when working at the word-level, it can be very beneficial to use a technique called “word-embedding”.
If you’re working at the word level, you likely have far more words in your vocabulary than the handful of characters our character-level model deals with. In that case it can be totally impractical to one-hot encode your words, because each one-hot vector would be extremely high-dimensional.
A better alternative is to encode the word into a lower dimensional vector. That vector’s values will be real valued (not just 0 and 1). This can allow you to model complicated relationships between words, and represent many more words in a compact way. The distances between characters might not have been meaningful, but the distances between word-embeddings can be. For example the distance between the word vector for “King” and “Prince” should probably be shorter than the distance between two totally unrelated words. You can even let the training process of your neural net learn the correct word embeddings for you.
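For intuition only, here is a minimal sketch of what an embedding lookup might look like. The vocabulary, dimension, and values here are made up, and in a real model the embedding matrix would be learned during training rather than random:

import numpy as np

vocab = {'king': 0, 'prince': 1, 'banana': 2}  # hypothetical word-to-index map
embedding_dim = 4                              # far smaller than a real vocabulary
embeddings = np.random.randn(len(vocab), embedding_dim)  # learned in practice, random here

def embed(word):
    # Look up the dense, real-valued vector for a word
    return embeddings[vocab[word]]

print(embed('king'))   # a 4-dimensional real-valued vector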
All that being said, we’ll stick with one-hot encoded vectors.
We’ll encode each name in our training data as a matrix/tensor. Each row in our tensor will represent a letter in the name, and it will be a one-hot encoded vector:
def character_tensor(ch):
    # One-hot vector: all zeros except a 1 at the character's index
    tensor = np.zeros(len(ch_to_i))
    tensor[ch_to_i[ch]] = 1
    return tensor
def name_tensor(name):
    # One row per character in the name, each row a one-hot vector
    tensor = np.zeros((len(name), len(ch_to_i)))
    for chi in range(len(name)):
        ch = name[chi]
        tensor[chi] = character_tensor(ch)
    return tensor
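For example (assuming the 27-character vocabulary from earlier), “sam” becomes a 3 × 27 matrix, and taking the argmax of each row recovers the characters:

sam = name_tensor('sam')
print(sam.shape)                                       # (3, 27)
print([index_to_char[i] for i in sam.argmax(axis=1)])  # ['s', 'a', 'm']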
We’ll need something else to train our neural net: the targets/labels. In another supervised learning context you might be given some data and the correct output along with it. For example, you might be given the pixel data for a hand-written character and the target would be the correct classification of that character.
If you notice, our data is just a list of names; there are no explicit labels. We can get our targets by simply shifting the input one character to the left (dropping the first character) and appending our special end-of-sequence character.
For example, if the input name was sam, our targets would be the one-hot encoded vectors corresponding to am<E>. When we give our network the letter ‘s’ we will expect it to output the letter ‘a’. When we give it an ‘s’ followed by an ‘a’ we expect it to output an ‘m’, and so on.
def target_tensor(name):
    # The targets are the input characters from position 1 onward,
    # followed by the '<E>' end-of-sequence marker
    tensor = np.zeros((len(name), len(ch_to_i)))
    for chi, ch in enumerate(name[1:]):
        tensor[chi] = character_tensor(ch)
    # The final target is always the end-of-sequence character
    tensor[len(name) - 1][ch_to_i['<E>']] = 1
    return tensor
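Decoding the targets for “sam” back into characters shows the shifted sequence plus the end marker:

targets = target_tensor('sam')
print([index_to_char[i] for i in targets.argmax(axis=1)])  # ['a', 'm', '<E>']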
def get_training_example(name):
    # Pair each name's input matrix with its target matrix
    return name_tensor(name), target_tensor(name)

inputs = [get_training_example(name) for name in names]
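Each entry in inputs is an (input, target) pair of matrices, one row per character. The exact number of pairs depends on how many names are in your file:

x, y = inputs[0]
print(len(inputs))       # one training example per name
print(x.shape, y.shape)  # both (len(name), 27) for a 27-character vocabulary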
Conclusion
In this blog post we took a look at our training data and, after evaluating the alternatives, decided upon one-hot encoded vectors as a representation of our characters.
We’ve processed our data into labeled training examples now, so we’re ready to begin training our neural network. In the next post, we’ll be talking about the forward pass of our recurrent neural network.
Thank you for reading!
Here is the list of names I used to generate the training data:
Credit goes to Andrej Karpathy for creating this training data set.