# RNN: Recurrent Neural Networks

In normal feed-forward neural networks the activation flows only in one direction, from the input layer to the output layer, eventually passing through a set of hidden layers. Conversely, recurrent neural networks (RNN) have also connections pointing backward, thus allowing them to take also the temporal dimension into account. This novel architecture enables them to take as their input not just the current input xi but to take into account the state or context (which it’s just an alternative way to refer to the previous value of the hidden layer) hi-1 observed in the previous time-step ti-1 as we can see in the figure below.

The above figure depicts an unrolled representation of a recurrent network but in practice the network is just one with the same weights shared across all the time-steps. Only the inputs and the previous contexts can be different for each time-stamp. In addition, at the first iteration, since we don’t have any, the previous context will be just composed by a vector a zeros, thus not influencing the result of the first output. Thus, if we take a RNN focused on a single time-step it will look like the following picture.

A part of a neural network that preserves some state across time-steps is called a memory cell. Thus, a single layer of recurrent neurons such as hi is a very basic form of memory cell. Now, diving a little deeper into the mathematical details, it will result the new hidden state hi to be computed as:

$h_i=\phi(W_x x_i+W_h h_{i_1}+b_h)$

Where i is the index of the time-step, xi is the corresponding input and Wx is its associated weights matrix. Similarly, hi-1 is the previous hidden state and Wh is its corresponding weights matrix. As usually, bh is the bias term that will be added to the final sum and ϕ it’s a generic activation function such as tanh, softmax or ReLu. Conversely, the final output yi is calculated as:

$y_i=F(W_o h_i+b_o)$

In this case F is the output activation function, Wo are the weights associated to the output layer and bi is its corresponding bias term. RNNs are trained with a technique similar to the standard backpropagation used to train normal feed-forward networks, but a little different since it takes into account the presence of the backward connections. Given its additional complexity, this new backpropagation algorithm will be discussed later in the next section.

Since RNNs take as input also their previous state which recursively depends on the past states, we can say that recurrent networks maintain a sort of memory. This is the main advantage they have over the standard feed-forward ones since they can exploit this information from the past to better predict the future. Moreover, this particular architecture has a stronger analogy to how life beings’ brain works: it tries to find correlations, also known as long-term dependencies, between past situations in order to better understand an event that is happening in the present.

Thanks to their interesting features, RNNs reveal to be very useful in a wide range of applications. In fact, their capability to predict the future make them suitable to time-series analysis and prediction tasks such as stocks prediction or sequences generation. Furthermore, since these kind of networks can work with arbitrary length sequences, they can find also application in the natural language processing field. In fact, RNNs have been largely used in systems such as automatic translation, speech-to-text translation or sentiment analysis. Surprisingly, they can also be used in computer vision performing image classification by decomposing an image into flatten patches and then treating it as a sequence classification task.

### Backpropagation through time

As already mentioned before, the training algorithm for recurrent networks relies on an extension of backpropagation called backpropagation through time (BPTT). First of all, let’s briefly remind that supervised backpropagation algorithm’s goal is to modify the weights of a neural network in order to minimize the error of the network outputs compared to some expected outputs in response to corresponding inputs. Extending this concept, we can interpret backpropagation through time as a generalization of the same algorithm but specifically applied to recurrent neural networks. The first step consists in unrolling the network creating a copy of it for each time-step where each copy will have its own input, hidden layer and output, but always keeping all the weights shared.

Afterwards, a cost function E to be minimized in order to train the network is calculated as the sum over all errors Ei computed at each time-step i:

$E(y',y)=\sum_{i=1}^{T}E(y_i',y_i)$

Where the cost function E can be any cost function such as cross-entropy for classification or mean squared error for regression. Moreover, yi is the network’s output at time-step i and yi‘ is the corresponding expected output. Now, using the chain rule we can start by calculating the gradients of any output error Ei respect to the output weights matrix Wo:

$\frac{\partial Ei}{\partial W_o}=\frac{\partial E_i}{\partial y_i}\frac{\partial y_i}{\partial W_o}$

Now it comes the tricky part. We have to update also the weights matrices Wh and Wx. Nevertheless, Wh doesn’t depend only on hi but also on hi-1, hi-2 and so on until h1. Basically it depends on all the previous hidden states. Hence, all of these gradients have to be accumulated to update Wh.

$\frac{\partial E_i}{\partial W_h}=\sum_{k=1}^{i}\frac{\partial E_i}{\partial y_i}\frac{\partial y_i}{\partial h_i}\frac{\partial h_i}{\partial h_k}\frac{\partial h_k}{\partial W_h}$

$\frac{\partial h_i}{\partial h_k}=\prod_{j=k+1}^{i}\frac{\partial h_j}{\partial h_{j-1}}$

Let’s make things more clear with an example where i=3, that is, we are going to compute the gradient of Wh respect to E3:

$\frac{\partial E_3}{\partial W_h}=\frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial W_h}+\frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial W_h}+\frac{\partial E_3}{\partial y_3}\frac{\partial y_3}{\partial h_3}\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W_h}$

Finally, we can compute the gradient of Ei respect to Wx in the usual way:

$\frac{\partial Ei}{\partial W_x}=\frac{\partial E_i}{\partial y_i}\frac{\partial y_i}{\partial x_i}\frac{\partial x_i}{\partial X_h}$

Considering that these long calculus have to be repeated for each time-step in order to obtain all the partial losses Ei and that input sequences are often comprised of thousands of time-steps, it would require a huge number of derivatives to perform a single weights update. Thus, BPTT results to be a quite inefficient algorithm in terms of computation time. However, this is not even its main drawback. As we are going to see in the next section, BPTT can suffer of other two more serious problem which could severely affect the training process.

### Vanishing / exploding gradients problem

It’s well known that an amount slightly greater than one can become immeasurably large when multiplied by itself several times. Conversely, an amount slightly lower than one can quickly plummet toward zero when repeatedly multiplied by itself. Similarly, since deep neural networks’ layers relate to each other through multiplication, their derivatives are susceptible to vanishing or exploding gradient as well. In addition, we have just observed how long chain rules are involved in the backpropagation through time stacking several derivatives multiplications on top of each other, thus making vanishing or exploding gradient problem even more serious than it was in normal backpropagation. To further investigate this problem, let’s carefully analyze the equation used to compute the value of the hidden layer hi:

$h_i=\phi(W_x x_i+W_h h_{i_1}+b_h)$

Exploiting a simple parametrization, we can rewrite the above formula as:

$h_i=W_x x_i+W_h\phi(h_{i_1})+b_h$

Now, calculating the derivative of hi over hi-1 we obtain:

$\frac{\partial h_i}{\partial h_{i-1}}=W_h diag[\phi'(h_{i_1})]$

Where ρ’ is simply the derivative of ρ, which is a diagonal matrix containing all derivatives ρ'(hi-11) to ρ'(hi-1n) along its diagonal where n is the number of neurons in the hidden layer. Next, reminding that when applying the l2 norm to the products of two matrices A and B it results that |A*B|<= |A|*|B|, we can rewrite the above formula as:

$||\frac{\partial h_i}{\partial h_{i-1}}||<=||W_h||*||diag[\phi'(h_{i_1})]||$

Anyway, in the previous section we also calculated that:

$\frac{\partial h_i}{\partial h_k}=\prod_{j=k+1}^{i}\frac{\partial h_j}{\partial h_{j-1}}$

Finally, combining the last two formulas, it can be proven that:

$\prod_{j=k+1}^{i}\frac{\partial h_j}{\partial h_{j-1}}<=(\beta_w \beta_h)^{i-k}$

Where βx and βh are defined as the upper bounds of Wh and diag[ρ'(hi-1)] respectively. In the same way a product of k real numbers can shrink to zero or explode to infinity, so can a product of matrices can too. Therefore:

Anyway, exploding gradients problem is easier to solve because gradients can just be truncated or clipped in order to prevent them to keep growing. On the other hand, vanishing gradients problem is much harder to deal with since gradients can become too small and it’s not clear how to solve this issue. Thus, the vanishing gradient problem can make the model forget information from past while the exploding gradients problem, if not correctly handled, can make the training of the model fails to converge.

### Truncated backpropagation through time

One way to ease the problems presented in the previous section consist in using a variant of the BPTT algorithm called truncated backpropagation through time (TBPTT). As its name suggests, this new variant doesn’t involve the whole sequence every time in the training process but it only takes into account the most recent time-steps. In particular, it processes the sequence one time-step at a time, and every k1 time-steps, it runs BPTT only on the last k2 time-steps, so a parameter update can be cheap if k2 is small. This algorithm is then characterized by two hyper-parameters k1 and k2:

• k1: The number of forward-pass time-steps between weights updates. This parameter influences how slow or fast training will be, given how often weight updates are performed.
• k2: How many time-steps will be involved in the BPTT. Generally, it should be large enough to capture the temporal structure in the problem for the network to learn. If too large it can lead to vanishing gradients problem.

Although this technique seems a very promising way to solve vanishing / exploding gradients problem, anyway, since gradients can only flow back for a limited number of time-steps due to the truncation, consequently the network can’t learn dependencies that are too far in the past as it could be possible with BPTT. Given the many drawbacks presented in the vanilla version of recurrent neural networks, the way to deal with the vanishing gradient problem is to use more complex recurrent hidden units like GRU and LSTM. The main ideas behind these models is to propose a better memory mechanism to capture long distance dependencies which enable models to remember the important information and forget the unimportant ones.

## One thought on “RNN: Recurrent Neural Networks”

1. Liu Yan Ping says:

Difficult to understand.

Like