Neural Networks - RBF Networks and Recurrent Networks
In today’s lecture, we are going to look at two different types of neural networks. First, we are going to introduce an architecture that uses local units in the input layer, and then we are going to look at recurrent neural networks, where the weights between neurons can form cycles.
RBF networks
Neural functions with local units based on Radial Basis Functions (RBF networks) represent an interesting alternative to perceptron neural networks. Unlike perceptron neural networks, the neurons in the input layer of RBF networks calculate some function depending on the distance from their “weight” vector (the word “weight” is in quotation marks because there are no weights - typically this vector is called “center” ). We can therefore express the calculation of these neurons as $y = \rho(||x-c||)$, where $\rho$ is some activation function (kernel), $c_i$ is the center of the neuron, $x_i$ is the input and $||\cdot||$ is some norm - often the Euclidean distance, but other ones can be used. The Gaussian function $\rho(x) = e^{-\beta x^2}$ is often used as a kernel.
A typical RBF network has only one layer of neurons (perceptrons) after the RBF layer. So the whole network computes the function \(\varphi(x) = \sum_{i=1}^N w_i \rho(x, c),\) where $w_i$ are the weights for the output layer (in the case of multiple output neurons, we have a matrix of weights).
Note that the activation function in the RBF network is local, i.e. it has the largest value for inputs that are close to the center of the neuron and for inputs further from the center their activation decreases to 0.
Training of RBF networks is typically done in two phases, first using some clustering algorithm (e.g. $k$-means, see below) the centers of the neurons in the input layer are set, then the $\beta$ parameter in each neuron is calculated as \(\beta = \frac{1}{2\sigma^2},\) where $\sigma$ is the average distance of the objects in the respective cluster from its center. Finally, the output layer weights are trained. Here, it is sufficient to gradually present the inputs from the training set, compute the activations of the input layer, and then use algorithms for linear regression. The output layer only calculates a linear combination of activations of the input layer.
An alternative to the described training procedure can be the use of a gradient method for setting the centers of neurons and weights in the output layer.
$k$-means algorithm
We mentioned that clustering algorithms are used to find the centers of neurons in the input layer, most often the so-called $k$-means algorithm. It is an algorithm that tries to find clusters in the input data - groups of data that are close to each other, but at the same time far from other groups. It is an unsupervised learning algorithm, so it does not use any target values of $y$. The algorithm receives as input the data vectors $x_i$ and the number $k$, which indicates the required number of clusters. The algorithm starts by randomly choosing $k$ cluster centers of $m_j$ (often as $k$ samples from the input data). Then it repeats the phase of matching and recalculating centers. In the assignment phase, each sample from the input data is assigned to the cluster whose center is closest to it. In the center recalculation phase, each center is recalculated as the center of the data cluster it belongs to. These two phases are repeated until the algorithm converges or for a specified number of iterations.
Recurrent neural networks
In the neural networks that we have mentioned so far, the connections always lead in only one direction, and the entire neural network therefore forms an acyclic graph. Such neural networks behave completely reactively and their previous inputs do not affect the following inputs in any way. This is suitable for many common problems, such as image classification, etc., but if the inputs to the neural network are sequences of different lengths where individual items interact, it is more appropriate to add some internal state to the neural network, typically represented by a connection that leads back. Networks with such connections are called recurrent and are used, for example, for predicting time series, machine translation, or text generation (although for tasks in natural language processing the transformer architecture is more common nowadays).
The simplest recurrent network can be a network with only one input, one output, and one neuron in a single hidden layer. This neuron will then have a recurrent connection to itself. The meaning of this connection is that when an input is presented, the neuron receives, in addition to its normal input, its activation in the previous step (when the previous input was presented). So a neuron actually calculates its current activation as a weighted sum of its input and its previous output. In more general recurrent networks, the meaning is similar - recurrent weights transmit the state from the previous time step.
The calculation of the recurrent network happens in such a way that we gradually present it with inputs and it calculates outputs based on them (and on its activations in the previous steps). We handle the outputs according to what we want to do with the network. If, for example, it is to predict the value of a time series, we first present all the inputs and only then look at the output. When translating, we first present a sentence in one language and then wait for the network to generate a sentence in the target language. When generating the text, we let the network generate words/letters and give it its previous outputs as input. Anyway, for the sequence of inputs, we get a sequence of outputs, which we can compare with some desired sequence during training.
Training of recurrent neural networks
The back-propagation algorithm that we showed last time can be used for training of recurrent neural networks, but the network needs to be first expanded in time, i.e. copied for each input (and output) step. This modified version of the algorithm is called backpropagation through time.
There can be problems with training recurrent connections. When we realize how the backpropagation algorithm works, we will see that in it the gradient from the previous layer is always multiplied by the weight. If this is a recurring weight, it is repeatedly multiplied by the same value. If this value is greater than 1, the gradients grow uncontrollably (exploding gradients), if it is less than 1, they decrease to 0 (vanishing gradients). Both situations lead to the algorithm not working very well.
This problem can be solved in two ways - either the recurrent part of the network is not trained at all, as in the so-called Echo state networks, or the recurrent weight is fixed at 1 and the state of the network is managed explicitly, as in the so-called Long-short term memory networks.
Echo state networks
Echo state networks (ESNs) have a large recurrent layer at the input, which has weights initialized randomly, and these are not trained in any way. Typically, this layer is implemented as a single $k \times m$ matrix, where $k = n + m$, $m$ is the size of the recurrent layer, and $n$ is the number of inputs. The network works by receiving vectors of length $n$ as input and appending its internal state of length $m$ to them. It applies its recurrent layer to it and thus gets a new internal state. The recurrent layer typically contains some activation function to make it non-linear (it is applied to all activations - the results of multiplying by a random matrix). After the recurrent layer, traditional ESNs contain only one layer, which, similarly to RBF networks, can be trained using linear regression or using the gradient method.
ESN may look strange at first glance, why should a random matrix give reasonable results? An important observation is that due to the size of the internal state (which is often larger than the number of inputs), the matrix actually randomly transforms the information from the input into a higher dimensional space. Other layers then actually try to decode this information and get the necessary information from it. Note also that the internal state of the network depends on all previous inputs, not just the last one, and thus contains information about all of them.
Long short term memory
Long short term memory (LSTM) networks solve the training problem a little differently. They replace each neuron with a so-called LSTM cell that explicitly works with its state (memory), and recurrent connections between individual cells then have a weight fixed at 1.
The input of each cell is its state $c_{t-1}$ from the previous step and the concatenation of its outputs from the previous state and the new inputs $[h_{t-1}, x_{t}]$. Based on the inputs, the values of the so-called gates are first calculated - forget gate $f_t$ and input gate $i_t$. Both are calculated similarly as
\[f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)\] \[i_t = \sigma(W_i[h_{t-1}, x_t] + b_i),\]where $W_f, W_i, b_f$, and $b_i$ are the weights and biases for the forget and input gates, respectively, and $\sigma$ is the sigmoid function. From the same values, we also calculate a candidate for a new internal state as
\[\bar{C_t} = \tanh(W_c[h_{t-1}, x_t] + b_c),\]where again $W_c$ are the weights and $b_c$ are the thresholds for the state calculation. The new state of the cell is then calculated as
\[C_t = f_t \ast C_{t-1} + i_t \ast \bar{C_t},\]where the operation $\ast$ means coordinate-wise multiplication.
Finally, we calculate the output of cell $h_t$. This is calculated from the current state of the cell and its input using the output gate $o_t$ as
\[o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\] \[h_t = o_t \ast \tanh(C_t)\]where again $W_o$ and $b_o$ are the weights and thresholds for the output gate.
The transfer of information between individual time steps in LSTM networks takes place using the internal state. There is no weight directly associated with this, so we can avoid problems with exploding and vanishing gradients when propagating the error in the network. Thanks to this property, LSTM networks can remember longer sequences of inputs than the basic types of recurrent networks.
Further Reading
The information in this part is not necessary for the exam, but it may further explain some of the things mentioned above.
- A Practical Guide to Applying Echo State Networks - nice decription of the ESN and guide to applying them
- The Unreasonable Effectiveness of Recurrent Neural Networks - brief description of recurrent networks and some examples how they can be used to generate text one character at a time