Neural Networks - Introduction
In previous lectures, we have discussed evolutionary algorithms as a nature-inspired optimization technique. Now we will leave evolutionary algorithms aside for a moment and look at a technique inspired by the workings of the nervous system - neural networks.
Neural networks are currently experiencing a boom in the field of deep learning and are performing very well on many machine learning problems. In particular, the so-called convolutional networks that are used for image processing are very popular. Before we get to the description of these more complex models, let’s look at the very basics. Today we will describe the perceptron and the multilayer perceptron.
Machine learning and data preparation
Neural networks are very often used for classification or regression in machine learning. So before we get to what neural networks are, let us take a brief look at what form the data can take and how to work with it as input or output of a machine learning model.
Data sets typically contain inputs in the form of vectors (we’ll denote them by $x = (x_1, \dots, x_n)$) and the desired outputs as (typically) a single value ($y$). However, different data sets differ in the type of data for each attribute $x_i$ and the target value $y$. In general, the inputs and outputs can be of three types - numeric, ordinal and categorical.
Numerical attributes can be used directly as inputs to most models. However, if different attributes have different ranges, it is usually better to rescale the data in some way. Typical approaches include scaling each attribute to the interval $[0,1]$, or so-called normalization, i.e., subtracting the mean and dividing by the standard deviation. In the former case, we cannot rely on the values always being between 0 and 1 - for example, we may later encounter a value larger than the maximum in the training set, and its scaled value will then be larger than 1.
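Both rescaling approaches can be sketched in a few lines; the data values below are hypothetical and serve only to illustrate the computation (the statistics are always taken from the training set alone).

```python
import numpy as np

# Toy training data: rows are samples, columns are attributes with very
# different ranges (hypothetical values for illustration only).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 300.0]])

# Min-max scaling of each attribute to the interval [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Normalization: subtract the mean, divide by the standard deviation.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mu) / sigma

# A later value above the training maximum scales to more than 1.
x_new = np.array([5.0, 500.0])
print((x_new - x_min) / (x_max - x_min))  # both coordinates exceed 1
```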
Categorical attributes (e.g. text descriptions) are attributes that take a value from some finite set. At the same time, there is no ordering between the values. Such attributes need to be converted to numeric values before they can be used. If a given categorical attribute can have $n$ values, it is typically encoded using a vector of $n$ values from the set $\{0, 1\}$, with only the position in the vector that corresponds to the attribute value set to 1 (e.g., if the attribute has the $i$-th value from the set, only the $i$-th coordinate of the vector is set to 1). This is often called one-hot encoding.
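A minimal sketch of this encoding; the attribute values used here are made up for illustration.

```python
import numpy as np

# Hypothetical categorical attribute with n = 3 possible values.
values = ["red", "green", "blue"]
index = {v: i for i, v in enumerate(values)}

def one_hot(value, n):
    """Encode a categorical value as a vector of n zeros with a single 1
    at the position corresponding to the value."""
    vec = np.zeros(n)
    vec[index[value]] = 1.0
    return vec

print(one_hot("green", 3))  # [0. 1. 0.]
```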
Ordinal attributes are somewhere between numeric and categorical and can be encoded using any of the methods.
The type of target values determines the type of problem to be solved - categorical target values lead to classification (determining which of the classes an object belongs to), numerical values lead to regression (predicting the value). Their encoding during training is the same as for the attributes, but we need to keep in mind that the encoding needs to be inverted again after the model outputs its prediction so that the result makes sense.
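Inverting the encoding of a classification output can be sketched as follows: given one score per class (a hypothetical network output), we pick the class whose coordinate is largest.

```python
import numpy as np

# Hypothetical network output for a 3-class problem: one score per class,
# in the same order as the classes were encoded.
classes = ["red", "green", "blue"]
scores = np.array([0.1, 0.7, 0.2])

# Inverting the one-hot encoding: take the class with the largest output.
predicted = classes[int(np.argmax(scores))]
print(predicted)  # green
```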
Perceptron
The basic unit in a neural network is a single neuron. A perceptron is an algorithm (a neural network) consisting of a single such neuron. This artificial neuron is remotely inspired by natural neurons. It has several inputs $x_i$ and one output. Each of the inputs is assigned a weight $w_i$. The perceptron first computes its activation as a weighted sum of the inputs, $\xi = b + \sum_{i=1}^n w_i x_i$. The term $b$ is the so-called bias and determines how large the weighted sum of the inputs must be for the perceptron to activate. If the activation is greater than the threshold $0$, the perceptron outputs $1$, otherwise it outputs $0$.
Explicit handling of the bias $b$ is not needed - very often the inputs are extended by adding one coordinate with a constant value of $1$. The bias then becomes just another weight. The perceptron's computation can then be written as $f(\sum_{i=0}^n w_i x_i)$, where $x_0 = 1$ and $f$ is a function that returns 0 for $x < 0$ and 1 otherwise.
In addition to the variant that returns the binary values $0$ and $1$, a variant that returns the so-called bipolar values $-1$ and $1$ is often used. The only difference is in the function $f$.
Training a perceptron is very simple - one sequentially presents inputs $(x, y)$ from the training set and updates the weights according to the equation \(w_i = w_i + r (y - f (x))x_i,\) where $r$ is a learning rate that determines how fast the weight vector changes. Note that the expression in parentheses is 0 if the perceptron’s output is correct and 1 or -1 if it is incorrect.
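The training loop above can be sketched on a toy linearly separable problem - the logical AND of two binary inputs. This is only a minimal illustration of the update rule, with the bias folded into the weights as described earlier.

```python
import numpy as np

def step(xi):
    """Threshold activation: 0 for xi < 0, 1 otherwise."""
    return 1 if xi >= 0 else 0

# Toy linearly separable data: logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

# Fold the bias into the weights by prepending a constant input of 1.
Xb = np.hstack([np.ones((len(X), 1)), X])
w = np.zeros(3)
r = 0.1  # learning rate

for epoch in range(20):
    for x_vec, target in zip(Xb, y):
        out = step(w @ x_vec)
        w += r * (target - out) * x_vec  # perceptron update rule

print([step(w @ x_vec) for x_vec in Xb])  # [0, 0, 0, 1]
```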
It can be shown that the algorithm described in this way converges if the classes in the data are so-called linearly separable, i.e., they can be separated by a hyperplane (a $(D-1)$-dimensional plane in $D$-dimensional space, e.g., a line in 2D or a plane in 3D) in the input space.
In some cases, the classes for the perceptron are encoded as $-1$ and $1$. Then perceptron training can also be viewed as a gradient optimization of the error function $-\sum_i y_i(x_i^T w)$, where the sum runs over the misclassified examples.
Multilayer perceptrons (feedforward neural networks)
The most famous type of neural network, the multilayer perceptron, or feedforward neural network, is formed by connecting several perceptrons. It consists of layers of the perceptrons described above: the inputs of the first (input) layer are directly the data from the training (or test) set, and the inputs of each subsequent (hidden and output) layer are the outputs of the previous layer.
In multilayer networks, activation functions $f$ other than those mentioned above are usually used. The main reason is that neural networks are typically trained using gradient algorithms, and the functions mentioned above have a gradient of 0 almost always. Instead, what is traditionally used is the so-called logistic sigmoid - $f(x) = \frac{1}{1+e^{-\lambda x}}$. Nowadays, the ReLU (rectified linear unit) function, defined as $f(x) = \max(0, x)$ is also often used.
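Both activation functions are one-liners; here is a minimal sketch, with `lam` standing for the steepness parameter $\lambda$ of the sigmoid.

```python
import numpy as np

def sigmoid(x, lam=1.0):
    """Logistic sigmoid 1 / (1 + e^(-lambda * x))."""
    return 1.0 / (1.0 + np.exp(-lam * x))

def relu(x):
    """Rectified linear unit max(0, x)."""
    return np.maximum(0.0, x)

print(sigmoid(0.0))            # 0.5
print(relu(-2.0), relu(3.0))   # 0.0 3.0
```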
The gradient method is used for training. The gradient of the loss function $L(x,y|w)$ with respect to the weights of the neural network is computed, and the weights are then adjusted according to the rule \(w_i = w_i - \alpha\frac{\partial L(x,y|w)}{\partial w_i},\) where $\alpha$ is the learning rate.
Currently, the gradient is most commonly computed using libraries that can calculate it automatically (so-called automatic differentiation). Practically, all that is needed is to write the forward computation of the network and implement a loss function; the existing tools take care of the gradient calculation. However, for a better understanding of neural networks, it is worth trying to calculate the gradient by hand, at least in this simple case.
Let’s start with the calculation for the output layer. Let us denote by $w^L_{jk}$ the weight between the $j$-th neuron in the last hidden layer and the $k$-th neuron in the output layer. The output of this neuron is $y_k = f(\xi_k)$, where $x_j$ is the output of the $j$-th neuron in the last hidden layer and $\xi_k = \sum_j w^L_{jk} x_j$. Considering the error function $E(x,t|w) = \frac{1}{2}\sum_k (y_k - t_k)^2$ ($t$ denotes the vector of desired outputs, $y$ denotes the network output), its derivative with respect to the weight $w^L_{jk}$ is given by the calculation below (we applied the chain rule).
\[\frac{\partial E}{\partial w^L_{jk}}=\frac{\partial E}{\partial y_{k}}\frac{\partial y_{k}}{\partial\xi_{k}}\frac{\partial\xi_{k}}{\partial w^L_{jk}}=(y_{k}-t_{k})\frac{\partial f(\xi_k)}{\partial\xi_{k}}x_{j}.\]

In order to calculate the derivative of the loss function with respect to the weights in the hidden layers of the neural network, it is useful to imagine what exactly the last two layers of this network compute. We will denote by $k$ the neurons in the output layer, by $j$ the neurons in the last hidden layer, and by $i$ the neurons in the hidden layer before the last hidden layer. The inputs of these neurons will be labeled $x_k$, $x_j$, and $x_i$, their activations will in turn be labeled $\xi$ with the appropriate index, and their activation functions will be labeled $f$ with that index. We denote the weights between the last hidden and output layer as $w^L$ and the weights between the last two hidden layers as $w^{L-1}$. Thus, the last two layers compute the expression $y_{k}=f_{k}(\sum_{j}w_{jk}^L f_{j} (\sum_i w_{ij}^{L-1} x_{i}))$. If we again consider the same loss function, we obtain its derivative, again by the chain rule, as
\[\frac{\partial E}{\partial w_{ij}^{L-1}}=\left[\sum_{k}\frac{\partial E}{\partial\xi_{k}}\frac{\partial\xi_{k}}{\partial y_{j}}\right]\frac{\partial y_j}{\partial\xi_{j}}\frac{\partial\xi_{j}}{\partial w_{ij}^{L-1}}=\left[\sum_{k}\delta_{k}w_{jk}^{L}\right]\frac{\partial f_{j}(\xi_{j})}{\partial\xi_{j}}y_{i}.\]

We can notice that we have already calculated the first part of the sum ($\frac{\partial E}{\partial\xi_{k}}$) in the calculation for the output layer. So let’s denote it by $\delta_k$; this gives us the simplified version of the expression after the last $=$. Now, just note that we have not used anywhere the information that $k$ denotes the output layer, and we can therefore use the same expression for all hidden layers. If we further define \(\delta_{j}=\frac{\partial E}{\partial\xi_{j}}=\frac{\partial f_{j}(\xi_{j})}{\partial\xi_{j}}\sum_k\delta_{k}w_{jk}^{L}\), we get the well-known adaptation rule for training neural network weights (the definition of $\delta_j$ varies depending on whether it is a hidden or output layer): \(w_{ij}^{t+1}=w_{ij}^{t}-\alpha\delta_{j}y_{i},\) where the superscript $t$ denotes the training iteration.
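The whole backpropagation derivation can be sketched for a tiny network with one hidden layer, sigmoid activations ($\lambda = 1$), and the squared error above. The layer sizes and input values are arbitrary choices for illustration; as a sanity check, one entry of the computed gradient is compared against a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny network: 2 inputs -> 3 hidden -> 2 outputs, sigmoids everywhere,
# biases folded into the weight matrices via a constant-1 input.
W1 = rng.normal(size=(3, 3))   # hidden weights w^{L-1}, incl. bias column
W2 = rng.normal(size=(2, 4))   # output weights w^L, incl. bias column

def forward(x):
    xb = np.concatenate([[1.0], x])   # prepend the bias input
    h = sigmoid(W1 @ xb)              # hidden outputs y_j
    hb = np.concatenate([[1.0], h])
    y = sigmoid(W2 @ hb)              # network outputs y_k
    return xb, h, hb, y

def gradients(x, t):
    """Backpropagation for E = 1/2 * sum_k (y_k - t_k)^2."""
    xb, h, hb, y = forward(x)
    delta_out = (y - t) * y * (1 - y)                    # delta_k = dE/dxi_k
    # Propagate deltas through W2 (drop the bias column in the sum over k).
    delta_hid = (W2[:, 1:].T @ delta_out) * h * (1 - h)  # delta_j
    gW2 = np.outer(delta_out, hb)   # dE/dw^L_{jk}     = delta_k * y_j
    gW1 = np.outer(delta_hid, xb)   # dE/dw^{L-1}_{ij} = delta_j * y_i
    return gW1, gW2

# Check one gradient entry against a central finite difference.
x, t = np.array([0.5, -1.0]), np.array([1.0, 0.0])
gW1, gW2 = gradients(x, t)

eps = 1e-6
W1[0, 0] += eps
E_plus = 0.5 * np.sum((forward(x)[3] - t) ** 2)
W1[0, 0] -= 2 * eps
E_minus = 0.5 * np.sum((forward(x)[3] - t) ** 2)
W1[0, 0] += eps
print(abs(gW1[0, 0] - (E_plus - E_minus) / (2 * eps)) < 1e-8)  # True
```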
We have not yet discussed the derivative of the activation function with respect to its input. It can be shown that, for example, for the sigmoid it is $\lambda y_{j}(1-y_{j})$. Also, so far we have only computed the derivative for one input instance - for multiple instances, the derivative is simply the sum of the derivatives over the individual instances.
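The formula for the sigmoid derivative is easy to verify numerically; here is a small check against a central finite difference, with an arbitrary choice of $\lambda$ and the evaluation point.

```python
import numpy as np

lam = 2.0  # arbitrary steepness parameter lambda

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-lam * x))

# Check the formula f'(x) = lambda * y * (1 - y) by finite differences.
x = 0.3
y = sigmoid(x)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(lam * y * (1 - y) - numeric) < 1e-8)  # True
```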