Data science

In the last two lectures, we discussed a general framework for learning: neural networks.

History and recent surge

From Wang and Raj (2017):

The current AI wave began in 2012, when AlexNet (60 million parameters) cut the error rate in the ImageNet competition (classifying 1.2 million natural images) by half.

Learning resources

Single-layer neural network (SLP)

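A single-layer network maps inputs straight to outputs through one weight matrix followed by a nonlinearity. A minimal NumPy sketch (the sigmoid activation and the layer sizes here are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: squashes values into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def slp_forward(X, W, b):
    # Single-layer network: inputs connect directly to outputs.
    # X: (n_samples, n_inputs), W: (n_inputs, n_outputs), b: (n_outputs,)
    return sigmoid(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 input features
W = rng.normal(size=(3, 2))   # 3 inputs -> 2 outputs
b = np.zeros(2)
print(slp_forward(X, W, b))   # (4, 2) array of values in (0, 1)
```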

Multi-layer neural network (MLP)
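A multi-layer network composes such layers, inserting one or more hidden layers between input and output. A sketch extending the single-layer code above with one hidden layer (activations and sizes again illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, params):
    # One hidden layer: input -> hidden (ReLU) -> output (sigmoid).
    W1, b1, W2, b2 = params
    H = relu(X @ W1 + b1)        # hidden representation
    return sigmoid(H @ W2 + b2)  # output layer

rng = np.random.default_rng(0)
params = (rng.normal(size=(3, 8)), np.zeros(8),  # input -> hidden
          rng.normal(size=(8, 2)), np.zeros(2))  # hidden -> output
X = rng.normal(size=(4, 3))
print(mlp_forward(X, params).shape)  # (4, 2)
```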

Expressivity of neural networks

Universal approximation properties
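Informally (Cybenko 1989; Hornik 1991): a network with one hidden layer and a suitable nonlinearity $\sigma$ can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units. For any continuous $f$ on $[0,1]^d$ and any $\epsilon > 0$, there exist $N$, weights $w_i, b_i$, and coefficients $\alpha_i$ such that

$$
\sup_{x \in [0,1]^d} \left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^\top x + b_i) \right| < \epsilon.
$$

Note this is an existence result only: it says nothing about how large $N$ must be, or whether training will actually find such weights.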

Practical issues

Neural networks are not the fully automatic tool they are sometimes advertised to be; as with all statistical models, subject-matter knowledge can, and often should, be used to improve their performance.

Convolutional neural networks (CNN)

Source: https://colah.github.io/posts/2014-07-Conv-Nets-Modular/
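The core operation is sliding a small filter across the image and taking a local weighted sum at each position, with the same filter weights reused everywhere. A minimal "valid" 2-D convolution in NumPy (strictly, cross-correlation, as in most deep learning libraries; the edge-detecting filter is an illustrative choice):

```python
import numpy as np

def conv2d(image, kernel):
    # 'Valid' 2-D cross-correlation: slide the kernel over the image
    # and take a local weighted sum at each position. The kernel
    # weights are shared across all positions.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                # right half bright: a vertical edge
kernel = np.array([[1.0, -1.0]])  # responds to horizontal change
print(conv2d(image, kernel))      # nonzero only along the edge
```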

Example: handwritten digit recognition

Results (320 training cases, 160 test cases):

| network | links | weights | test accuracy |
|---------|-------|---------|---------------|
| Net-1   | 2570  | 2570    | 80.0%         |
| Net-2   | 3214  | 3214    | 87.0%         |
| Net-3   | 1226  | 1226    | 88.5%         |
| Net-4   | 2266  | 1131    | 94.0%         |
| Net-5   | 5194  | 1060    | 98.4%         |

Net-5 and similar networks were state-of-the-art in the early 1990s. The constrained networks (Net-4 and Net-5) have fewer weights than links because convolution-style weight sharing reuses the same weight across many connections.

Example: image classification

Source: http://cs231n.github.io/convolutional-networks/

Recurrent neural networks (RNN)
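An RNN processes a sequence one step at a time, carrying a hidden state forward and applying the same weights at every step. A minimal sketch of the vanilla recurrence $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$ (sizes illustrative):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    # Vanilla RNN: the same (Wx, Wh, b) are applied at every time step;
    # the hidden state h carries information along the sequence.
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:                  # xs: sequence of input vectors
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return np.stack(states)       # (seq_len, hidden_size)

rng = np.random.default_rng(0)
Wx = 0.1 * rng.normal(size=(5, 3))   # input -> hidden
Wh = 0.1 * rng.normal(size=(5, 5))   # hidden -> hidden
b = np.zeros(5)
xs = rng.normal(size=(7, 3))         # a sequence of 7 inputs
print(rnn_forward(xs, Wx, Wh, b).shape)  # (7, 5)
```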

Generative Adversarial Networks (GANs)

"The coolest idea in deep learning in the last 20 years."
- Yann LeCun, on GANs.
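The idea (Goodfellow et al., 2014) is a two-player game: a generator $G$ maps noise $z$ to fake samples, while a discriminator $D$ tries to tell real data from fakes. Training solves the minimax problem

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log(1 - D(G(z)))].
$$

$D$ is pushed to assign high probability to real data and low probability to generated samples, while $G$ is pushed to produce samples that $D$ mistakes for real.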