In the last two lectures, we discuss a general framework for learning: neural networks.
The current AI wave began in 2012, when AlexNet (60 million parameters) cut the error rate in the ImageNet competition (classifying 1.2 million natural images) by nearly half.
Elements of Statistical Learning (ESL) Chapter 11: https://web.stanford.edu/~hastie/ElemStatLearn/.
Stanford CS231n: http://cs231n.github.io.
On the origin of deep learning by Wang and Raj (2017): https://arxiv.org/pdf/1702.07800.pdf
Aka single layer perceptron (SLP) or single hidden layer back-propagation network.
Sum of nonlinear functions of linear combinations of the inputs, typically represented by a network diagram.
Mathematical model: \[\begin{eqnarray*} Z_m &=& \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M \\ T_k &=& \beta_{0k} + \beta_k^T Z, \quad k = 1,\ldots, K \\ Y_k &=& f_k(X) = g_k(T), \quad k = 1, \ldots, K. \end{eqnarray*}\]
Output layer: \(Y=(Y_1, \ldots, Y_K)\) is the \(K\)-dimensional output. For a univariate response, \(K=1\); for \(K\)-class classification, the \(k\)-th unit models the probability of class \(k\).
Input layer: \(X=(X_1, \ldots, X_p)\) is the \(p\)-dimensional vector of input features.
Hidden layer: \(Z=(Z_1, \ldots, Z_M)\) are derived features, each obtained by applying the activation function \(\sigma\) to a linear combination of the inputs \(X\).
\(T=(T_1, \ldots, T_K)\) are the output features that are directly associated with the outputs \(Y\) through output functions \(g_k(\cdot)\).
\(g_k(T) = T_k\) for regression. \(g_k(T) = e^{T_k} / \sum_{j=1}^K e^{T_j}\) for \(K\)-class classification (softmax regression).
Number of weights (parameters) is \(M(p+1) + K(M+1)\).
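The model above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`softmax`, `forward`, `n_weights`), the array shapes, and the random weights are all assumptions made for the example. Observations are stored as rows of `X`, so \(\alpha_m^T X\) becomes `X @ alpha`.

```python
import numpy as np

def softmax(t):
    """Row-wise softmax g_k(T); subtracting the row max avoids overflow."""
    t = t - t.max(axis=1, keepdims=True)
    e = np.exp(t)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, alpha0, alpha, beta0, beta, sigma, classification=True):
    """Forward pass of a single-hidden-layer network.

    Assumed shapes: X (n, p), alpha0 (M,), alpha (p, M),
    beta0 (K,), beta (M, K).
    """
    Z = sigma(alpha0 + X @ alpha)   # hidden units Z_m, shape (n, M)
    T = beta0 + Z @ beta            # output features T_k, shape (n, K)
    return softmax(T) if classification else T  # identity g_k for regression

def n_weights(p, M, K):
    """Number of weights: M(p+1) + K(M+1)."""
    return M * (p + 1) + K * (M + 1)

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, p, M, K = 5, 4, 3, 2
alpha0, alpha = rng.normal(size=M), rng.normal(size=(p, M))
beta0, beta = rng.normal(size=K), rng.normal(size=(M, K))
X = rng.normal(size=(n, p))

relu = lambda v: np.maximum(0.0, v)
Y = forward(X, alpha0, alpha, beta0, beta, relu)

# Count the entries actually stored and compare with the formula.
total = alpha0.size + alpha.size + beta0.size + beta.size
```

With \(p=4\), \(M=3\), \(K=2\), the formula gives \(3 \cdot 5 + 2 \cdot 4 = 23\) weights, which matches the number of array entries counted in `total`; each row of `Y` sums to 1 in the classification case.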
Activation function \(\sigma\):
\(\sigma(v)=\) a step function: used in early models of the human brain, where each unit represents a neuron and the connections represent synapses; a neuron fires when the total signal passed to that unit exceeds a certain threshold.
Rectifier: \(\sigma(v) = v_+ = \max(0, v)\). A unit employing the rectifier is called a rectified linear unit (ReLU). According to Wikipedia, the rectifier is, as of 2018, the most popular activation function for deep neural networks.
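The two activation functions above can be compared on a few inputs. The sketch below is illustrative only: the function names are made up, and the step function's threshold of 0 is an assumption, since the text only says "a certain threshold".

```python
import numpy as np

def step(v, threshold=0.0):
    """Step activation: 1 if the signal exceeds the threshold, else 0.
    The default threshold of 0 is an assumption for this example."""
    return (np.asarray(v) > threshold).astype(float)

def relu(v):
    """Rectifier: sigma(v) = max(0, v), applied elementwise."""
    return np.maximum(0.0, v)

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
step_out = step(v)   # [0., 0., 0., 1., 1.]
relu_out = relu(v)   # [0., 0., 0., 0.5, 2.]
```

The step function discards the size of the signal, while the rectifier passes positive signals through unchanged.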
Given training data \((x_1, y_1), \ldots, (x_n, y_n)\), the loss function \(L\) can be:
Sum of squares error (SSE): \[ L = \sum_{i=1}^n \sum_{k=1}^K [y_{ik} - f_k(x_i)]^2. \]
Cross-entropy (deviance): \[ L = - \sum_{i=1}^n \sum_{k=1}^K y_{ik} \log f_k(x_i). \]
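Both losses are easy to evaluate directly. The sketch below assumes `Y` is an \(n \times K\) array of targets (one-hot rows for classification) and `F` the matching array of fitted values \(f_k(x_i)\). The toy numbers and the small clipping constant `eps`, which guards against \(\log 0\), are assumptions added for the example.

```python
import numpy as np

def sse(Y, F):
    """Sum of squares error over all observations i and outputs k."""
    return np.sum((Y - F) ** 2)

def cross_entropy(Y, F, eps=1e-12):
    """Cross-entropy (deviance): -sum_i sum_k y_ik * log f_k(x_i).
    eps is an assumed guard so that log(0) is never evaluated."""
    return -np.sum(Y * np.log(np.clip(F, eps, 1.0)))

# Two observations, three classes; targets are one-hot rows.
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
F = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

loss_sse = sse(Y, F)            # 0.14 + 0.06 = 0.20
loss_ce = cross_entropy(Y, F)   # -(log 0.7 + log 0.8) = -log 0.56
```

With one-hot targets, only the fitted probability of the true class enters the cross-entropy, whereas the SSE penalizes every output unit.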