These notes are based on Dr. Hua Zhou's slides.
Fully connected networks don't scale well with the dimension of the input images. E.g., \(1000 \times 1000\) images have about \(10^6\) input units, and assuming you want to learn 1 million features (hidden units), you have about \(10^{12}\) parameters to learn!
In locally connected networks, each hidden unit only connects to a small contiguous region of pixels in the input, e.g., a patch of image or a time span of the input audio.
Convolutions. Natural images have the property of being stationary, meaning that the statistics of one part of the image are the same as any other part. This suggests that the features that we learn at one part of the image can also be applied to other parts of the image, and we can use the same features at all locations by weight sharing.
Consider \(96 \times 96\) images. For each feature, first learn an \(8 \times 8\) feature detector (or filter or kernel) from (possibly randomly sampled) \(8 \times 8\) patches from the larger image. Then apply the learned detector to all \(8 \times 8\) regions of the \(96 \times 96\) image to obtain one \(89 \times 89\) convolved feature map (since \(96 - 8 + 1 = 89\)) for that feature.
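To make the arithmetic concrete, here is a minimal base-R sketch of a "valid" convolution (strictly speaking a cross-correlation, as in most deep learning libraries); the image and filter values are random placeholders rather than learned features:

```r
# Sliding an 8 x 8 filter over a 96 x 96 image yields a
# (96 - 8 + 1) x (96 - 8 + 1) = 89 x 89 convolved feature map.
img  <- matrix(rnorm(96 * 96), nrow = 96)   # placeholder image
filt <- matrix(rnorm(8 * 8), nrow = 8)      # placeholder 8 x 8 feature detector

conv2d_valid <- function(img, filt) {
  k <- nrow(filt)
  out_dim <- nrow(img) - k + 1
  out <- matrix(0, out_dim, out_dim)
  for (i in seq_len(out_dim)) {
    for (j in seq_len(out_dim)) {
      patch <- img[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- sum(patch * filt)  # same weights applied at every location
    }
  }
  out
}

feature_map <- conv2d_valid(img, filt)
dim(feature_map)  # 89 89
```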
From Wang and Raj (2017):
Pooling. For a neural network with 100 hidden units, we have \(89^2 \times 100 = 792,100\) convolved features. This can be reduced by calculating the mean (or max) value of a particular feature over a region of the image. These summary statistics are much lower in dimension (compared to using all of the extracted features) and can also improve results (less over-fitting). We call this aggregation operation pooling, or sometimes mean pooling or max pooling (depending on the pooling operation applied).
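Continuing the sketch above, mean pooling averages each non-overlapping block of the convolved feature map; the block size \(p = 8\) below is an arbitrary illustrative choice (the map is first cropped to a multiple of \(p\)):

```r
# Mean pooling: average each non-overlapping p x p block of the feature map.
mean_pool <- function(fmap, p) {
  d <- (nrow(fmap) %/% p) * p            # crop to a multiple of p
  fmap <- fmap[1:d, 1:d]
  n <- d %/% p
  pooled <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      rows <- ((i - 1) * p + 1):(i * p)
      cols <- ((j - 1) * p + 1):(j * p)
      pooled[i, j] <- mean(fmap[rows, cols])   # use max() for max pooling
    }
  }
  pooled
}

pooled <- mean_pool(feature_map, p = 8)
dim(pooled)  # 11 11: each 89 x 89 = 7921-value map is summarized by 121 numbers
```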
Convolutional neural network (CNN). Convolution + pooling + multi-layer neural networks.
Input: 256 pixel values from \(16 \times 16\) grayscale images. Output: one of the 10 digit classes 0, 1, …, 9 (10-class classification).
A modest experiment subset: 320 training digits and 160 testing digits.
net-1: no hidden layer, equivalent to multinomial logistic regression. Number of parameters is \((16 \times 16 + 1) \times 10 = 2570\).
net-2: one hidden layer, 12 hidden units fully connected. Number of parameters is \((16 \times 16 + 1) \times 12 + (12 + 1) \times 10 = 3214\).
net-3: two hidden layers, locally connected. Each unit of the first hidden layer takes input from a \(3 \times 3\) patch; neighboring patches overlap by one row or column. Each unit of the second hidden layer takes input from a \(5 \times 5\) patch; neighboring patches are two units apart. Number of parameters is \((3 \times 3 + 1) \times 64 + (5 \times 5 + 1) \times 16 + (16 + 1) \times 10 = 1226\).
net-4: two hidden layers, locally connected with weight sharing in the first hidden layer (two \(8 \times 8\) feature maps, each with its own shared \(3 \times 3\) kernel). Counting each shared kernel once and the biases as unshared gives roughly \((3 \times 3) \times 2 + 64 \times 2 + (5 \times 5 \times 2 + 1) \times 16 + (16 + 1) \times 10 = 1132\) parameters, in line with the 1131 weights reported in the table below; the exact figure depends on how shared biases are tallied.
net-5: two hidden layers, locally connected, with two levels of weight sharing (the result of many person-years of experimentation). The parameter counts for net-1 through net-3 are verified in the short R snippet below.
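A quick sanity check of the arithmetic (each unit has one bias, hence the "+ 1" terms; net-4 and net-5 are omitted because their counts depend on the weight-sharing bookkeeping):

```r
# Weight counts for the ZIP-code networks described above.
net1 <- (16 * 16 + 1) * 10                                    # 2570
net2 <- (16 * 16 + 1) * 12 + (12 + 1) * 10                    # 3214
net3 <- (3 * 3 + 1) * 64 + (5 * 5 + 1) * 16 + (16 + 1) * 10   # 1226
c(net1, net2, net3)
```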
Results (320 training cases, 160 test cases):
network | links | weights | accuracy |
---|---|---|---|
net 1 | 2570 | 2570 | 80.0% |
net 2 | 3214 | 3214 | 87.0% |
net 3 | 1226 | 1226 | 88.5% |
net 4 | 2266 | 1131 | 94.0% |
net 5 | 5194 | 1060 | 98.4% |
Net-5 and similar networks were state-of-the-art in the early 1990s.
On the larger benchmark dataset MNIST (60,000 training images, 10,000 testing images), the following error rates were reported:
Method | Error rate |
---|---|
tangent distance with 1-nearest neighbor classifier | 1.1% |
degree-9 polynomial SVM | 0.8% |
LeNet-5 | 0.8% |
boosted LeNet-4 | 0.7% |
Source: http://cs231n.github.io/convolutional-networks/
AlexNet: Krizhevsky, Sutskever, Hinton (2012)
ImageNet dataset. Classify 1.2 million high-resolution images (\(224 \times 224 \times 3\)) into 1000 classes.
A combination of techniques: GPU training, ReLU activations, dropout (rate 0.5), SGD with momentum 0.9, and an initial learning rate of 0.01 reduced by a factor of 10 whenever the validation accuracy plateaued.
5 convolutional layers, pooling interspersed, 3 fully connected layers. 60 million parameters, 650,000 neurons.
AlexNet was the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) classification benchmark in 2012.
It achieved 62.5% top-1 accuracy:
The 96 learned filters of the first convolutional layer:
Source: Architecture comparison of AlexNet, VGGNet, ResNet, Inception, DenseNet
MLP (multi-layer perceptron) and CNN (convolutional neural network) are examples of feedforward neural networks, in which connections between the units do not form a cycle.
MLPs and CNNs accept a fixed-size vector as input (e.g., an image) and produce a fixed-size vector as output (e.g., probabilities of different classes).
Recurrent neural networks (RNNs) instead have loops, which can be unrolled into a sequence of MLPs.
RNNs allow us to operate over sequences of vectors: sequences in the input, the output, or in the most general case both.
Applications of RNN:
Above: generated (fake) LaTeX on algebraic geometry; see http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
RNNs accept an input vector \(x\) and give you an output vector \(y\). Crucially, however, this output vector's contents are influenced not only by the input you just fed in, but also by the entire history of inputs you have fed in previously.
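A minimal base-R sketch of this idea, using a vanilla RNN cell with \(\tanh\) activation (the dimensions and random weights below are placeholders, not learned values):

```r
# One step of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
# The hidden state h carries information from the entire input history forward.
input_dim <- 3; hidden_dim <- 5
W_xh <- matrix(rnorm(hidden_dim * input_dim, sd = 0.1), hidden_dim, input_dim)
W_hh <- matrix(rnorm(hidden_dim * hidden_dim, sd = 0.1), hidden_dim, hidden_dim)
b_h  <- rep(0, hidden_dim)

rnn_step <- function(x_t, h_prev) {
  tanh(W_xh %*% x_t + W_hh %*% h_prev + b_h)
}

# Unrolling over a sequence of 4 input vectors:
h  <- matrix(0, hidden_dim, 1)
xs <- replicate(4, rnorm(input_dim), simplify = FALSE)
for (x_t in xs) h <- rnn_step(x_t, h)
h  # the final output depends on every input in the sequence
```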
Short-term dependencies: to predict the last word in “the clouds are in the sky”:
Long-term dependencies: to predict the last word in "I grew up in France… I speak fluent French":
Typical RNNs have trouble learning long-term dependencies.
Long Short-Term Memory networks (LSTM) are a special kind of RNN capable of learning long-term dependencies.
The cell state allows information to flow along it unchanged.
The gates give the ability to remove or add information to the cell state.
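For reference, the standard LSTM update can be written as follows, with \(\sigma\) the logistic sigmoid, \(\odot\) the elementwise product, and \(f_t\), \(i_t\), \(o_t\) the forget, input, and output gates:
\[
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), \quad
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C), \quad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \quad
h_t = o_t \odot \tanh(C_t).
\end{aligned}
\]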
The coolest idea in deep learning in the last 20 years.
- Yann LeCun on GANs.
Applications:
AI-generated celebrity photos: https://www.youtube.com/watch?v=G06dEcZ-QTg
Self play
Value function of GAN \[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]. \]
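For a fixed generator \(G\) (with induced sample distribution \(p_g\)), the inner maximization has a closed-form solution,
\[
D_G^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},
\]
and substituting it back shows that the generator's objective is minimized exactly when \(p_g = p_{\text{data}}\) (Goodfellow et al., 2014).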
Training GAN
High-level software focuses on user-friendly interface to specify and train models.
Keras, PyTorch (only Linux and MacOS), scikit-learn, …
Lower-level software focuses on developer tools for implementing deep learning models.
TensorFlow, Theano, CNTK, Caffe, Torch, …
Most tools are developed in Python plus a low-level language.
TensorFlow was developed by the Google Brain team for internal Google use. Its predecessor was DistBelief.
Open sourced in Nov 2015.
OS: Linux, MacOS, and Windows (since Nov 2016).
GPU support: NVIDIA CUDA.
TPU (tensor processing unit), built specifically for machine learning and tailored for TensorFlow.
Mobile device deployment: TensorFlow Lite (May 2017) for Android and iOS.
When you have a hammer, everything looks like a nail.
R users can access Keras and TensorFlow via the `keras` and `tensorflow` packages.
#install.packages("keras")
library(keras)
install_keras()
# install_keras(tensorflow = "gpu") # if NVIDIA GPU is available
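As a next step after installation, here is a hedged sketch of a small convolution + pooling + fully connected network for \(28 \times 28\) grayscale digits, defined with the R `keras` package. The architecture and hyperparameters are illustrative choices (not AlexNet or LeNet), but they use the ReLU, dropout-0.5, and SGD-with-momentum ingredients discussed above; argument names such as `learning_rate` may differ slightly across keras versions (older releases used `lr`).

```r
library(keras)

# A toy CNN: two convolution/pooling stages followed by a dense classifier.
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dropout(rate = 0.5) %>%                      # dropout as in AlexNet
  layer_dense(units = 10, activation = "softmax")    # 10 digit classes

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_sgd(learning_rate = 0.01, momentum = 0.9),
  metrics = "accuracy"
)

summary(model)
# model %>% fit(x_train, y_train, epochs = 10, validation_split = 0.2)
# (x_train, y_train are hypothetical arrays of images and one-hot labels.)
```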