Deep Learning

## Introduction

This module will introduce you to deep learning and deep neural networks using PyTorch, a Python-based open source deep learning package created by Facebook. We will use PyTorch to introduce both fully connected deep neural networks (FCNs) and convolutional deep neural networks (CNNs) for image classification tasks.

## Setup

Before you can use PyTorch, you must download and install its relevant modules in Anaconda. To do this, open an Anaconda command prompt and issue the command

    conda install pytorch torchvision -c pytorch

You can review the instructions on the PyTorch web site for more information, as well as commands you can issue to verify the install has been performed correctly.

## Deep Neural Networks

Deep neural networks (DNNs) are a relatively new area of focused research, built on the foundations of artificial neural networks (ANNs). ANNs and DNNs fall within the broad area of supervised machine learning algorithms. The original idea for a neural network was proposed by McCulloch and Pitts in 1943, based on a biological estimate of how neurons in the brain were hypothesized to function. The problem with the McCulloch-Pitts model was its inability to easily learn. To address this, Rosenblatt proposed the Perceptron, adding weights that allowed the ANN to "learn" by increasing and decreasing weights based on how well the current network categorized relative to a known, correct label (i.e., supervised learning).

In 1959, Widrow and Hoff developed MADALINE at Stanford University to remove echoes on phone lines. Interestingly, MADALINE is still in use. Careful analysis of MADALINE showed that it found a set of weights for a number of inputs, which is analogous to linear regression. In other words, ANNs at this point could not solve more complex, non-linear problems. Marvin Minsky and Seymour Papert at MIT proved this theoretically, effectively silencing research on neural networks for many years.

Moving into the 1980s, multilayer neural networks with hidden layers were developed. This was critical, since it provides one of the key advantages of a DNN: the ability to automatically create relevant features from the initial inputs. This is also where the terminology "deep" comes from. The intuition is that each hidden layer in a DNN uses results from the previous layer, allowing it to start with highly detailed features, then use those to proceed to identify more abstract elements. This led to another problem, however. Although it was understood how to train single-layer ANNs, it was not known how to adjust weights, biases, and activations on a multi-layer DNN.

To address this, the idea of backpropagation, which distributes error throughout the network, was proposed by Rumelhart, Hinton, and Williams in 1986. In simple terms, we are using calculus to assign some of the blame for error in the output layer to each neuron in the previous hidden layer, further propagating error at that hidden layer to its parent and so on. Initially, stochastic gradient descent was used to find an optimal set of weights to minimize error, although other approaches have now been proposed. Backpropagation is performed at the possible cost of significantly increased training times. Because of this, we now use clusters of graphics processing units (GPUs) which support 1000s of parallel operations to train a DNN.

DNNs have been used to solve previously intractable problems in a wide range of areas, including:

• image recognition,
• speech recognition,
• game playing,
• control systems,
• driver-less cars, and
• unmanned aerial systems

The basic structure of a DNN is made up of: an input layer; two or more hidden layers of neurons of some size; an output layer; edges connecting neurons between adjacent layers; activation values and biases at each neuron; weights on each edge.

This example shows an FCN with two hidden layers. More complicated DNNs like convolutional neural networks precede the FCN with a set of convolution operations to extract features which are selected, filtered, and then used as an input layer into a final FCN for classification. We will begin our exploration of DNNs with a PyTorch example that classifies images of handwritten numbers from 0 to 9 using a simple FCN.

### Neurons

Neural networks, including deep neural networks, are made up of specialized perceptrons called neurons. A perceptron is an element that receives binary input(s) $$\{ x_1, \ldots, x_n \}$$ and generates a binary output.

Weights $$\{ w_1, \ldots, w_n \}$$ are normally applied to the inputs to define their importance. If the sum of the inputs multiplied by their weights exceeds a threshold or bias $$b_i$$, the perceptron fires 1, otherwise it fires 0.

$\text{output} = \begin{cases} 0 \; \text{if} \; \sum_i{w_i\,x_i} \leq b_i \\ 1 \; \text{if} \; \sum_i{w_i\,x_i} > b_i \end{cases}$

We will normally simplify this notation by moving the bias inside the equation and using a dot product for the sum.

$\text{output} = \begin{cases} 0 \; \text{if} \; w \cdot x - b \leq 0 \\ 1 \; \text{otherwise} \end{cases}$

Bias $$b_i$$ is a measure of how easy or difficult it is to make a perceptron fire. The larger $$b_i$$, the more input or activation perceptron i needs to fire.
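The perceptron rule above can be sketched in a few lines of plain Python. The weights and threshold below are hypothetical values, chosen only so that the perceptron behaves like a logical AND; they are not taken from any dataset.

```python
def perceptron(x, w, b):
    """Fire 1 if the weighted sum of the inputs exceeds the threshold b, else 0."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if weighted_sum - b > 0 else 0

# Hypothetical weights and threshold that make the perceptron act as AND
w = [1, 1]
b = 1.5

outputs = [perceptron(x, w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```

Raising `b` to 2.5 means even the input (1, 1) no longer fires, which illustrates how a larger threshold makes the perceptron harder to activate.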

Although useful, perceptrons are inconvenient since both their inputs and outputs are binary. To address this, we normally convert perceptrons to neurons. Neurons accept continuous inputs and generate continuous outputs. To do this, neurons apply a function like sigmoid to produce smooth output over a fixed range. For example, "sigmoid" neurons accept continuous input and apply $$\sigma$$ to the weighted sum of inputs plus bias to produce continuous output on the range $$0, \ldots, 1$$. Note the change in sign convention: the bias is now added to the weighted sum rather than subtracted as a threshold, so a larger bias makes a sigmoid neuron easier, not harder, to fire.

\begin{align} z & = w \cdot x + b \\ \text{output} & = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + \text{exp}(-(w \cdot x + b))} \end{align}

Intuitively, if $$z=w \cdot x + b$$ is large and positive, $$e^{-z} \approx 0$$ so $$\sigma(z) \approx 1$$, and if $$z$$ is very negative, $$\sigma(z) \approx 0$$, just like a perceptron. The difference is that $$\sigma$$ is smooth over the range $$0, \ldots, 1$$ so a small change in weights or biases $$\Delta w_i$$ or $$\Delta b_i$$ will produce a small change in output, versus a perceptron's potential to flip its output from 0 to 1 or vice versa.

$\Delta \text{output} \approx \sum_i\frac{ \partial \, \text{output}}{\partial w_i} \Delta w_i + \frac{\partial \, \text{output}}{\partial b_i} \Delta b_i$

That is, $$\Delta$$output is linear in $$\Delta w$$ and $$\Delta b$$.
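As a rough numerical check of this linear approximation, the sketch below perturbs one weight and the bias of a single sigmoid neuron and compares the actual change in output against the first-order estimate. The inputs, weights, and perturbation sizes are hypothetical, and the estimate uses the standard identity $$\sigma^{\prime}(z) = \sigma(z)(1 - \sigma(z))$$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Sigmoid neuron: sigma(w . x + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical inputs, weights, and bias
x = [0.5, -1.0]
w = [0.8, 0.2]
b = 0.1

base = neuron(x, w, b)

# Small changes applied to the first weight and to the bias
dw0, db = 0.001, -0.002
actual = neuron(x, [w[0] + dw0, w[1]], b + db) - base

# First-order estimate: d(output)/d(w_0) = sigma'(z) * x_0, d(output)/d(b) = sigma'(z)
slope = base * (1.0 - base)
estimate = slope * x[0] * dw0 + slope * db

print(actual, estimate)
```

The two printed values should agree to several decimal places; the gap grows as the perturbations get larger, since the approximation only holds for small changes.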

### Handwritten Number Recognition

PyTorch's dataset repository includes the MNIST database of handwritten images, which includes 60,000 training examples and 10,000 test examples. It is a very common dataset to use to test learning or pattern recognition algorithms on real-world data, without the need to hand-label a large training dataset.

Normally, images are processed with CNNs. However, we will use a simple FCN to recognize the handwritten images. This is done by converting each handwritten digit image into a set of pixel values, then converting that into a one-dimensional vector. This 1D vector acts as input to one or more hidden layers, filters like ReLU (rectified linear unit), and finally to an output layer with ten possible classifications representing the ten digits 0–9. Here are two simple examples of the digits in the MNIST dataset.

In our example neural network to "learn" MNIST images we have a $28 \times 28 = 784$-element input vector of greyscale intensities, a 15-neuron hidden layer, and 10 output values representing a decision on which digit the input image represents. To train and validate the DNN model, MNIST provides 60,000 labeled training images and 10,000 labeled test images. We denote an input image as $x$ and its label as $y$. Any individual $x_i$ is a 784-length vector and the corresponding $y_i$ is a 10-length output vector with all 0s except for the target digit position, which is 1.

Our goal is to train a neural network so output approximates $y_i = y(x_i) \, \forall x_i$ in the training set. To do this, we define a cost function to measure accuracy or error for any given prediction.

$C(w,b) = \frac{1}{2n} \sum_x || y(x) - a ||^{2}$

where $w$, $b$, $y$, and $a$ are weights, biases, known output vectors (labels), and predicted output vectors (activations), $n$ is the number of training samples, and the sum runs over all training input samples $x$. You should recognize this as the simple quadratic function mean squared error (MSE), where $C(w,b) \rightarrow 0$ as we improve our ability to predict correct output.
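To make the label encoding and the cost concrete, here is a minimal sketch in plain Python. The helper names `one_hot` and `quadratic_cost` are illustrative, and the "predictions" are idealized 0/1 vectors rather than real network outputs.

```python
def one_hot(digit, num_classes=10):
    """10-length label vector: all 0s except a 1 at the target digit position."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]

def quadratic_cost(labels, activations):
    """C = 1/(2n) * sum over samples of ||y - a||^2."""
    n = len(labels)
    total = 0.0
    for y, a in zip(labels, activations):
        total += sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a))
    return total / (2 * n)

labels = [one_hot(3), one_hot(7)]
perfect = [one_hot(3), one_hot(7)]   # both predictions correct
wrong = [one_hot(3), one_hot(1)]     # second prediction picks the wrong digit

print(quadratic_cost(labels, perfect))  # 0.0
print(quadratic_cost(labels, wrong))    # 0.5
```

The second case costs 0.5 because the wrong prediction differs from its label in two positions, giving a squared distance of 2, which is then divided by $2n = 4$.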

Obviously, when we start we expect $C(w,b)$ to be large. To reduce it, we optimize $w$ and $b$ throughout the network using gradient descent. Ignoring for now the specifics of our neural net, how can we minimize $C(v)$ where $v$ are tunable parameters? For simplicity, assume $v=(v_1,v_2)$ is a two-dimensional parameter vector. Plotting $C(v)$ produces a valley-like image.

From any point on the valley, we want to move in a direction with the steepest slope towards the valley bottom (minimum $C$ value). What happens when we move a small amount $\Delta v_1$ in the $v_1$ direction? Or a small amount $\Delta v_2$ in the $v_2$ direction?

$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2$

We want $\Delta C$ to be negative. Defining $\Delta v = (\Delta v_1, \Delta v_2)^{T}$ and the gradient $\nabla C = ( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} )^{T}$, we have

$\Delta C \approx \nabla C \cdot \Delta v$

Now, if we pick $\Delta v = -\epsilon \nabla C$, where $\epsilon > 0$ is the learning rate and therefore $\Delta v$ is a small movement in the direction of steepest descent, then

$\Delta C \approx \nabla C \cdot (-\epsilon \nabla C) = -\epsilon || \nabla C ||^{2}$

Since $||\nabla C||^{2}$ is never negative, $\Delta C$ is never positive. Setting $v \rightarrow v^{\prime} = v - \epsilon \nabla C$ over and over, we will converge on a minimum $C$, as long as $\epsilon$ is not too big, causing us to "jump" back and forth over the minimum, or too small, causing convergence to be too expensive.

Moreover, if we expand $v$ into a vector of $m > 2$ variables $v=(v_1, \ldots, v_m)$, this approach continues to work.

\begin{align} \Delta C & \approx \nabla C \cdot \Delta v \\ \nabla C & = (\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}) \\ \Delta v & = -\epsilon \nabla C \\ v \rightarrow v^{\prime} & = v - \epsilon \nabla C \\ \end{align}

How do we extend this general gradient descent approach to our specific goal of optimizing $C$ in a neural network? We simply rewrite the above equations in terms of $w$ and $b$.

\begin{align} w_i \rightarrow w_i^{\prime} & = w_i - \epsilon \frac{\partial C}{\partial w_i} \\ b_i \rightarrow b_i^{\prime} & = b_i - \epsilon \frac{\partial C}{\partial b_i} \\ \end{align}

We can now walk backwards through the layers in the neural network, adjusting weights and biases for each neuron (backpropagation). To do this, recall $C = \frac{1}{n} \sum_x C_x$ and $C_x = \frac{||y(x) - a||^{2}}{2}$. To compute $\nabla C$ we compute $\nabla C_x \; \forall x$ and average the result.
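The update rule $v \rightarrow v - \epsilon \nabla C$ can be tried out on a toy valley. The sketch below assumes a hypothetical cost $C(v) = v_1^2 + 3 v_2^2$, whose gradient is $(2 v_1, 6 v_2)$; it is not the neural network cost, just a simple bowl with a known minimum at the origin.

```python
def cost(v):
    # Hypothetical valley-shaped cost with its minimum at (0, 0)
    return v[0] ** 2 + 3.0 * v[1] ** 2

def gradient(v):
    # Partial derivatives of the cost above with respect to v_1 and v_2
    return [2.0 * v[0], 6.0 * v[1]]

def descend(v, epsilon, steps):
    """Repeatedly apply v -> v - epsilon * grad C."""
    for _ in range(steps):
        g = gradient(v)
        v = [v_i - epsilon * g_i for v_i, g_i in zip(v, g)]
    return v

start = [2.0, -1.5]
good = descend(start, epsilon=0.1, steps=100)    # converges towards (0, 0)
too_big = descend(start, epsilon=0.4, steps=20)  # overshoots along v_2 and diverges

print(cost(start), cost(good), cost(too_big))
```

With $\epsilon = 0.1$ each step shrinks both coordinates, so the cost falls towards zero. With $\epsilon = 0.4$ the $v_2$ coordinate is multiplied by $1 - 0.4 \cdot 6 = -1.4$ at every step, so it flips sign and grows, which is the "jumping back and forth" behaviour described above.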

$\nabla C = \frac{1}{n} \sum_x \nabla C_x$

If the number of training examples $n$ is very large, computing $\nabla C_x$ for every example is expensive, so we may want to sample $\nabla C_x$ instead. This is called stochastic gradient descent, where we randomly choose $m \ll n$ training inputs to form a mini-batch $(x_1, \ldots, x_m)$, assuming

\begin{align} \sum_{j=1}^{m} \frac{\nabla C_{x_j}}{m} & \approx \sum_x \frac{\nabla C_x}{n} = \nabla C \\ \nabla C & \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{x_j} \\ w_i \rightarrow w_{i}^{\prime} & = w_i - \frac{\epsilon}{m} \sum_{j=1}^{m} \frac{\partial C_{x_j}}{\partial w_i} \\ b_i \rightarrow b_{i}^{\prime} & = b_i - \frac{\epsilon}{m} \sum_{j=1}^{m} \frac{\partial C_{x_j}}{\partial b_i} \\ \end{align}

We then pick another mini-batch from the remaining samples, repeat the above process and continue until the training set is exhausted. This is defined as one training epoch.
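The epoch and mini-batch loop can be sketched on a deliberately tiny problem. The example below is an assumption made for illustration: instead of a neural network it fits a single linear unit $a = wx + b$ to noiseless samples of $y = 2x + 1$, using the per-sample cost $C_x = (a - y)^2 / 2$, so that $\partial C_x / \partial w = (a - y)x$ and $\partial C_x / \partial b = (a - y)$.

```python
import random

random.seed(0)

# Hypothetical training set: noiseless samples of y = 2x + 1
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]

w, b = 0.0, 0.0
epsilon = 0.1   # learning rate
m = 3           # mini-batch size

def epoch_cost(w, b):
    """Quadratic cost averaged over the full training set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / (2 * len(data))

initial_cost = epoch_cost(w, b)

for epoch in range(200):
    # Shuffling then slicing picks each mini-batch from the remaining samples
    random.shuffle(data)
    for start in range(0, len(data), m):
        batch = data[start:start + m]
        # Average the per-sample gradients over the mini-batch
        grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
        w -= epsilon * grad_w
        b -= epsilon * grad_b

final_cost = epoch_cost(w, b)
print(w, b, final_cost)
```

Each pass of the outer loop is one epoch: every sample is used exactly once, in mini-batches of size $m$. The fitted `w` and `b` should end up close to 2 and 1.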

## Backpropagation

To begin, we review and simplify the feed-forward component of DNN training to use matrix notation.

• $w_{jk}^{\ell}$, weight from neuron $k$ in layer $(\ell-1)$ to neuron $j$ in layer $\ell$
• $b_{j}^{\ell}$, bias at neuron $j$ in layer $\ell$
• $a_{j}^{\ell}$, activation at neuron $j$ in layer $\ell$

$\therefore$ $a_{j}^{\ell} = \sigma(\sum_k w_{jk}^{\ell} a_{k}^{(\ell-1)} + b_{j}^{\ell})$.

To convert to matrix format, define a weight matrix $w^{\ell}$ for layer $\ell$ where $w^{\ell}$ are weights connecting to layer $\ell$'s neurons.

$w^{\ell} = \underbrace{ \begin{bmatrix} w^{\ell}[j,k] \\ = w_{jk}^{\ell} \end{bmatrix} }_{k\text{ input weights}} \;\;\Bigg\}\;{\scriptsize j\text{ neurons}}$

Similarly for layer $\ell$, $b^{\ell}$ is a bias vector with $b_{j}^{\ell}$ the bias for neuron $j$ in layer $\ell$. $a^{\ell}$ is an activation vector with $a_{j}^{\ell}$ the activation for neuron $j$ in layer $\ell$.

Finally, we use vectorization to apply the activation function to weights times previous layer activations plus biases.

$\sigma \left( \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} \right) = \begin{bmatrix} \sigma(x_1) \\ \vdots \\ \sigma(x_m) \end{bmatrix}$

Combining all of these, for layer $\ell$

$a^{\ell} = \sigma( w^{\ell} a^{(\ell-1)} + b^{\ell})$

The value $z^{\ell} = w^{\ell} a^{(\ell-1)}+b^{\ell}$ is important, so we explicitly extract it and define it as the weighted input to neurons in layer $\ell$. We can now write

\begin{align} a^{\ell} & = \sigma(z^{\ell})\\ z_{j}^{\ell} & = \sum_k w_{jk}^{\ell} a_{k}^{(\ell-1)} + b_{j}^{\ell} \end{align}
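The matrix form of one feed-forward step can be sketched with nested Python lists standing in for $w^{\ell}$, $b^{\ell}$, and $a^{(\ell-1)}$. The layer sizes and values are hypothetical; row $j$ of `w` holds the input weights of neuron $j$, matching the $w_{jk}^{\ell}$ indexing above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(w, b, a_prev):
    """Return (z, a) for one layer: z = w a_prev + b, a = sigma(z) elementwise."""
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
         for row, b_j in zip(w, b)]
    a = [sigmoid(z_j) for z_j in z]
    return z, a

# Hypothetical layer with 2 neurons (rows j) and 3 inputs (columns k)
w = [[0.2, -0.5, 0.1],
     [0.4, 0.3, -0.2]]
b = [0.1, -0.1]
a_prev = [1.0, 0.0, 2.0]

z, a = layer_forward(w, b, a_prev)
print(z)  # approximately [0.5, -0.1]
print(a)
```

Keeping `z` alongside `a` mirrors the text: the weighted input is needed again later when computing $\sigma^{\prime}(z^{\ell})$ during backpropagation.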

### Cost Function Assumptions

The goal of backpropagation is to compute $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ for cost function $C$, w.r.t. weights $w$ and biases $b$ in the network. To do this, we make two assumptions about $C$, listed below. You can assume our $C$ is MSE, $C = \frac{1}{2n} \sum_x || y(x) - a^{L}(x)||^2$, where

• $n$, total number of training examples;
• $\sum_x$, sum over individual examples $x$;
• $y=y(x)$, expected output;
• $L$, number of layers in network, and
• $a^{L}=a^{L}(x)$, vector of activation outputs for input $x$.

The two assumptions are:

1. Cost function $C$ can be written as an average $C=\frac{1}{n} \sum_x C_{x}$ over individual cost functions $C_x$ for individual training examples $x$. This is required because backpropagation computes individual $\frac{\partial C_x}{\partial w}$ and $\frac{\partial C_x}{\partial b}$, then recovers $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ by averaging the individual partial derivatives.
2. Cost function $C$ can be written as a function of outputs from the neural network.

### Elementwise Product (Hadamard or Schur Product)

For two vectors $s$ and $t$ of the same length, the elementwise product $s \odot t$ has components $(s \odot t)_j = s_{j} t_{j}$, e.g.,

$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 3 \\ 2 \cdot 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \end{bmatrix}$

This is known as the Hadamard or Schur product.
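In plain Python the elementwise product is a one-line comprehension. The function name `hadamard` is just an illustrative choice.

```python
def hadamard(s, t):
    """Elementwise (Hadamard) product of two equal-length vectors."""
    if len(s) != len(t):
        raise ValueError("vectors must have the same length")
    return [s_j * t_j for s_j, t_j in zip(s, t)]

print(hadamard([1, 2], [3, 4]))  # [3, 8]
```

This reproduces the worked example above. Note that it differs from the dot product, which would sum the two entries to give 11.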

### Four Fundamental Backpropagation Equations

At a basic level, backpropagation explains how changing $w$ and $b$ in a network changes $C$. Ultimately, this requires computing $\frac{\partial C}{\partial w_{jk}^{\ell}}$ and $\frac{\partial C}{\partial b_{j}^{\ell}}$. To do this, we introduce an intermediate quantity $\delta_{j}^{\ell}$, the error in neuron $j$ in layer $\ell$. Backpropagation computes $\delta_{j}^{\ell}$, then relates it to $\frac{\partial C}{\partial w_{jk}^{\ell}}$ and $\frac{\partial C}{\partial b_{j}^{\ell}}$.

Suppose, at some neuron, we modify $z_{j}^{\ell}$ by adding a small change $\Delta z_{j}^{\ell}$. Now, the neuron outputs $\sigma( z_{j}^{\ell} + \Delta z_{j}^{\ell} )$, propagating through follow-on layers in the network and changing the overall cost $C$ by $\frac{\partial C}{\partial z_{j}^{\ell}} \Delta z_{j}^{\ell}$. If $\frac{\partial C}{\partial z_{j}^{\ell}}$ is large in magnitude, we can improve $C$ by choosing $\Delta z_{j}^{\ell}$ with a sign opposite to $\frac{\partial C}{\partial z_{j}^{\ell}}$. If $\frac{\partial C}{\partial z_{j}^{\ell}}$ is close to zero, though, $\Delta z_{j}^{\ell}$ will have little impact on $C$. Intuitively, $\frac{\partial C}{\partial z_{j}^{\ell}}$ is a measure of the amount of error in a neuron.

Given this, we define error $\delta_{j}^{\ell}$ of neuron $j$ in layer $\ell$ as

$\delta_{j}^{\ell} = \frac{\partial C}{\partial z_{j}^{\ell}}$

As before, $\delta^{\ell}$ is a vector of errors for neurons in layer $\ell$. Backpropagation computes $\delta^{\ell}$ for every layer, then relates $\delta^{\ell}$ to the quantities of real interest, $\frac{\partial C}{\partial w_{jk}^{\ell}}$ and $\frac{\partial C}{\partial b_{j}^{\ell}}$.

Eq 1: Computing $\delta^{L}$ error in the output layer.

Components of $\delta^{L}$ are

$\delta_{j}^{L} = \frac{\partial C}{\partial a_{j}^{L}} \sigma^{\prime}(z_{j}^{L})$

$\frac{\partial C}{\partial a_j^L}$ measures how fast cost is changing as a function of output activation $j$. $\sigma^{\prime}(z_j^L)$ measures how fast activation function $\sigma$ is changing at $z_j^{L}$. $z_j^{L}$ is computed during feed-forward, so $\sigma^{\prime}(z_j^L)$ is easily obtained. $\frac{\partial C}{\partial a_j^L}$ depends on $C$, but for our MSE cost function $\frac{\partial C}{\partial a_j^L} = (a_j^{L} - y_j)$.

To extend $\delta_j^L$ to matrix-based form

$\delta^L = \nabla_a C \odot \sigma^{\prime}(z^L)$

where $\nabla_a C$ is a vector whose components are $\frac{\partial C}{\partial a_j^L}$. $\nabla_a C$ expresses the rate of change of $C$ w.r.t. output activations. In our example $\nabla_a C = (a^L - y)$, so the full matrix form of Eq. 1 is

$\delta^L = (a^L - y) \odot \sigma^{\prime}(z^L)$

Eq 2: Computing $\delta^{\ell}$ from $\delta^{\ell+1}$

Given $\delta^{\ell+1}$

$\delta^{\ell} = ((w^{\ell+1})^{T} \delta^{\ell+1}) \odot \sigma^{\prime}(z^{\ell})$

where $(w^{\ell+1})^{T}$ is the transpose of the weight matrix $w^{\ell+1}$ for layer $\ell+1$.

Suppose we know the error $\delta^{\ell+1}$ at layer $\ell+1$. Applying $(w^{\ell+1})^{T}$ moves error backwards, giving us some measure of error in layer $\ell$. Applying the Hadamard product $\odot \sigma^{\prime}(z^{\ell})$ pushes error backwards through the activation function in layer $\ell$, giving us backpropagated error $\delta^{\ell}$ in weighted input to layer $\ell$.

We can use this to compute error $\delta^{\ell}$ for any layer $\ell$ by starting with $\delta^{L}$ (Eq. 1), using it to calculate $\delta^{L-1}$, then $\delta^{L-2}$ and so on until $L-i=\ell$.

Eq 3: Rate of change of $C$ w.r.t. bias

It turns out that $\frac{\partial C}{\partial b_j^{\ell}} = \delta_j^{\ell}$. That is, error $\delta_j^{\ell}$ is exactly equal to the rate of change of the cost with respect to the bias, $\frac{\partial C}{\partial b_j^{\ell}}$. Since we already know how to compute $\delta_j^{\ell}$, we can rewrite this as

$\frac{\partial C}{\partial b} = \delta$

where $\delta$ is being evaluated at the same neuron as bias $b$.

Eq 4: Rate of change of $C$ w.r.t. weight

Here

$\frac{\partial C}{\partial w_{jk}^{\ell}} = a_k^{\ell-1} \delta_j^{\ell}$

so partial derivative $\frac{\partial C}{\partial w_{jk}^{\ell}}$ depends on $\delta^{\ell}$ and $a^{\ell-1}$, which we already know how to compute. Our equation can be rewritten as

$\frac{\partial C}{\partial w} = a_\text{in} \delta_\text{out}$

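where $a_\text{in}$ is the activation of the neuron feeding into weight $w$ and $\delta_\text{out}$ is the error of the neuron that weight $w$ feeds. Examining just weight $w$ and the two neurons connected by that weight, the gradient is the product of the activation entering the weight and the error leaving it, so it is small whenever $a_\text{in}$ is close to zero.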

### Backpropagation Algorithm

Combining all four equations, we obtain

1. Input $x$: set the corresponding activation $a^1$ for the input layer.
2. Feed-forward: for each $\ell = 2, 3, \ldots, L$ compute $z^{\ell} = w^{\ell} a^{\ell-1} + b^{\ell}$ and $a^{\ell} = \sigma(z^{\ell})$.
3. Output error $\delta^L$: compute the vector $\delta^L = \nabla_a C \odot \sigma^{\prime}(z^L)$.
4. Backpropagate error: for each $\ell = L-1, L-2, \ldots, 2$ compute $\delta^{\ell} = ((w^{\ell+1})^{T} \delta^{\ell+1}) \odot \sigma^{\prime}(z^{\ell})$.
5. Result: the gradient of the cost function is given by
$\frac{\partial C}{\partial w_{jk}^{\ell}} = a_k^{\ell-1} \delta_j^{\ell}, \;\;\; \frac{\partial C}{\partial b_j^{\ell}} = \delta_j^{\ell}$
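The five steps can be sketched in plain Python for a single training example with the quadratic cost $C_x = \frac{1}{2}||y - a^L||^2$. This is a minimal illustration under stated assumptions, not an optimized implementation: the 2-3-2 network, its weights, and the input are hypothetical, and lists are 0-indexed, so `weights[l]` maps `activations[l]` to `zs[l]`. A finite-difference check on one weight is included as a sanity test.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def feed_forward(weights, biases, x):
    """Steps 1-2: return weighted inputs z and activations a for every layer."""
    a, zs, activations = x, [], [x]
    for w, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j
             for row, b_j in zip(w, b)]
        a = [sigmoid(z_j) for z_j in z]
        zs.append(z)
        activations.append(a)
    return zs, activations

def cost(weights, biases, x, y):
    a_out = feed_forward(weights, biases, x)[1][-1]
    return 0.5 * sum((a_j - y_j) ** 2 for a_j, y_j in zip(a_out, y))

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) of the single-example cost C_x."""
    zs, activations = feed_forward(weights, biases, x)
    # Step 3 (Eq. 1): output error, (a^L - y) times sigma'(z^L) elementwise
    delta = [(a_j - y_j) * sigmoid_prime(z_j)
             for a_j, y_j, z_j in zip(activations[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        # Step 5 (Eq. 3 and Eq. 4): gradients for this layer
        grad_b[l] = delta
        grad_w[l] = [[d_j * a_k for a_k in activations[l]] for d_j in delta]
        if l > 0:
            # Step 4 (Eq. 2): push the error back one layer
            w = weights[l]
            delta = [sum(w[j][k] * delta[j] for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k])
                     for k in range(len(zs[l - 1]))]
    return grad_w, grad_b

# Hypothetical 2-3-2 network and training example
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
           [[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]]]
biases = [[0.0, 0.1, -0.1], [0.2, -0.2]]
x, y = [0.5, -1.0], [1.0, 0.0]

grad_w, grad_b = backprop(weights, biases, x, y)

# Central finite-difference estimate of dC/dw for one first-layer weight
h = 1e-6
weights[0][1][0] += h
c_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * h
c_minus = cost(weights, biases, x, y)
weights[0][1][0] += h
numeric = (c_plus - c_minus) / (2 * h)
print(grad_w[0][1][0], numeric)
```

The two printed numbers should agree closely, which is a common way to check a hand-written backpropagation routine before trusting it for training.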

### Full Training Algorithm

1. Input a mini-batch of $m$ training examples.
2. For each training example $x$
• set input activation $a^{x,1}$,
• feed-forward, for each $\ell=2, 3, \ldots, L$ compute $z^{x,\ell} = w^{\ell}a^{x,\ell-1} + b^{\ell}$ and $a^{x,\ell} = \sigma(z^{x,\ell})$,
• output error, compute $\delta^{x,L} = \nabla_a C_x \odot \sigma^{\prime}(z^{x,L})$,
• backpropagate error for each $\ell = L-1, L-2, \ldots, 2$ by computing $\delta^{x,\ell} = ((w^{\ell+1})^{T} \delta^{x,\ell+1}) \odot \sigma^{\prime}(z^{x,\ell})$
3. Gradient descent, for each $\ell = L, L-1, \ldots, 2$ update weights and biases using learning rate $\epsilon$ according to the rules
\begin{align} w^{\ell} & \rightarrow w^{\ell} - \frac{\epsilon}{m} \sum_x \delta^{x,\ell} (a^{x,\ell-1})^{T}\\ b^{\ell} & \rightarrow b^{\ell} - \frac{\epsilon}{m} \sum_x \delta^{x,\ell} \end{align}
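The training loop can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: a 2-2-1 sigmoid network, the logical OR function as a four-example training set, mini-batches of size 2, and a learning rate of 1.0. The sketch uses $\sigma^{\prime}(z) = a(1 - a)$ so that only activations need to be stored.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(weights, biases, x):
    """Return the activations of every layer, starting with the input."""
    acts = [x]
    for w, b in zip(weights, biases):
        acts.append([sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(row, acts[-1])) + b_j)
                     for row, b_j in zip(w, b)])
    return acts

def backprop(weights, biases, x, y):
    """Per-example gradients of the quadratic cost."""
    acts = feed_forward(weights, biases, x)
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    grad_w, grad_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grad_b.insert(0, delta)
        grad_w.insert(0, [[d_j * a_k for a_k in acts[l]] for d_j in delta])
        if l > 0:
            delta = [sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                     * acts[l][k] * (1.0 - acts[l][k])
                     for k in range(len(acts[l]))]
    return grad_w, grad_b

def update_mini_batch(weights, biases, batch, epsilon):
    """One gradient descent step using gradients averaged over the mini-batch."""
    m = len(batch)
    grads = [backprop(weights, biases, x, y) for x, y in batch]
    for l in range(len(weights)):
        for j in range(len(biases[l])):
            biases[l][j] -= epsilon / m * sum(g_b[l][j] for _, g_b in grads)
            for k in range(len(weights[l][j])):
                weights[l][j][k] -= epsilon / m * sum(g_w[l][j][k] for g_w, _ in grads)

def total_cost(weights, biases, data):
    total = 0.0
    for x, y in data:
        a_out = feed_forward(weights, biases, x)[-1]
        total += sum((a - t) ** 2 for a, t in zip(a_out, y))
    return total / (2 * len(data))

random.seed(1)
sizes = [2, 2, 1]   # hypothetical layer sizes
weights = [[[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[random.gauss(0, 1) for _ in range(n_out)] for n_out in sizes[1:]]

# Hypothetical training set: logical OR
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

cost_before = total_cost(weights, biases, data)
for epoch in range(2000):
    random.shuffle(data)
    for start in range(0, len(data), 2):
        update_mini_batch(weights, biases, data[start:start + 2], epsilon=1.0)
cost_after = total_cost(weights, biases, data)
print(cost_before, cost_after)
```

All gradients for a mini-batch are computed before any weight changes, so every example in the batch is evaluated against the same network, as the update rule requires.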

## PyTorch

PyTorch is, at its core, a tensor computing library with GPU acceleration support and a deep neural network library. PyTorch began as an internship project by Adam Paszke (now a Senior Research Scientist at Google) in October 2016. It was based on Torch, an open-source machine learning library and scientific computing framework written in the Lua programming language.

Three additional authors (Sam Gross, Soumith Chintala, and Gregory Chanan) formed the original author list. PyTorch was released as an open source machine learning library using the BSD open source license. Facebook developed both PyTorch and the Convolutional Architecture for Fast Feature Embedding (Caffe2), which was merged into PyTorch in 2018. PyTorch is one of the standard libraries for deep neural network research and implementation.

### Tensors

A basic data structure for holding data in PyTorch is a tensor. An m × n tensor is a multidimensional data structure with m rows and n columns, very similar to a NumPy ndarray. One critical difference between tensors and NumPy arrays is that tensors can be moved to the GPU for rapid processing. Below is a very simple example of creating a 2 × 2 PyTorch tensor.

    import torch

    data = [ [1,2], [3,4] ]
    t = torch.tensor( data )
    print( t )
    # tensor([[1, 2],
    #         [3, 4]])

If we wanted to move the tensor to the GPU, we would first check to ensure GPU processing is available, then use the to() method to transfer the tensor from CPU memory to GPU memory.

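    if torch.cuda.is_available():
        device = torch.device( 'cuda' )
    else:
        device = torch.device( 'cpu' )

    # to() returns a tensor on the target device rather than moving t
    # in place, so the result must be assigned back
    t = t.to( device )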

One important caveat is that every tensor taking part in an operation must reside on the same device. You cannot, for example, multiply a tensor stored on the CPU with one stored on the GPU. So, if you move your tensors to the GPU, you must also move your DNN models to the GPU and ensure all operations are performed on the GPU.

This is a Jupyter notebook that implements a DNN in the simplest possible manner. It is a good starting point for understanding what tensors are and how they can be created and manipulated. The DNN itself trains on six examples of three iris types (setosa, versicolor, virginica) using sepal and petal length and width.

## FCN Exercise 1: Wheat Seed Classification

As an initial example, we will run a simple wheat seed classifier. The dataset contains properties of three types of wheat seeds. The goal is to use these properties to predict the type of wheat the seed represents. We will first demonstrate a "from scratch" DNN that uses a simple single-layer neural network to train, then predict wheat seeds.

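| Province | Area | Perimeter | Compactness | Kernel Length | Kernel Width | Asymmetry | Groove Length | Type |
|---|---|---|---|---|---|---|---|---|
| Ontario | 15.26 | 14.84 | 0.871 | 5.763 | 3.312 | 2.221 | 5.22 | 1 |
| Manitoba | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 21.018 | 4.956 | 1 |
| Nova Scotia | 19.13 | 16.31 | 0.9035 | 6.183 | 3.902 | 2.109 | 5.924 | 2 |
| $\cdots$ | | | | | | | | |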

The following Python-only DNN uses a simple single-layer 10-neuron DNN (ANN, actually) with a learning rate $\epsilon=0.1$, a learning rate decay of 0.01 per epoch, 1000 training iterations per epoch, a single epoch, and five-fold cross validation during testing.

Next, we'll process the same dataset, but instead of doing it in raw Python, we'll use PyTorch, Facebook's Python-based DNN library. This will demonstrate how much easier it is to use PyTorch to build a significantly more complicated DNN to train, then predict wheat types based on wheat seed properties.

The purpose of these examples is, first, to show you how to code your own DNN using basic Python, then how to use one of the most popular Python libraries (TensorFlow, whose core is written in C++, is the other candidate) to perform the same computation in a manner that is both simpler to program and more sophisticated.

## FCN Exercise 2: Handwritten Number Recognition

PyTorch's dataset repository includes the MNIST database of handwritten images, which includes 60,000 training examples and 10,000 test examples. It is a very common dataset to use to test learning or pattern recognition algorithms on real-world data, without the need to hand-label a large training dataset.

Normally, images are processed with CNNs. However, we will use a simple FCN to recognize the handwritten images. This is done by converting each handwritten digit image into a set of pixel values, then converting that into a one-dimensional vector. This 1D vector acts as input to a single hidden layer, an ReLU (rectified linear unit) filter, and an output layer with ten possible classifications representing the ten digits 0–9. Here are two simple examples of the digits in the MNIST dataset.

Here is a Jupyter Notebook we will use to load the MNIST dataset, construct a simple FCN, then train and test it on the MNIST data.

Even with a simple FCN with a single hidden layer and an ReLU filter, five training epochs (evaluating the training dataset five times) produces results of 97% or better on the test dataset.

## FCN Exercise 3: Image Processing

To see how well an FCN works on real two-dimensional images, we will work with the CIFAR10 dataset, also a part of PyTorch's dataset repository. CIFAR10 contains 50,000 training images and 10,000 test images of size $32 \times 32 \times 3$: 32 pixels wide by 32 pixels high by 3 pixel components R (red), G (green), and B (blue).

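    # Module aliases assumed from the names used below
    import torchvision.datasets as dsets
    import torchvision.transforms as xforms

    # Load the CIFAR10 training and test sets as tensors
    train_dataset = dsets.CIFAR10(
        root='./data', train=True, transform=xforms.ToTensor(), download=True )
    test_dataset = dsets.CIFAR10(
        root='./data', train=False, transform=xforms.ToTensor() )

    # Text names for class labels 0 through 9
    classes = ( 'plane','car','bird','cat','deer','dog','frog','horse','ship','truck' )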

Note the classes variable. This is used to convert the label value for an image into a semantic text description of the class it belongs to. You will probably want to define this and index into it to better understand which classes the images belong to.

If you want to examine some of the images in the CIFAR10 dataset, you can modify the code in our original FCN example. However, a simpler way to do this and show more images would be as follows.

    import matplotlib.pyplot as plt

    # Let's look at four of the images in the training dataset, and the
    # corresponding label (which defines the object the image represents)

    # Grab four images and corresponding objects they represent
    # from the training set
    img = []
    val = []
    for i in range( 4 ):
        im, v = train_dataset[ 20 * i ]
        # ToTensor() already scales pixels to [0, 1], so no un-normalization
        # is needed; reorder from (channel, height, width) to
        # (height, width, channel) for imshow()
        img.append( im.numpy().transpose( 1, 2, 0 ) )
        val.append( v )

    # Print out what objects the four images are meant to represent
    print( ' '.join( classes[ val[ i ] ] for i in range( 4 ) ) )

    # Create a single row of four images
    fig, axes = plt.subplots( 1, 4, figsize=( 12, 2.5 ) )
    for i in range( 4 ):
        axes[ i ].imshow( img[ i ] )
    plt.show()

Apart from changing any occurrences of 28 * 28 to 32 * 32 * 3 (to account for the different input size), the remainder of the code can be identical to the MNIST FCN. You're certainly encouraged to also vary things like criteria, optimizer, hidden_size, the number of hidden layers, and other properties of the FCN to try to improve performance. Remember that, for ten classes, just like the MNIST dataset, chance is 10%. You're unlikely to obtain accuracies anywhere near the 97% we produce for the MNIST images, but you should be able to do much better than 10%, even for this most simple FCN.

Here is a simple Jupyter Notebook that trains on the CIFAR10 dataset using our simple FCN.

## CNN Exercise 4: Image Processing

The wheat dataset is fairly easy to process, but it is probably more representative of the type of task you need to solve: classification from one or more input properties. DNNs themselves have been applied most successfully to image data using convolutional neural networks (CNNs). A CNN feeds the $nm$ pixels of an $n \times m$ image into a convolution stage. Here, a collection of $k$ kernels is convolved against the pixels and their immediate neighbours to produce scalar values. For each kernel, a column of $nm$ convolved values is created, one for each pixel in the image. In the simplest CNN, the column values are evaluated to produce a single, representative value. For example, max scans a column and extracts the largest value. These values form a $k$-length input vector to a follow-on fully connected network. This network processes output from the CNN to generate a final classification prediction.

## Convolutional Neural Networks

Convolutional neural networks (CNNs) extend FCNs by preceding the fully connected layers with a set of convolutional layers. CNNs are commonly used to analyze images, although recent research has shown that they can handle other data modalities like text with excellent performance.

CNNs are made up of a number of standard operations to produce new, hidden layers. These include

• Convolution (CONV): Application of a convolve operation across a layer, producing a new layer with the convolved results
• Filter (RELU): Application of a filter like ReLU to values in a layer, producing a new layer with filtered results
• Pooling (POOL): Application of a pooling (aggregation) operator to values in a layer, producing a new, smaller layer with pooled results

In practice, it is common to apply a series of CONV-RELU layers, follow them with a POOL layer, and repeat this pattern until an image has been processed to a small size. At this point, results are formed into a 1D vector which acts as input to an FCN. The FCN produces a set of probabilities for each possible classification (softmax).

### Convolution

Convolution combines a pixel in an image and its neighbours by placing a kernel of a given size centered over the pixel, then multiplying the corresponding pixel and kernel values, and summing them together to produce a final filtered result.

The example above uses a 3×3 kernel with values -1, 0, and 5 at various positions within its nine cells. When we center it over the pixel at the center of the purple box, multiply, and sum, we obtain a final filtered value of 210 for that position in the image. Kernels are designed to identify specific properties of an image, producing large values when those properties are located, and small values when they are not. For example, the Kirsch filter is designed to identify edges in an image. Convolution of a simple animation with a Kirsch filter produces the result shown below.

$\begin{bmatrix} 5 & 5 & 5 \\ -3 & 0 & -3 \\ -3 & -3 & -3 \end{bmatrix}$ (Kirsch kernel)

Normally, we use numerous kernels to identify different properties or features in an image. Each kernel produces a feature map from the image. These feature maps are normally stacked one on top of another, producing a result with width and height slightly smaller than the original image size (a $k \times k$ kernel applied without padding trims $k-1$ pixels from each dimension, so two pixels for a 3×3 kernel), and depth equal to the number of kernels applied to the image. How does the CNN decide on the kernel values for each kernel? This occurs during training. Kernel values are initially random, and slowly converge along with edge weights during backpropagation to identify image properties salient to classifying the images.
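The multiply-and-sum operation can be sketched in plain Python. The 5×5 image below is a hypothetical example with a bright top half and a dark bottom half; the kernel is the Kirsch kernel shown above. As in the description above, the kernel is applied without flipping, with no padding and a stride of one.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1); multiply and sum."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]

kirsch = [[5, 5, 5],
          [-3, 0, -3],
          [-3, -3, -3]]

# Hypothetical 5x5 greyscale image: bright top half, dark bottom half
image = [[9, 9, 9, 9, 9],
         [9, 9, 9, 9, 9],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]

feature_map = convolve2d(image, kirsch)
for row in feature_map:
    print(row)
# [81, 81, 81]
# [135, 135, 135]
# [0, 0, 0]
```

The 5×5 input yields a 3×3 feature map, and the largest response (135) sits on the row where bright pixels lie directly above dark ones, which is the horizontal edge this kernel orientation is tuned to detect.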

### Filtering

Filtering adjusts values in a feature map, for example, by normalizing them, or by removing negative values. The common ReLU filter, for example, removes negative values and retains positive values.

Figure panels: Original, Filtered, Filtered w/ReLU
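A minimal sketch of the ReLU filter in plain Python, applied to a made-up feature map (the values are illustrative only):

```python
def relu(feature_map):
    """Replace every negative value with 0 and keep non-negative values unchanged."""
    return [[max(0, v) for v in row] for row in feature_map]

# Hypothetical 2x3 feature map containing negative values
filtered = relu([[-4, 2, 0], [7, -1, 3]])
print(filtered)   # [[0, 2, 0], [7, 0, 3]]
```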

### Pooling

Pooling takes a feature map, and reduces its size by aggregating blocks of values. Aggregation can use any common mathematical operator like average, median, maximum or minimum. Pooling uses a window size which defines its width and height, and a stride which defines the step size it uses as it slides over the feature map values it pools.

Figure: max pooling with a 2×2 window and a stride of 2
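The window and stride can be sketched in plain Python as follows. The function name max_pool and the 4×4 feature map are assumptions for illustration, and no padding is applied.

```python
def max_pool(feature_map, window=2, stride=2):
    """Slide a window x window block over the map in steps of stride, keeping each block's maximum."""
    rows = (len(feature_map) - window) // stride + 1
    cols = (len(feature_map[0]) - window) // stride + 1
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            block = [feature_map[r * stride + dr][c * stride + dc]
                     for dr in range(window) for dc in range(window)]
            pooled_row.append(max(block))
        pooled.append(pooled_row)
    return pooled

# Hypothetical 4x4 feature map
fm = [[1, 3, 2, 4],
      [5, 6, 1, 2],
      [7, 2, 9, 1],
      [3, 4, 5, 8]]

pooled = max_pool(fm)
print(pooled)   # [[6, 4], [7, 9]]
```

With a 2×2 window and a stride of 2 the blocks do not overlap, so a 4×4 map is reduced to 2×2.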

## Images

Recall in the previous discussion of FCNs, we used the MNIST handwriting image dataset. We are now switching to the CIFAR10 dataset of photographic images. In your exercise, you were asked to re-purpose the FCN to handle images from CIFAR10. Here is a simple Jupyter Notebook that does that.

Although this produces results better than chance, we can further improve performance by using a simple CNN.

Results for the CNN are indeed better than the FCN, but only by a few percentage points. This suggests the CNN could be improved, possibly significantly, by expanding the CONV/RELU/POOL part of the network to better capture the features needed to differentiate image classes from one another. One clue is in the individual class accuracies, which show that images of natural subjects like cats and deer are being labeled much less accurately than images of man-made objects like cars and trucks.

## Recurrent Neural Networks

Recurrent neural networks (RNNs) are normally used to handle sequence-based data $I= \{ i_1, \ldots, i_n \}$. In simple terms, an RNN is built around a single basic DNN. The first sample in the input sequence $i_1$ is fed into the DNN, producing both an output $o_1$ and a hidden output $h_1$, with $o_1 = h_1$. The output can be fed into a standard FCN if the user wants to use it for classification. The hidden output $h_1$ is combined with the next sample in the input sequence $i_2$, and this pair $(h_1,i_2)$ is used as input into the DNN. This produces another output $o_2$ and hidden output $h_2$, with $o_2 = h_2$. The process continues until all samples in the input sequence are processed, producing a final output $o_n$ and hidden output $h_n$. In other words, a single DNN processes input samples and hidden output from the previous step recursively or recurrently, generating a result that represents both the output from the current RNN step and a hidden output for the next RNN step.

Note that one detail was left unspecified: if the DNN expects a sample input and previous hidden output pair as input, what is the hidden output for the first step in the recursion? Normally, an initial hidden output $h_0$, typically all zeros or random values, is generated and used for the first processing step in an RNN.

Visually, these images from Michael Phi's excellent page on LSTMs and GRUs show clearly how input samples $i_j$ from the input sequence $I$ are fed into the RNN's DNN structure one-by-one.

Processing input samples one-by-one from input sequence $I$ (attribution: Michael Phi)

This close-up image shows how the hidden output $h_{j-1}$ and the current input $i_j$ are combined and passed through a tanh function to produce a continuous output $o_j$ and hidden output value $h_j$.

 A recurrent step combines previous hidden input $h_{j-1}$ and current input $i_j$, then passes the combination through a tanh function to produce a continuous output $o_j$ and hidden output $h_j$ from the DNN (attribution: Michael Phi)

One way of understanding RNNs is that they use the concept of sequence memory. Over time, they have the ability to "learn" sequential patterns. In theory, this is something an FCN or CNN would have more difficulty doing, since an RNN uses previous information (hidden output) where FCNs and CNNs do not. Programmatically, this can be expressed in the following way.

    rnn = RNN()
    fcn = FeedForwardDNN()
    hidden = random()

    for i in input:
        output,hidden = rnn( i, hidden )

    prediction = fcn( output )
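The pseudocode above can be made concrete with a tiny scalar RNN in plain Python. This is only a sketch: the weights w_i, w_h, and b are fixed, made-up values rather than trained ones, the hidden output is a single number instead of a vector, and the follow-on FCN is omitted.

```python
import math

def rnn_step(i_j, h_prev, w_i=0.5, w_h=0.8, b=0.0):
    """One recurrent step: combine the input and previous hidden output, then squash with tanh."""
    h_j = math.tanh(w_i * i_j + w_h * h_prev + b)
    o_j = h_j              # in a basic RNN the output equals the hidden output
    return o_j, h_j

sequence = [1.0, 0.5, -0.3, 0.8]    # hypothetical input sequence I
hidden = 0.0                        # initial hidden output h_0
outputs = []

for i_j in sequence:
    output, hidden = rnn_step(i_j, hidden)
    outputs.append(output)

print(outputs)
```

The same rnn_step function is reused at every position in the sequence. Only the hidden value carried from one call to the next changes.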

Although basic RNNs are designed to "remember" previous results, they do have an important problem. As an RNN processes inputs, it begins to forget what it has seen in previous steps. Intuitively, you could say that a basic RNN has a short-term memory, but not a long-term memory. This happens because of the way backpropagation occurs. To understand this, think of how any DNN works. There are three major steps: (1) feedforward; (2) error based on predicted class; (3) backpropagation of error based on gradient descent to adjust edge weights. At any given layer in the DNN, its gradient values depend on the successive layer's gradient values. So, if the following layer has small gradient values, the current layer's gradients will be even smaller. This is the vanishing gradient effect. Layers near the beginning of the DNN are learning little or nothing.

Consider an RNN, where each step in the RNN is analogous to a layer in a DNN. A method similar to backpropagation (backpropagation through time) is used to improve the network, so the vanishing gradient effect occurs over steps in the RNN, as opposed to layers in a DNN. This explains intuitively why an RNN can remember recent timesteps, but not ones further back in time.
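To see the effect numerically, here is a small sketch using a scalar RNN step of the form $h_j = \tanh(w_i i_j + w_h h_{j-1})$. The weights and the 20-step input sequence are made-up values. The chain rule gives $\partial h_j / \partial h_{j-1} = w_h (1 - h_j^2)$, and multiplying these factors together shows how strongly the earliest hidden value still influences the latest one.

```python
import math

w_i, w_h = 0.5, 0.8            # hypothetical fixed weights
inputs = [0.5] * 20            # hypothetical 20-step input sequence

h = 0.0
grad = 1.0                     # d h_j / d h_0, accumulated by the chain rule
grads = []

for i_j in inputs:
    h = math.tanh(w_i * i_j + w_h * h)
    # d h_j / d h_{j-1} = w_h * (1 - h_j^2), using the derivative of tanh
    grad *= w_h * (1.0 - h * h)
    grads.append(grad)

print(f"after  1 step : {grads[0]:.6f}")
print(f"after 20 steps: {grads[-1]:.2e}")
```

Every factor here is smaller than one, so the product shrinks at each step. After 20 steps the first hidden value has almost no influence on the result, which is why weights cannot be adjusted to "remember" it.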

What is the practical effect of the lack of long-term memory? Consider an RNN processing a sentence term by term.

My dog Goro and I went for a walk, but a squirrel ran in front of us and he started chasing it.

It's easy for us to recognize that "he started chasing it" refers to Goro. But an RNN may no longer remember that "he" refers to a dog named Goro. This is a serious disadvantage for a type of DNN specifically designed to process sequence data.

## Long Short-Term Memory RNNs

Long short-term memory (LSTM) DNNs are a type of RNN designed to address the vanishing gradient problem. Initially introduced by Hochreiter & Schmidhuber in 1997, LSTMs use a set of three "gates" to maintain both a long-term and a short-term memory. Below is an image of the internals of an LSTM.
 The gates, activation functions, and data flow through the DNNs used in LSTMs (attribution: Michael Phi)

To understand what happens internally within the LSTM, we can look at the various data paths. Each step of an LSTM takes three pieces of information as input: the current input value from the sequence $i_j$, the previous hidden state $h_{j-1}$, and the previous cell state $c_{j-1}$. Each iteration of an LSTM RNN produces a new hidden state $h_j$, a new cell state $c_j$, and an output $o_j$, with $h_j = o_j$ as before. Intuitively, we think of the hidden state as our short-term memory, and the cell state as our long-term memory.

The path along the top manages the cell state $c_{j-1}$. As noted above, the cell state is the long-term "memory" of the network over the sequence processed to date. We must decide what to remove or "forget" from the cell state. Not surprisingly, this is called the forget gate layer. The previous hidden state $h_{j-1}$ and current input $i_j$ are used to control this decision. Mathematically, the output combines the hidden and input values, then passes them through a sigmoid operator.

$f = \sigma ( W_f \cdot [ h_{j-1},i_j ] + b_f )$

The previous hidden state and current input value $[ h_{j-1},i_j ]$ are multiplied by a weight in $W_f$ and combined with a bias in $b_f$, then passed through the $\sigma$ function to transform the result to the continuous range $[0 \ldots 1]$. A value close to 0 means completely forget, and a value close to 1 means fully retain. Remember that we are (normally) working with vectors $h_{j-1}$, $i_j$, $W_f$, and $b_f$. The resulting vector $f$ is pointwise-multiplied to the previous cell state $c_{j-1}$ using the elementwise product operator $\odot$ to alter its values. At this point it should be clear why values close to 0 or close to 1 remove or retain information in $c_{j-1}$.

Next comes the input gate, which decides what new information will be stored in long-term memory $c_j$. As with the forget gate, it combines the previous hidden state $h_{j-1}$ and the current input value $i_j$ to make this decision. This is made up of two parts $i_1$ and $i_2$.

\begin{align} i_1 & = \sigma( W_{i_1} \cdot [ h_{j-1}, i_{j} ] + b_{i_1} )\\ i_2 & = \textrm{tanh}( W_{i_2} \cdot [ h_{j-1}, i_{j} ] + b_{i_2} )\\ \end{align}

Intuitively, $i_1$ dictates what information "passes through" to the cell state, again with 0 indicating no pass-through, and 1 indicating complete pass-through. $i_2$ is more difficult to explain intuitively: it holds the candidate values that could be written to the cell state, with tanh keeping them on the range $[-1 \ldots 1]$ to regulate the network. $i_1$ and $i_2$ are pointwise-multiplied, and the result is pointwise-added to the cell state after the forget gate is applied. Again, at this point it should be clear how the values of $i_1$ on the range $[0 \ldots 1]$ affect what new information is added into long-term memory.
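Putting the forget and input gates together, the cell state update described in words above can be written as a single equation. This is a restatement using the symbols already defined, not an additional step.

\begin{align} c_j &= f \odot c_{j-1} + i_1 \odot i_2 \end{align}

As a one-element example with made-up values, $f = 0.9$, $c_{j-1} = 2.0$, $i_1 = 0.5$, and $i_2 = -0.4$ give $c_j = 0.9 \cdot 2.0 + 0.5 \cdot (-0.4) = 1.8 - 0.2 = 1.6$: most of the old memory is retained, and half of the candidate value is added to it.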

The final stage is the output gate, which combines all three pieces of information: the previous hidden state $h_{j-1}$ (short-term memory), the current input value $i_j$ (new input), and the new cell state after the forget and input gates are applied, $c_j$ (long-term memory). It uses two parts: $o_1$, representing a combination of the previous hidden state and the current input value, and $o_2$, representing long-term memory.

\begin{align} o_1 &= \sigma( W_{o_1} \cdot [ h_{j-1}, i_{j} ] + b_{o_1} )\\ o_2 &= \textrm{tanh}( W_{o_2} \cdot c_j + b_{o_2} )\\ h_j, o_j &= o_1 \odot o_2\\ \end{align}

The new short-term memory $o_1$ and long-term memory $o_2$ are pointwise-multiplied to produce the new hidden state and output $h_j$ and $o_j$ for the current sequence entry.

As with all DNNs, the key to how the network functions lies in its weights and biases $W_f$, $b_f$, $W_{i_1}$, $b_{i_1}$, $W_{i_2}$, $b_{i_2}$, $W_{o_1}$, $b_{o_1}$, $W_{o_2}$, and $b_{o_2}$. And as with all DNNs, these values are updated during each step of DNN training by using gradient descent and optimization to adjust the weights and biases in directions that produce results with a lower loss or error.

One final note about LSTMs is that they generally come in two forms: many-to-many or many-to-one. The difference is in whether we care about intermediate output values, or only the final output value produced after the last input $i_n$ in the sequence is processed. For example, if we are performing text generation, we would use a many-to-many approach. The output from each step would be fed through an FCN to produce a generated term, appended to the terms to date as we build up a sentence related to the input being processed. If we were performing sentiment analysis, we would most likely use a many-to-one approach, where we ignored intermediate outputs and passed only the final output through an FCN to determine a sentiment for the input sequence as a whole.
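The gate equations above can be collected into a single step function. The sketch below is plain Python with scalar states (a hidden size of one) and made-up weights. The function lstm_step and the params dictionary are hypothetical names for illustration, and a real LSTM would use weight matrices and vectors, for example through PyTorch's nn.LSTM.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(i_j, h_prev, c_prev, p):
    """One LSTM step for scalar states, following the forget, input, and output gate equations."""
    def gate(name):
        # W . [h_{j-1}, i_j] + b for the named gate
        w_h, w_i, b = p[name]
        return w_h * h_prev + w_i * i_j + b

    f = sigmoid(gate("f"))                # forget gate: how much old memory to keep
    i_1 = sigmoid(gate("i1"))             # input gate: how much new information passes
    i_2 = math.tanh(gate("i2"))           # candidate values for the cell state
    c_j = f * c_prev + i_1 * i_2          # new cell state (long-term memory)

    o_1 = sigmoid(gate("o1"))             # output gate from hidden state and input
    w_c, b_c = p["o2"]
    o_2 = math.tanh(w_c * c_j + b_c)      # long-term memory contribution
    h_j = o_1 * o_2                       # new hidden state, also the output o_j
    return h_j, c_j

# Hypothetical weights: (w_h, w_i, b) per gate, (w_c, b_c) for o2
params = {"f": (0.5, 0.5, 0.0), "i1": (0.5, 0.5, 0.0), "i2": (0.5, 0.5, 0.0),
          "o1": (0.5, 0.5, 0.0), "o2": (1.0, 0.0)}

# Many-to-one use: walk the whole sequence, keep only the final hidden state
h, c = 0.0, 0.0
for i_j in [1.0, 0.5, -0.3]:
    h, c = lstm_step(i_j, h, c, params)

print(h, c)
```

For a many-to-many approach, the value of h after each step would be collected instead of only the last one.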

### Airline Passenger Example

As a "simple" example of an LSTM, we'll use a built-in dataset from seaborn, a Python statistical graphics and visualization library. seaborn includes an airline flight dataset with 144 months of data that includes, among other things, the number of passengers that flew each month. We will use an LSTM RNN to train on twelve months (one year) of passenger data, then use the resulting model to predict the number of passengers flying the month immediately following our twelve month training period.

    import seaborn as sns

    flight_data = sns.load_dataset( 'flights' )
    print( flight_data.head() )
    print( len( flight_data ) )

       year     month  passengers
    0  1949   January         112
    1  1949  February         118
    2  1949     March         132
    3  1949     April         129
    4  1949       May         121
    144

To start, we load the libraries we will use in our program.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import torch
    import torch.nn as nn

    from sklearn.preprocessing import MinMaxScaler

Next, we will create our LSTM neural network. Since this is a simple example, the LSTM will be made up of a recurrent node as described above, with output from the node fed into a fully-connected network with a single hidden layer.

    class LSTM( nn.Module ):
        def __init__( self, input_size=1, hidden_n=100, output_size=1 ):
            super( LSTM, self ).__init__()

            self.hidden_layer_size = hidden_n
            self.lstm = nn.LSTM( input_size, hidden_n )
            self.fcn = nn.Linear( hidden_n, output_size )

            # Hidden cell contains previous hidden state, previous cell state,
            # both set to zero to start

            self.hidden_cell =\
                ( torch.zeros( 1, 1, self.hidden_layer_size ),
                  torch.zeros( 1, 1, self.hidden_layer_size ) )

        # End method __init__

        def forward( self, input_seq ):
            # input_seq is a 12-value tensor, the 12 months of passengers to
            # train on, normalized on the range -1..1, as a single row

            # Input to an LSTM is of shape (seq_len, batch, input_size)
            # so view below creates a column of 12 "samples", each sample is
            # a single value, a batch size of 1, and a length of 1 (one 12-value
            # sample sequence)

            input_seq = input_seq.view( len( input_seq ), 1, 1 )

            # Ask the LSTM for the output and the (hidden,cell) state based
            # on the 12-value input and the current (hidden,cell) state, this
            # will recurse the LSTM 12 times

            lstm_out,self.hidden_cell = self.lstm( input_seq, self.hidden_cell )
            lstm_out = lstm_out.view( len( input_seq ), -1 )

            # Run the LSTM output for every step through the FCN to get a
            # predicted (normalized) passenger count per step

            predictions = self.fcn( lstm_out )

            # Prediction from the final step is our estimate for the month
            # following the 12-month sequence

            return predictions[ -1 ]

        # End method forward

    # End class LSTM

We will also create a helper function that takes a full sequence of monthly passenger data, and breaks it into tuples of training and label data. The training data is a list of twelve months of passenger counts $t_i = \{ p_j, \ldots, p_{j+11} \}$. The label is the passenger count for the month immediately following the twelve-month training period $l_i = p_{j+12}$. Each of these $(t_i, l_i)$ tuples is stored in a list of training samples used to train our LSTM $T = \{ (t_0, l_0), (t_1, l_1), \ldots, (t_{119}, l_{119}) \}$.

    def create_IO_seq( input, tw ):
        # Create a set of time series to process during training, input is the
        # entire data stream to divide, tw is time window size in samples
        #
        # IO_seq is a (train_seq,label) tuple list, train_seq is a 12-month
        # set of passengers, label is a single passenger count following the
        # 12-month sequence

        IO_seq = [ ]
        n = len( input )

        for i in range( 0, n - tw ):
            # Grab tw elements as training sequence, next element that follows
            # is the label (i.e., the number of passengers following the given
            # 12-month period)

            train_seq = input[ i: i + tw ]

            # To be pedantic, pull single tensor value as float, then make a
            # single-element float list and convert it back to a tensor; can be
            # done in a single step as:
            #
            #   train_label = input[ i + tw: i + tw + 1 ]
            #
            # but I find that harder to understand
            #
            # val is a single float, the number of passengers following the
            # 12-month training sequence we just extracted
            #
            # train_label is [ val ] (a single-element float list) as a tensor

            val = input[ i + tw ].item()
            train_label = torch.FloatTensor( [ val ] )

            IO_seq.append( (train_seq,train_label) )

        return IO_seq

    # End function create_IO_seq
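To see the sliding window on a small scale, here is a plain Python sketch of the same idea that works on ordinary lists rather than tensors. The name make_windows and the sample values are assumptions for illustration.

```python
def make_windows(values, tw):
    """Return (window, label) pairs: tw consecutive values and the value that follows them."""
    return [(values[i:i + tw], values[i + tw]) for i in range(len(values) - tw)]

pairs = make_windows([10, 20, 30, 40, 50], tw=3)
for window, label in pairs:
    print(window, "->", label)
# [10, 20, 30] -> 40
# [20, 30, 40] -> 50

# 132 training months with a 12-month window give 120 (sequence, label) pairs
print(len(make_windows(list(range(132)), tw=12)))   # 120
```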

Next, we will load our dataset, transform the values from raw passenger counts into "normalized" values on the range $[-1 \ldots 1]$, then use our helper function create_IO_seq to create our training sequence.

    # Load data, divide into train and test

    flight_data = sns.load_dataset( 'flights' )

    # First 132 months for train, last 12 months for test, split into
    # 12-month training sequences and 1-value labels:
    #
    #   1 2 3 4 5 6 7 8 9 10 11 12   13
    #   - 12 months of training -    next value is label
    #
    # so train on 12-month sequence..then see how many passengers next month

    all_data = flight_data[ 'passengers' ].values.astype( float )

    test_data_size = 12
    train_data = all_data[ :-test_data_size ]
    test_data = all_data[ -test_data_size: ]

    # Transform/normalize data to range -1..1

    scaler = MinMaxScaler( feature_range=(-1,1) )

    # First reshape data into a single column then transform to range -1..1

    train_data_norm =\
        scaler.fit_transform( train_data.reshape( len( train_data ), 1 ) )

    # Convert back to PyTorch tensor that's a single row

    train_data_norm =\
        torch.FloatTensor( train_data_norm ).view( len( train_data_norm ) )

    # Create 12-value training sequences and corresponding next value label

    train_window = 12
    train_IO_seq = create_IO_seq( train_data_norm, train_window )
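The normalization performed by MinMaxScaler with a feature range of (-1,1) is a linear rescaling. The plain Python sketch below reproduces that arithmetic on the first five passenger counts. The helper names to_range and from_range are hypothetical and are not part of scikit-learn.

```python
def to_range(x, lo, hi):
    """Map x from [lo, hi] linearly onto [-1, 1]."""
    return (x - lo) / (hi - lo) * 2.0 - 1.0

def from_range(y, lo, hi):
    """Invert to_range, mapping y from [-1, 1] back onto [lo, hi]."""
    return (y + 1.0) / 2.0 * (hi - lo) + lo

counts = [112, 118, 132, 129, 121]       # first five months of passengers
lo, hi = min(counts), max(counts)        # "fit" step: remember min and max

scaled = [to_range(x, lo, hi) for x in counts]
restored = [from_range(y, lo, hi) for y in scaled]

print(scaled)      # smallest count maps to -1, largest to 1
print(restored)    # approximately the original counts
```

Note that the minimum and maximum come from the training data only. The same two values must be reused later to convert the LSTM's normalized predictions back into passenger counts.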

Finally, we can create our LSTM RNN, choose our loss and optimizer functions (mean squared error and Adam, respectively), and use our training sequence to teach the model how to predict future passenger counts based on the previous year's monthly passenger count sequences. We will process the entire training sequence for 150 epochs.

    # Create the model, loss function (mean squared error), and optimizer
    # (Adam); the learning rate is left at Adam's default here

    model = LSTM()
    loss_function = nn.MSELoss()
    optimizer = torch.optim.Adam( model.parameters() )

    # Train LSTM

    model.train()
    epochs = 150

    for i in range( 0, epochs ):
        for seq,label in train_IO_seq:
            optimizer.zero_grad()

            # Re-initialize hidden and cell state to zero prior to
            # walking over the samples in the training sequence

            model.hidden_cell =\
                ( torch.zeros( 1, 1, model.hidden_layer_size ),
                  torch.zeros( 1, 1, model.hidden_layer_size ) )

            y_pred = model( seq )

            single_loss = loss_function( y_pred, label )
            single_loss.backward()
            optimizer.step()

        if i % 25 == 1:
            print( f'epoch {i:3}; loss: {single_loss.item():10.8f}' )

    print( f'epoch {i:3}; loss: {single_loss.item():10.10f}' )

Notice an important subtlety here. The order of processing for our LSTM is:

1. Zero any previous gradient information.
2. Reset both our hidden and cell states (the code above sets them to zero).
3. Run the LSTM over the entire 12-month sequence, storing the final output value as y_pred.
4. Compute the loss between our prediction y_pred and the known passenger count for the 13th month.
5. Now use gradient descent to backpropagate the loss through the LSTM.
6. Use the Adam optimizer to update the LSTM's weights, hopefully to better predict future passenger counts.

Critically, gradient descent, backpropagation, and weight optimization happen after each 12-month sequence is processed, and not after each month in the 12-month sequence is processed. This detail is important to understand.

Once the network is trained, we can predict passenger counts for the final twelve months, and compare them to the known counts. Notice we have two options for testing. We can take the known 12-month sequences, or we can take a combination of the known and predicted values to form a 12-month sequence.

$\begin{array}{r l l} t_1 &=\;\; \{ p_{121}, p_{122}, \ldots, p_{132} \} &\rightarrow \;\; l_1\\ t_2 &=\;\; \{ p_{122}, p_{123}, \ldots, p_{133} \} &\rightarrow \;\; l_2\\ &\qquad\qquad\,\cdots&\;\\ t_{12} &=\;\; \{ p_{132}, p_{133}, \ldots, p_{143} \} &\rightarrow \;\; l_{12}\\ \\ &\;\;\;\textrm{versus}\\ \\ t &=\;\; \{ p_{121}, p_{122}, \ldots, p_{132} \} &\rightarrow \;\; l_1\\ t &=\;\; \{ p_{122}, p_{123}, \ldots, l_{1} \} &\rightarrow \;\; l_2\\ &\qquad\qquad\,\cdots&\;\\ t &=\;\; \{ p_{132}, l_{1}, \ldots, l_{11} \} &\rightarrow \;\; l_{12}\\ \end{array}$

We choose to implement the second approach. This approach would be required if you are using all of your known data to predict a target variable different from your predictors. For example, suppose we had 144 months of temperature data, and 156 months of precipitation data. If we use the full 144 months of temperature data to build our model, then the final 12 months we predict for testing have no corresponding temperature data. In this case, we have no choice but to use our predictions to "fill in" the unavailable temperature data as we predict the 1st, 2nd, $\ldots$, and 12th precipitation values. The tradeoff is more data during training versus estimated data that likely contains errors during testing for accuracy.

    # Predict final twelve months

    model.eval()
    fut_pred = 12

    # Grab last twelve months of data, this will be the sequence used to
    # predict the first test value (remember, 132 training and 12 test
    # values were split at the beginning of the program)

    test_outputs = [ ]
    test_inputs = train_data_norm[ -train_window: ].tolist()

    # Run through all 12 test values

    for i in range( 0, fut_pred ):
        # Convert list of last 12 (normalized) passengers to a tensor

        seq = torch.FloatTensor( test_inputs[ -train_window: ] )

        # with torch.no_grad() runs LSTM without calculating gradients, we
        # can only do this b/c we know we don't need gradients, backward()
        # is not called at the end of this test run, b/c we are passing
        # one single 12-value sequence and only care about the final output

        with torch.no_grad():
            # Reset the (hidden,cell) state the model actually uses

            model.hidden_cell =\
                ( torch.zeros( 1, 1, model.hidden_layer_size ),
                  torch.zeros( 1, 1, model.hidden_layer_size ) )

            # Append output of LSTM to test_inputs, so when we loop again and
            # grab the last 12 values for input, it includes the output(s)
            # the LSTM is generating. Also, make sure to save the outputs for
            # later accuracy calculations

            test_inputs.append( model( seq ).item() )

        test_outputs.append( test_inputs[ -1 ] )
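The difference between the two testing options can be seen on a toy series. In the sketch below, predict_next is a hypothetical stand-in for a trained model (it simply averages its input window), so the numbers say nothing about LSTM accuracy. Only the way the input windows are built matters.

```python
def predict_next(window):
    """Hypothetical stand-in for a trained model: predicts the mean of the window."""
    return sum(window) / len(window)

def forecast_known(history, future, tw):
    """Option 1: every input window is built from known values only."""
    series = history + future
    start = len(history)
    return [predict_next(series[start + k - tw:start + k])
            for k in range(len(future))]

def forecast_rolling(history, steps, tw):
    """Option 2: feed each prediction back in as input for the next one."""
    inputs = list(history[-tw:])
    preds = []
    for _ in range(steps):
        preds.append(predict_next(inputs[-tw:]))
        inputs.append(preds[-1])
    return preds

history = [1.0, 2.0, 3.0, 4.0]      # known "training" values
future = [5.0, 6.0]                 # known "test" values

known = forecast_known(history, future, tw=2)
rolling = forecast_rolling(history, steps=2, tw=2)

print(known)     # [3.5, 4.5]
print(rolling)   # [3.5, 3.75]
```

The first prediction is identical in both cases, since both start from the same known window. From the second prediction onward the rolling approach depends on its own earlier output, so any error it makes is carried forward.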

Finally, to visually inspect our predictions, we plot the full 144-month sequence of known passenger counts in blue, then show the last twelve months of predicted passenger counts in orange.

    # The LSTM output is normalized on range -1..1, so we need to invert this
    # to get actual passenger numbers, need these to do a proper comparison to
    # known passenger numbers for accuracy calculations

    actual_pred =\
        scaler.inverse_transform( np.array( test_outputs ).reshape( -1, 1 ) )

    # Plot known values in blue

    fig_size = plt.rcParams[ 'figure.figsize' ]
    fig_size[ 0 ] = 15
    fig_size[ 1 ] = 5
    plt.rcParams[ 'figure.figsize' ] = fig_size

    plt.title( 'Months vs Passengers' )
    plt.ylabel( 'Total Passengers' )
    plt.xlabel( 'Months' )
    plt.grid( True )
    plt.autoscale( axis='x', tight=True )
    plt.plot( flight_data[ 'passengers' ] )

    # Add in the predicted values in orange

    x = np.arange( 132, 144, 1 )
    plt.plot( x, actual_pred )
    plt.show()

The results are acceptable but not outstanding, although the LSTM was able to catch the up–down seasonal variation in the data, something a basic RNN would probably miss. You can compare this to an approach where we use only known values during testing. Not surprisingly, this produces slightly better results.

## DNN Practice Exercises

Below is an example of the MNIST problem solved using an FCN with a single hidden layer (Jupyter Notebook, Python file). Extend this example to use two hidden layers instead of one. The second hidden layer should take input from the first, contain 64 nodes, and use ReLU as its activation function. Note: to achieve this, you should only need to make minor changes in the FCN class's `__init__()` and forward() methods.

### MNIST FCN Solution

    class FCN( nn.Module ):
        # FCN model to convert image to digit

        def __init__( self ):
            super( FCN, self ).__init__()

            self.hidden0 = nn.Linear( 784, 128 )
            self.hidden1 = nn.Linear( 128, 64 )
            self.output = nn.Linear( 64, 10 )

            self.sigmoid = nn.Sigmoid()
            self.relu = nn.ReLU()
            self.softmax = nn.LogSoftmax( dim=1 )

        # End function __init__

        def forward( self, input ):
            hidden = self.hidden0( input )
            hidden = self.sigmoid( hidden )
            hidden = self.hidden1( hidden )
            hidden = self.relu( hidden )
            output = self.output( hidden )
            output = self.softmax( output )

            return output

        # End function forward

    # End class FCN

Full Python code solution: MNIST-FCN.py

Normally, image data is processed with a CNN rather than an FCN. The FCN worked well here because taking the $28 \times 28$ image and concatenating rows to create a single vector of greyscale values formed patterns that distinguished different digits well. Could we do better with a CNN? Would you like to find out? If so, try implementing one.

This is more complicated than the previous example, because you will need to completely replace the FCN class with a CNN class. The rest of the code can remain unchanged, except for: (1) using a different criterion better suited for images, (2) removing the step that flattens images, because a CNN can handle 2D images directly, unlike an FCN which expects its input to be a single 1D vector, and (3) updating the evaluation section of the mainline to account for the format of the output from the CNN model.

Here is the CNN model structure you should use.

1. A 2D convolution that takes the MNIST images as input and produces six feature maps (output channels) using a $4 \times 4$ kernel with a stride of $1$.
2. An ReLU activation on the results to replace negative values with $0$ while leaving positive values unchanged.
3. A 2D maxpooling of size $2 \times 2$ to downsample the image from size $28 \times 28$. Be careful here! Think hard about what size of image results you'll get from your 2D convolution step (hint: it IS NOT $28 \times 28$, consider what happens when a $4 \times 4$ filter reaches either the right edge or bottom edge of a $28 \times 28$ image). You will need to add padding to the MaxPool2d operation to account for this.
4. Another 2D convolution producing 16 filters using a $2 \times 2$ kernel with a stride of $1$.
5. An ReLU activation on the results.
6. A 2D maxpooling of size $2 \times 2$.
7. Flatten the results into a single 1D vector as input for the follow-on FCN.
8. A linear layer with $120$ nodes taking the results of the convolution stage as input.
9. An ReLU activation on the results.
10. A linear layer with 10 nodes taking output from the previous hidden layer and producing scores for the ten possible handwritten digit classes.
11. Finally, a LogSoftmax operation on the probabilities to produce normalized logarithmic results.
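12. Update your criterion to use nn.CrossEntropyLoss() rather than nn.NLLLoss() in your train() function. (nn.CrossEntropyLoss() applies its own log-softmax internally; applying log-softmax to values that are already log-probabilities leaves them unchanged, so keeping the LogSoftmax layer from step 11 does not alter the loss.)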
13. In your train() function, you DO NOT need to flatten the images with view(), since a CNN takes the 2D $28 \times 28$ greyscale images directly.
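The hint in step 3 can be checked with a little arithmetic. The sketch below uses the standard output-size formula for convolution and pooling windows, $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, kernel size $k$, stride $s$, and padding $p$. The helper name out_size is hypothetical, and the padding of 1 on each max pool is an assumption about how the padding in step 3 is applied.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling window, rounding down."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28
size = out_size(size, kernel=4)                        # first convolution
print("after conv 4x4:", size)                         # 25
size = out_size(size, kernel=2, stride=2, padding=1)   # first max pool
print("after pool 2x2:", size)                         # 13
size = out_size(size, kernel=2)                        # second convolution
print("after conv 2x2:", size)                         # 12
size = out_size(size, kernel=2, stride=2, padding=1)   # second max pool
print("after pool 2x2:", size)                         # 7

# Number of inputs to the first linear layer: 16 feature maps of 7 x 7
print(16 * size * size)                                # 784
```

Under these assumptions the flattened vector passed to the first linear layer has 784 values.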

### MNIST CNN Solution

    class CNN( nn.Module ):
        def __init__( self ):
            super( CNN, self ).__init__()

            # 2D convolution, 1 input channel, 6 output channels, 4x4 kernel,
            # stride of 1, 28x28 image will become a 25x25 result, b/c last
            # 4x4 kernel at right edge and bottom edge produces 1 result for 4
            # cells, meaning image shrinks by 3 cells

            self.conv1 = nn.Conv2d( 1, 6, kernel_size=4, stride=1 )

            # 2D convolution, input size of 6 (from previous convolution layer),
            # output size of 16, this follows a MaxPool2d of size 2x2 with a
            # padding of 1 that downsamples the 25x25 result to 13x13, so this
            # convolution's 2x2 kernel produces a 12x12 result

            self.conv2 = nn.Conv2d( 6, 16, kernel_size=2 )

            # FCN section of CNN, result of second convolution is again
            # MaxPool2d at 2x2 resolution with padding of 1, which pads the
            # 12x12 result to 14x14 and reduces it to 7x7 over the 16 output
            # filters

            self.fc1 = nn.Linear( 16 * 7 * 7, 120 )
            self.fc2 = nn.Linear( 120, 10 )

            self.pool = nn.MaxPool2d( 2, 2, padding=1 )
            self.relu = nn.ReLU()
            self.softmax = nn.LogSoftmax( dim=1 )

        def forward( self, input ):
            conv = self.conv1( input )
            conv = self.relu( conv )
            conv = self.pool( conv )

            conv = self.conv2( conv )
            conv = self.relu( conv )
            conv = self.pool( conv )

            # Flatten the 16 7x7 feature maps into a single 1D vector

            conv = conv.view( -1, 16 * 7 * 7 )

            output = self.fc1( conv )
            output = self.relu( output )
            output = self.fc2( output )
            output = self.softmax( output )

            return output

        # End function forward

    # End class CNN

    ...

    def train( model, epoch, trainloader ):
        # Training function, use stochastic gradient descent to optimize

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD( model.parameters(), lr=0.003, momentum=0.9 )

        for e in range( 0, epoch ):
            running_loss = 0

            for images,labels in trainloader:
                optimizer.zero_grad()

                output = model( images )
    ...

    model.eval()

    for images,labels in testloader:
        # Use trained CNN to convert (batch of 64) images to class
        # log-probabilities

        prob = model( images )

        # Check all results against known class labels

        for i in range( 0, len( labels ) ):
            # Invert probabilities since they are natural log'd
            # (LogSoftmax), then from a tensor to a list

            p = torch.exp( prob[ i ] )
            p = p.tolist()

            pred_label = p.index( max( p ) )
            true_label = labels[ i ].item()

            if true_label == pred_label:
                correct += 1
            n += 1
    ...

Full Python code solution: MNIST-CNN.py

In my testing, accuracy for the CNN was 98% or higher, slightly better than the FCN example.