Lecture 6: Training Neural Networks, Part 2¶

By the end of the assignment, you will have a good understanding of all the low-level details of how a ConvNet classifies images.
I am so excited! Here is the Assignment link again.
Training a ConvNet is a four-step process.
- Sample: Take a mini-batch of training data.
- Loss: Forward the batch through the network; the loss tells us how well we are classifying at the moment.
- Backpropagation: We backpropagate to compute the gradient on all the weights. This gradient tells us how we should nudge every single weight to make better classifications.
- Update: We use the gradients to make a small nudge to the weights.

There is an entire zoo of activation functions available.

Activation Functions¶
If you do not use an activation function, your entire network is just a sandwich of linear functions.
Its capacity is then no better than that of a single linear classifier.
Activation functions are critical; they provide the non-linearity needed to fit your data.
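A quick way to see the "linear sandwich" point is to multiply the weight matrices out; here is a minimal numerical sketch with made-up shapes:

```python
# Two linear layers with no activation in between collapse into one linear map.
import numpy as np

np.random.seed(0)
x = np.random.randn(4)        # input vector (toy shape)
W1 = np.random.randn(5, 4)    # "layer 1" weights
W2 = np.random.randn(3, 5)    # "layer 2" weights

two_layers = W2.dot(W1.dot(x))     # forward pass with no non-linearity
one_layer = (W2.dot(W1)).dot(x)    # a single equivalent linear classifier
print(np.allclose(two_layers, one_layer))  # True
```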

The problem here is how to initialize the weights. Xavier initialization is a reasonable starting point.
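A minimal sketch of Xavier initialization for a single weight matrix (the layer sizes here are assumptions, not values from the lecture):

```python
import numpy as np

fan_in, fan_out = 500, 500   # assumed layer sizes
# Xavier initialization: scale by 1/sqrt(fan_in) so the variance of the
# activations stays roughly constant from layer to layer.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
```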

Batch Normalization (BN) gets rid of many headaches. It reduces the strong dependence on initialization.

Here are some tips and tricks for babysitting the learning process.

Today's Agenda¶



Parameter update is just gradient descent. Can we make it better?

Stochastic Gradient Descent¶
The classic .gif is shown below. In practice, you rarely use vanilla SGD.



SGD is the slowest among all of them.

In the figure, the gradient has a large vertical component (the steep direction) and only a small horizontal one (the shallow direction).

So you move way too fast across the steep direction and very slowly along the shallow one. This results in jitter.
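For reference, the vanilla SGD update is a single step along the negative gradient; here is a toy sketch (the gradient dx would normally come from backprop):

```python
import numpy as np

x = np.random.randn(10)        # parameters (toy example)
dx = np.random.randn(10)       # gradient of the loss w.r.t. x (stand-in for backprop)
learning_rate = 1e-2

# Vanilla SGD update
x += -learning_rate * dx
```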

Momentum¶
To solve this problem, we can use momentum.
\(\mu\) is a hyperparameter between 0 and 1, typically around 0.9.

We don't apply the learning rate to the raw gradient directly; instead, we build up a velocity and use it to make the update (see the sketch after this list).
Think of a ball rolling around and slowing down over time:
- The gradient acts as a force.
- The \(\mu v\) term acts like friction, damping the velocity over time.
- The velocity \(v\) is initialized to 0.
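A sketch of the momentum update under the same toy setup as before (dx stands in for the gradient from backprop):

```python
import numpy as np

x = np.random.randn(10)        # parameters (toy example)
dx = np.random.randn(10)       # gradient from backprop (stand-in)
learning_rate, mu = 1e-2, 0.9
v = np.zeros_like(x)           # velocity, initialized to zero

# Momentum update: integrate the gradient into a velocity, then step along it
v = mu * v - learning_rate * dx
x += v
```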

SGD is slower than momentum, as expected. Momentum overshoots the target because it builds up velocity.

Nesterov Momentum¶
A variation of Momentum Update.
In standard momentum, we combine the momentum step with a gradient step evaluated at the current position; with Nesterov, we evaluate the gradient at the end of the momentum step instead.

It involves a one-step look-ahead. Evaluate the gradient at the look-ahead step.
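A sketch of the look-ahead form, using a toy quadratic objective so the gradient function is explicit (the objective is an assumption for illustration only):

```python
import numpy as np

def grad(x):
    # gradient of the toy objective f(x) = 0.5 * ||x||^2
    return x

x = np.random.randn(10)
v = np.zeros_like(x)
learning_rate, mu = 1e-2, 0.9

# Nesterov momentum: evaluate the gradient at the look-ahead point
x_ahead = x + mu * v
v = mu * v - learning_rate * grad(x_ahead)
x += v
```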

In theory and in practice, it almost always works better than standard momentum.

This is a bit ugly and doesn't fit well into the usual API. Normally, we do a forward and a backward pass, so we only have the parameter vector and the gradient at that same point, not at the look-ahead point.

You can perform a variable transform.
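With that variable transform, the update can be written so the gradient is evaluated at the parameter vector we actually store, which fits the usual forward/backward API; a sketch:

```python
import numpy as np

x = np.random.randn(10)        # stored parameters (toy example)
dx = np.random.randn(10)       # gradient at x, from backprop (stand-in)
v = np.zeros_like(x)
learning_rate, mu = 1e-2, 0.9

# Nesterov momentum after the variable transform
v_prev = v
v = mu * v - learning_rate * dx    # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v   # position update changes form
```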

You can check the notes for more details.

NAG stands for Nesterov Accelerated Gradient in the graph:

NAG curls around much more quickly than SGD with Momentum. 🍓

Local Minima¶
As you scale up Neural Networks, the local minima issue goes away; the best and worst local minima get really close.
AdaGrad¶
Is it just a scaled SGD? Essentially, yes: it scales the step size per parameter.
It is very common in practice. Originally developed in convex optimization literature, it was ported to Neural Networks.
We build a cache which is the sum of squared gradients, a giant vector of the same size as the parameter vector.
The cache is an estimate of the uncentered second moment of the gradients. This is called a per-parameter adaptive learning rate method: every single dimension of the parameter space now has its own effective learning rate, scaled dynamically based on the gradients we are seeing.
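A sketch of the AdaGrad update (dx again stands in for the gradient from backprop):

```python
import numpy as np

x = np.random.randn(10)
dx = np.random.randn(10)            # gradient from backprop (stand-in)
learning_rate, eps = 1e-2, 1e-7
cache = np.zeros_like(x)            # running sum of squared gradients

# AdaGrad: each parameter's step is scaled by the history of its own gradients
cache += dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + eps)
```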

What happens with AdaGrad when updating?

We have a large gradient vertically. That large gradient (fast changes) will be added to the cache, and then we end up dividing by larger and larger numbers, so we'll get smaller and smaller updates in the vertical step.
Since we're seeing lots of large gradients vertically, this will decay the learning rate, and we'll make smaller and smaller steps in the vertical direction.
But in the horizontal direction—which is a very shallow direction—we end up with smaller numbers in the denominator. Relative to the Y dimension, we're going to end up making faster progress.
So we have this equalizing effect of accounting for the steepness, and in shallow directions, you can actually have a much larger learning rate compared to the vertical directions.
That's AdaGrad.

One problem with AdaGrad: it can decay to a halt.¶
Your cache ends up building up all the time. You add all these positive numbers to your denominator, so your learning rate just decays towards zero, and you end up stopping learning completely.
That's okay in convex problems, perhaps, where you just have a ball and you decay down to the optimum and you're done.
But in a neural network, the parameters keep moving around as they try to fit your data; the optimizer needs continued energy to keep learning.
You don't want it to just decay to a halt.
1e-7 is there to prevent the division by zero error. It is also a hyperparameter.
RMSProp (coming up next) forgets gradients from long ago: it keeps an exponentially weighted sum instead.

RMSProp¶
There's a very simple change to AdaGrad that was proposed by Geoff Hinton: rmsprop. 🤭
Instead of keeping just the sum of squares in every single dimension, we make that counter a leaky counter.
We introduce a decay rate hyperparameter, usually set to something like 0.99. You accumulate the sum of squares, but it leaks slowly with this decay rate.
We still maintain this nice equalizing effect of step sizes in steep or shallow directions, but we won't converge completely to zero updates.
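A sketch of the RMSProp update; the only change from AdaGrad is the leaky cache:

```python
import numpy as np

x = np.random.randn(10)
dx = np.random.randn(10)            # gradient from backprop (stand-in)
learning_rate, eps = 1e-2, 1e-7
decay_rate = 0.99                   # the leak; a hyperparameter
cache = np.zeros_like(x)

# RMSProp: exponentially decaying sum of squared gradients
cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
x += -learning_rate * dx / (np.sqrt(cache) + eps)
```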
It was just a slide in a Coursera course.

People cited this slide. 😅

Here is the image again. AdaGrad is blue, RMSProp is black.


Usually, in practice when training deep neural networks, AdaGrad stops too early and RMSProp ends up winning out.
Adam¶
Combine RMSProp with Momentum. 🍉
Adam is a recent update that has elements of both.
The Adam optimizer is not necessarily the "best" for all neural networks, but it is a popular and effective choice for many applications. There are several reasons for its popularity:
- Adaptive learning rate: Adam adapts the learning rate for each parameter, which helps with faster convergence and better performance. It combines the advantages of two other popular methods, momentum and RMSProp, by using a first moment estimate (mean) and a second moment estimate (uncentered variance) of the gradients.
- Memory efficiency: Adam only requires storing two additional vectors (the two moment estimates) per parameter, so its memory overhead is modest.
- Easy to implement: Adam only requires maintaining moving averages of the gradients and of the squared gradients.
- Robust performance: Adam has been shown to perform well on a wide range of optimization tasks, including deep neural networks, making it a popular choice among practitioners.
However, it is essential to note that the choice of optimizer depends on the specific problem and the nature of the data. It is always recommended to experiment with different optimizers and tune their hyperparameters to find the best fit for a given task.

It's kind of like both together.
\(m\) keeps an exponentially decaying sum of the raw gradients (like momentum).
\(v\) keeps an exponentially decaying sum of the squared gradients, i.e., the second moment (like RMSProp).

If we compare rmsprop with Momentum and Adam:

\(\beta_1\) and \(\beta_2\) are hyperparameters. Usually, \(\beta_1 = 0.9\) and \(\beta_2 = 0.995\).
We replace the \(dx\) in the second equation of RMSProp with \(m\), the decaying running sum of \(dx\).
At any time, you will have noisy gradients. Instead of using those noisy gradients, you use a weighted (decaying) sum of previous gradients, which stabilizes the gradient direction.
The fully complete version is shown below:

There is also bias correction, which depends on the time step \(t\). Bias correction is only important as Adam is warming up.
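A sketch of the full Adam update with bias correction (dx stands in for the gradient; t is the current time step):

```python
import numpy as np

x = np.random.randn(10)
dx = np.random.randn(10)            # gradient from backprop (stand-in)
learning_rate, eps = 1e-3, 1e-7
beta1, beta2 = 0.9, 0.995
m = np.zeros_like(x)                # first moment (momentum-like)
v = np.zeros_like(x)                # second moment (RMSProp-like)
t = 1                               # time step, starts at 1

# Adam update with bias correction (matters mostly while m and v warm up)
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx ** 2)
mt = m / (1 - beta1 ** t)           # bias-corrected first moment
vt = v / (1 - beta2 ** t)           # bias-corrected second moment
x += -learning_rate * mt / (np.sqrt(vt) + eps)
```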

Which learning rate is best? It depends.

You should start with a high learning rate; it optimizes faster. At some point, though, the updates become too stochastic: there is too much energy in the system to settle down into the narrow, good parts of your loss function.
So decay your learning rate over time; that way you get the fast early progress of a high learning rate and the fine convergence of a low one.
Epoch¶
1 Epoch means you have seen all of the training data once.
Learning Rate Decays¶
- Step decay
- Exponential decay
- 1/t decay
These learning rate decays are solid for SGD and Momentum SGD. Adam and AdaGrad are less dependent on them.
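The three schedules written as small functions; alpha0 and k are assumed hyperparameter values, not numbers from the lecture:

```python
import numpy as np

alpha0 = 1e-2    # initial learning rate (assumed)
k = 0.1          # decay strength (assumed)

def step_decay(epoch, drop=0.5, every=10):
    # e.g. halve the learning rate every `every` epochs
    return alpha0 * (drop ** (epoch // every))

def exp_decay(t):
    # alpha = alpha0 * exp(-k * t)
    return alpha0 * np.exp(-k * t)

def one_over_t_decay(t):
    # alpha = alpha0 / (1 + k * t)
    return alpha0 / (1.0 + k * t)
```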
Andrej uses Adam for everything now. 🥳
These are all first-order methods because they only use the gradient information of your loss function. When you evaluate the gradient, you know the slope in every single direction.
Second Order Methods¶
These provide a better approximation to your loss function. They do not only approximate with the hyperplane (which way we are sloping) but also approximate it with the Hessian, telling you how your surface is curving.
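For reference, the classic second-order (Newton) step uses the inverse Hessian; note that no learning rate appears in it:

\[
x \leftarrow x - H^{-1} \nabla_x f(x)
\]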

Pros:
- Faster convergence.
- Fewer hyperparameters: no need for a learning rate.

Cons:
- Your Hessian will be gigantic: with a 100-million-parameter network, the Hessian is 100M x 100M, and you have to invert it.

So, this is not a good idea in Neural Networks as-is.
You can get around inverting the Hessian using BFGS and L-BFGS.

These are used in practice.

L-BFGS works really well when you can afford to evaluate the full, deterministic objective \(f(x)\) (i.e., full batch). With mini-batches and their noise, it doesn't work well.

Adam is the default. If you have a small dataset, you can look up L-BFGS.

Model Ensembles¶

What does that mean? Train multiple independent models and average their predictions. You have to train all of these models, so that is not ideal.

A cheaper trick: save checkpoints of a single model while you are training and ensemble those.

Another trick: keep x_test, a running, exponentially decaying average of the parameter vector. This x_test tends to work better on validation data.
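A sketch of the running-average trick; the 0.995 smoothing factor is a typical value like the one used in the lecture example:

```python
import numpy as np

x = np.random.randn(10)          # current parameters (toy example)
x_test = np.zeros_like(x)        # exponentially decaying average, used for evaluation

# inside the training loop, after each parameter update:
x_test = 0.995 * x_test + 0.005 * x
```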

Dropout¶
A very important technique.
As you are doing a forward pass, you set some neurons randomly to zero.

\(U1\) is a binary mask of zeros and ones. We apply this mask to the first hidden layer \(H1\), effectively dropping (roughly) half of its units.
We do the same for the second hidden layer. Do not forget that we need to apply the same masks in the backward pass too.
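A sketch along the lines of the lecture's example: a 3-layer net with dropout (p = 0.5) on both hidden layers. The weight, bias, and input shapes here are assumptions:

```python
import numpy as np

p = 0.5  # probability of keeping a unit

# assumed toy shapes: 100-d input, two hidden layers of 50 units, 10 outputs
W1, b1 = np.random.randn(50, 100) * 0.01, np.zeros(50)
W2, b2 = np.random.randn(50, 50) * 0.01, np.zeros(50)
W3, b3 = np.random.randn(10, 50) * 0.01, np.zeros(10)

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first binary dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second binary dropout mask
    H2 *= U2                             # drop!
    out = np.dot(W3, H2) + b3
    return out  # the backward pass must apply the same masks U1, U2
```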

Motivation¶
Maybe it will prevent overfitting? Since any feature can be dropped, no single feature can dominate, and every neuron is forced to be useful.

Feature Co-adaptation¶

You cannot rely on a single feature.

A dropped-out neuron will not have connections to the previous layer, as if it were not there.
You are sub-sampling a part of your Neural Network, and you are only training that neural network on that single example that you have at that point in time.
You want to apply stronger dropout where there is a huge number of parameters.
In practice, you do not usually apply much dropout in the early convolutional layers of a ConvNet (they have few parameters); you increase the dropout strength deeper in the network, e.g., in the fully connected layers.
Instead of dropping activations, you can drop individual weights (connections). That is called DropConnect.¶

We would like to integrate out all of the noise. You can try all binary masks and average the result, but that is not really efficient.

You can approximate this with Monte Carlo sampling: average the predictions over many random masks at test time, but that is expensive.
Ideally, at test time we would just use all of the neurons, leaving none behind.

Can we use expectation?

With \(p = 0.5\), a linear neuron's expected output during training is only half of what it produces at test time with all units active, because half of its inputs were dropped.
So at test time we scale the outputs down to match that training-time expectation.

If we do not do this, we will end up having too large of an output compared to what we had in expectation at training time. Things will break in the NN, as they are not used to seeing such large outputs from the neurons.
Test Time Scaling¶

In this example, \(p\) can be \(0.5\).
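A sketch of the test-time forward pass, with the same assumed toy shapes as before: all units are active, but the activations are scaled by \(p\).

```python
import numpy as np

p = 0.5
W1, b1 = np.random.randn(50, 100) * 0.01, np.zeros(50)   # assumed toy shapes
W2, b2 = np.random.randn(50, 50) * 0.01, np.zeros(50)
W3, b3 = np.random.randn(10, 50) * 0.01, np.zeros(10)

def predict(X):
    # all units active; scale by p so expected activations match training time
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p   # NOTE: scale by p
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p  # NOTE: scale by p
    return np.dot(W3, H2) + b3
```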

Do not forget to backpropagate through the same masks during training.
Inverted Dropout¶

We sample new dropout masks on every forward pass (every mini-batch).
Even though the exact fraction of units dropped is random each time, we still scale by the expected value \(p = 0.5\).
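A sketch of inverted dropout with the same assumed toy network: the scaling by \(1/p\) moves to training time, so the test-time code stays untouched.

```python
import numpy as np

p = 0.5
W1, b1 = np.random.randn(50, 100) * 0.01, np.zeros(50)   # assumed toy shapes
W2, b2 = np.random.randn(50, 50) * 0.01, np.zeros(50)
W3, b3 = np.random.randn(10, 50) * 0.01, np.zeros(10)

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p   # mask already scaled by 1/p
    H1 *= U1
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p
    H2 *= U2
    return np.dot(W3, H2) + b3

def predict(X):
    # no extra scaling needed at test time
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    return np.dot(W3, H2) + b3
```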

Implement what you learn, fast. (Deep Learning Summer School talk by Geoffrey Hinton.)

Go through the notes. Here is the link for it.

Convolutional Neural Networks¶
LeNet-5 (LeCun et al., 1998).

Fei-Fei Li told us about this. Here is a video on the experiment.
This is one neuron in the V1 cortex: neurons get excited about edges at a particular orientation.

Nearby cells in the visual cortex process nearby areas in your visual field. Locality is preserved in processing.

The visual cortex has a hierarchical organization, going from simple cells to complex cells through layers.

Fukushima's Neocognitron (1980) used a layered architecture of these locally receptive cells, each looking at a small part of the input.

But there was no backpropagation.
Yann LeCun built on top of this knowledge. He kept the rough architecture layout and trained the network using backpropagation.

AlexNet. In 2012, it won the ImageNet Challenge.

ConvNets can classify images.
They are really good at retrieval, showing similar images.

They can do detection.

They are used in cars. You can do perception of things around you.

ConvNets are really good face detectors, for example for tagging friends on Facebook.
Google is really interested in detecting street numbers.

They can detect poses and play computer games.

They can work on cells. They can read Chinese. They can recognize street signs.

They can recognize speech (a non-visual application). They can be used with text too.

Specific types of whales. Satellite image analysis.

They can do image captioning.

They can do DeepDream. ImageNet has a lot of dogs, so they hallucinate dogs.

I will not explain this one.

From an image, you can get results with a ConvNet almost equal to a monkey's IT Cortex.

We show a lot of images to both the monkey and the ConvNet.
If you look at how images are represented in the brain and the ConvNet, the mapping is really, really similar.

How do they work?¶

Next class.