7. Convolutional Neural Networks
Part of CS231n Winter 2016
Lecture 7: Convolutional Neural Networks¶
Two weeks to go on Assignment 2.

Project Proposal? About Project
The four-step process is still relevant.

How did we update parameters?

Dropout was casually introduced by Geoffrey Hinton.

LeNet is a classic architecture.

Hubel and Wiesel's experiments.

Feature hierarchy.


We start with a \(32x32x3\) CIFAR-10 image.
It has 3 channels, so the volume of activations is 3 deep. This corresponds to the 3rd dimension of the volume.

Convolutional Layer: A core building block.
A filter with a depth of 3 will cover the full depth of the input volume. However, it is spatially small (\(5x5\)).

We always extend through the full depth of the input volume.

We will learn \(w\). We are going to slide the filter over the input volume.
As we slide, we compute a 75-dimensional dot product at each position (\(5*5*3 = 75\) numbers).

This sliding process results in an activation map.
Activation Map Size¶
Because a \(5x5\) filter can start at indices 0 through 27 along each spatial dimension of the \(32x32\) input (\(32 - 5 + 1 = 28\)), we can place the filter at 28x28 distinct locations.
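A minimal NumPy sketch of this sliding computation for a single \(5x5x3\) filter (the function name and random inputs are purely illustrative):

```python
import numpy as np

def conv_single_filter(x, w, b, stride=1):
    """Slide one filter over the input volume and return its activation map.

    x: input volume of shape (H, W, D), e.g. (32, 32, 3) for CIFAR-10
    w: filter of shape (F, F, D), e.g. (5, 5, 3)
    b: scalar bias
    """
    H, W, D = x.shape
    F = w.shape[0]
    out_h = (H - F) // stride + 1   # 32 - 5 + 1 = 28
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + F, j * stride:j * stride + F, :]
            # the 75-dimensional dot product (5*5*3 numbers), plus the bias
            out[i, j] = np.sum(patch * w) + b
    return out

x = np.random.randn(32, 32, 3)
w = np.random.randn(5, 5, 3)
print(conv_single_filter(x, w, b=0.1).shape)   # (28, 28)
```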

We will actually have a filter bank. Different filters will result in different activation maps.

6 filters will result in 6 activation maps.

Output Dimensions¶
After all the convolutions, we will have a new image sized \(28x28x6\)!

We will have these convolutional layers, which will have a certain number of filters. These filters will have a specific spatial extent (e.g., 5x5). This conv layer will slide over the input and produce a new image. This will be followed by a ReLU and another conv layer.

The filters now have to be \(5x5x6\).
Input Depth Matching¶
These filters are initialized randomly. They will become the parameters in our ConvNet.

When you look at the trained layers, the first layers represent low-level features: color pieces, edges, and blobs.
The first layers will look for these features in the input image as we convolve through it.
As you go deeper, we perform convolution on top of convolution, doing dot products over the outputs of the previous conv layer.
It will put together all the color/edge pieces, making larger and larger features that the neurons will respond to.
For example, mid-level layers might look for circles.
And in the end, we will build object templates and high-level features.
In the leftmost picture, these are raw weights (\(5x5x3\) array).
In the middle and right, these are visualizations of what those layers are responding to in the original image.

This is pretty similar to what Hubel and Wiesel imagined: a bar of a specific orientation leads to more complex features.

A small piece of a car as input.
32 filters of size 5x5 in the first convolutional layer.
Below are example activation maps. White corresponds to high activation, and black corresponds to low activation (low numbers).
Where the blue arrow points to orange stuff in the image, the activation shows that the filter is happy about that part.

A layout like this:
Architecture Overview¶
Also, a Fully Connected layer at the end.
Every row is an activation map. Every column is an operation.
ReLU Layer¶
The image feeds into the left side. We do convolution, thresholding (ReLU), then another Conv, another ReLU, then pooling...
Piece by piece, we create these 3D volumes of higher and higher abstraction. We end up with a volume connected to a large FC layer.
The last matrix multiplication will give us the class scores.

Filter Count¶
We are only concerned about spatial dimensions at this point.

We slide the filter one position at a time.

We can use a stride of 2, which is a hyperparameter.

We move two steps at a time.

We are done in fewer steps!

Can we use a stride of 3?

No, we cannot.

The output size is given by a simple formula: \((N - F)/S + 1\), for input size \(N\), filter size \(F\), and stride \(S\). It tells you which strides are possible, because the result must always be an integer.
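For example, assuming the slide example of a \(7x7\) input with a \(3x3\) filter:

- Stride 1: \((7 - 3)/1 + 1 = 5\), so a \(5x5\) output.
- Stride 2: \((7 - 3)/2 + 1 = 3\), so a \(3x3\) output.
- Stride 3: \((7 - 3)/3 + 1 = 2.33\), not an integer, so a stride of 3 does not fit.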

We can pad! Padding is also a hyperparameter.

If we zero-pad the border with 1 pixel (for a \(3x3\) filter at stride 1), we can get an output of the same size.
Spatial Preservation¶
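In general, with stride 1, zero-padding by \(P = (F - 1)/2\) preserves the spatial size: \(P = 1\) for \(F = 3\), \(P = 2\) for \(F = 5\), \(P = 3\) for \(F = 7\).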

Padding Strategy¶

If we do not pad, the size will shrink! We do not want that, as we will have many layers.

10 filters with \(5x5x3\) shape.

The padding is correct (pad 2 for the \(5x5\) filters at stride 1), so the spatial size will not change.
10 filters will generate 10 different activation maps.
The output is shaped \(32x32x10\).

Parameter Counting¶

Each filter has \(5*5*3 = 75\) weights plus a single bias, so 76 parameters per filter. The total is \(10 * 76 = 760\).

Here is the summary so far:

Filter Hyperparameters¶
- The number of filters - \(K\)
- The spatial extent of the filters - \(F\)
- The stride - \(S\)
- The amount of zero padding - \(P\)

We can compute the spatial size of the activation output with the formula \((W - F + 2P)/S + 1\); the depth of the output will be the number of filters \(K\). \(F\) is usually odd.
The total number of parameters is \((F*F*D + 1) * K\), where \(D\) is the input depth and the \(+1\) is each filter's bias.
\(K\) is usually chosen as a power of 2 for computational reasons. Some libraries use special subroutines when they see powers of 2.
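A tiny sanity-check helper (plain Python; the function name and layout are my own, not from the lecture) that implements the output-size formula and the parameter count:

```python
def conv_layer_stats(W_in, H_in, D_in, K, F, S, P):
    """Output volume shape and parameter count of a conv layer.

    K: number of filters, F: filter size, S: stride, P: zero padding.
    """
    W_out = (W_in - F + 2 * P) // S + 1
    H_out = (H_in - F + 2 * P) // S + 1
    D_out = K                          # one activation map per filter
    n_params = (F * F * D_in + 1) * K  # +1 for each filter's bias
    return (W_out, H_out, D_out), n_params

# The example from above: 32x32x3 input, 10 filters of 5x5, stride 1, pad 2
print(conv_layer_stats(32, 32, 3, K=10, F=5, S=1, P=2))
# -> ((32, 32, 10), 760)
```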

We can use \(1x1\) convolutions. You are still doing a lot of computation; you are just not merging information spatially.
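A minimal NumPy sketch (the sizes are made up for illustration) showing that a \(1x1\) convolution is just a per-pixel fully connected layer across depth:

```python
import numpy as np

H, W, D, K = 32, 32, 64, 32       # hypothetical input size and filter count
x = np.random.randn(H, W, D)      # input volume
w = np.random.randn(D, K)         # K filters, each of shape 1x1xD
b = np.zeros(K)

# Each spatial position is mapped through the same D -> K weight matrix,
# so depth information is mixed but nothing is merged spatially.
out = (x.reshape(-1, D) @ w + b).reshape(H, W, K)
print(out.shape)                  # (32, 32, 32): same spatial size, new depth
```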
Zero Padding Rationale¶
Non-Square Inputs¶
We will see later how to work with non-square inputs.

Terminology¶
The API of SpatialConvolution in Torch:
- nInputPlane: The depth of the input layer.
- nOutputPlane: How many filters you have.
- kW, kH: Kernel width and height.
- dW, dH: Step size (stride).
- padW, padH: The padding you want.
This is referring to Lua Torch (Torch7), the predecessor of the modern PyTorch, where the equivalent class is Conv2d.
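As a rough modern equivalent (the argument mapping is my own, not from the lecture), the same hyperparameters appear as arguments of PyTorch's torch.nn.Conv2d:

```python
import torch
import torch.nn as nn

# nInputPlane -> in_channels, nOutputPlane -> out_channels,
# kW/kH -> kernel_size, dW/dH -> stride, padW/padH -> padding
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)

x = torch.randn(1, 3, 32, 32)   # a batch of one CIFAR-10-sized image
print(conv(x).shape)            # torch.Size([1, 10, 32, 32])
```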

It is the same in Caffe.

It is the same in Lasagne.
Biological Perspective¶

With this filter, each position of the convolution produces a single number.

The output of the filter at this position is just a neuron fixed in space, looking at a small part of the input, computing \(w^T x + b\).
It has no connections to other parts of the image, hence local connectivity.

We sometimes refer to the neuron's receptive field as the size of the filter (the region of the input the filter is looking at).
In a single activation map (28x28 grid), these neurons share parameters (because one filter computes all the outputs), so all the neurons have the same weights \(w\).
Weight Sharing¶
We have several filters: within a single activation map the neurons share weights spatially, but across the depth of the output volume (different activation maps) the neurons have different weights.

A nice advantage of both local connectivity and spatial parameter sharing is that it basically controls the capacity of the model.
It makes sense that neurons would want to compute similar things. For example, if they are looking for edges, a vertical edge in the middle of an image is just as useful anywhere else spatially.
It makes sense to share those parameters spatially as a way of controlling overfitting.

We have covered Conv and ReLU layers.
Pooling Layer¶

The Conv layer usually preserves the spatial size (with padding).
The spatial shrinking is done by pooling.
Motivation¶
The most common method is max pooling.

It halves each spatial dimension of every activation map. Average pooling does not work as well.

We need to know the filter size and stride. \(2x2\) with stride \(2\) is common.

The depth of the volume does not change.
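A minimal NumPy sketch of \(2x2\) max pooling with stride 2 (assuming even height and width); note that the depth passes through unchanged:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a volume of shape (H, W, D)."""
    H, W, D = x.shape
    # Split each spatial axis into (position, offset-within-window) and
    # take the max over every non-overlapping 2x2 window.
    return x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

x = np.random.randn(32, 32, 10)
print(max_pool_2x2(x).shape)   # (16, 16, 10)
```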
Fully Connected Layer¶

With 3 pooling layers (\(2x2\), stride 2), we go from 32 -> 16 -> 8 -> 4.
At the end, we have a \(4x4x10\) volume of activations after the last pooling.
That goes into the Fully Connected layer.
Demo¶
Website here. It achieves 80% accuracy for CIFAR-10 in JavaScript!
It uses 6 or 7 nested loops. The V8 engine in Chrome is good, so JS is fast.

All running in the browser.
Stacking Layers¶
Why is it that we stack layers? 🤔
Because we want to do dot products, and we can backpropagate through them efficiently.
Batch Dimensions¶
If you are working with batches of images, all the volumes inside the ConvNet are 4D arrays; with a single image, they are 3D arrays.
Visualization¶
Intermediate filters cannot be visualized directly in image space; instead (as Yann LeCun did), we visualize what the neurons are responding to in the original image.
Pooling Rationale¶
When you do pooling, you throw away some spatial information because you want to eventually get the scores out.
Boundary Effects¶
Because of padding, the statistics at the border differ from those at the center, but in practice we do not worry about it.
Backpropagation Compatibility¶
Anything you can backpropagate through, you can put in a ConvNet / neural net.
Case Studies¶
LeNet-5¶
Figure from the paper. 6 filters, all \(5x5\), with sub-sampling (pooling) layers.

AlexNet¶
It won the 2012 ImageNet challenge. 60 million parameters.
The input images are much larger than CIFAR-10's \(32x32\).
Why two separate streams? Alex had to split the convolutions across 2 separate GPUs.
Let's imagine if it had a single stream.

The output volume will be: \(55x55x96\), because we have 96 kernels/filters.

Total parameters in this layer: each filter is \(11x11x3\) and there are 96 of them, so roughly \((11*11*3)*96 \approx 35K\).
We are not even sure what Alex did. 😅
The input image is \(224x224\), but for the math to add up, the input should be \(227x227\).

After pooling? Half of the spatial size, so \(27x27x96\).
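Plugging into the output-size formula (AlexNet's first conv layer uses \(11x11\) filters at stride 4, and the first pooling layer is \(3x3\) with stride 2):

- Conv1: \((227 - 11)/4 + 1 = 55\), giving the \(55x55x96\) volume.
- Pool1: \((55 - 3)/2 + 1 = 27\), giving the \(27x27x96\) volume.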

How many parameters are in the pooling layer?

0 - only Conv layers have parameters.

Summary:

Full architecture:

This is a classic sandwich. Sometimes filter sizes change. We backpropagate through all of this.

First use of ReLU, used normalization layers (not used anymore), used dropout only on the last fully connected layers, and an ensemble of 7 models.
ZFNet¶
Built on top of AlexNet.
\(11x11\) with stride 4 was too drastic, so they changed it to \(7x7\) filters with stride 2.
They used more filters in Conv 3, 4, and 5.
The error became 14.8%. The author of this paper founded a company called Clarifai and reported 11% error.
Here is the company's About blurb:
Founded in 2013 by Matthew Zeiler, Ph.D., a foremost expert in machine learning, Clarifai has been a market leader since winning the top five places in image classification at the ImageNet 2013 competition.
Top-5 Error
There are 1000 classes, and we give the classifier 5 chances to guess. 😌

VGGNet¶
140 million parameters. They tried architectures of different depths, but used a single filter configuration throughout: \(3x3\) convs with stride 1 and pad 1, plus \(2x2\) max pooling with stride 2. The question is:
Layer Count
Turns out, 16 layers performed the best. They dropped the error to 7.3%.

This is the full architecture:
Spatial Reduction
Spatially the volumes get smaller, while the number of filters increases.

Memory Usage

If we add up all the activation counts, it's about 24M numbers per image; stored as float32 (4 bytes each), that's roughly 93 MB of memory for the intermediate activation volumes.
That is maintained in memory because we need it for backpropagation.
Just to represent 1 image, it takes 93 MB of RAM ONLY for the FORWARD pass. For the backward pass, we also need the gradients, so we end up with a 200 MB footprint.
Parameter Count
Most memory is in early Conv layers; most parameters are in late FC layers.
We found that these huge Fully Connected layers are not necessary.
Average Pooling
Instead of FC on \(7x7x512\), you can average on \(7x7\) and make it a single \(1x1x512\), which works just as well.
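A minimal NumPy sketch of that averaging step (shapes as in the text):

```python
import numpy as np

x = np.random.randn(7, 7, 512)   # final conv/pool activation volume
pooled = x.mean(axis=(0, 1))     # average over the 7x7 spatial grid
print(pooled.shape)              # (512,) -- effectively a 1x1x512 volume
```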
GoogLeNet¶
The key innovation was the Inception module: instead of plain stacked convolutions, GoogLeNet is built from a sequence of inception modules. You can read the paper for details.
It won the 2014 challenge with 6.7% error.

At the very end, they had \(7x7x1024\) and they did an average pool! That means far fewer parameters!

ResNet¶

Here is what the history looks like.

More layers. You have to be careful how you increase the number of layers.
Plain vs ResNets
A 56-layer plain network performs worse than a 44-layer one. Why?
With ResNets, increasing the number of layers consistently gives better results.

Runtime Performance
At runtime, it is actually faster than VGGNet. How?

This is a plain net below:
We will have skip connections.
You take a \(224x224\) image, pool by a huge factor, and work spatially on \(56x56\). It's still really good.
They give up spatial resolution very early on in exchange for depth, because depth is where their advantage lies.

In a plain net, you have some function \(H(x)\) you are trying to compute: you transform your representation with a weight layer, threshold it, and so on.
In a ResNet, your input flows in, but instead of computing the transformation \(H(x)\) directly, you compute what to add to your input: a residual \(F(x)\), so the output is \(F(x) + x\).
You compute a delta on top of your original representation instead of a whole new representation right away, which would discard the original information about \(x\).
Delta Modulation Analogy
As an analogy, delta modulation encodes the difference between successive samples rather than the samples themselves, somewhat like how a ResNet learns the residual (the difference between the desired output and the input) rather than the full mapping.
You are just computing these deltas on top of the \(x\)'s.
If you think about the gradient flow in a ResNet: each block ends in an addition, and addition distributes the incoming gradient to all of its children, so the gradient also flows straight back through the skip connections, bypassing the weight layers.
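A hedged PyTorch sketch of a residual block (layer sizes and names are illustrative, not the exact published architecture): the block computes a residual and adds it back onto the input, so the gradient also flows straight back through the skip connection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block: out = relu(x + residual(x))."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The addition distributes the incoming gradient to both branches,
        # so it also flows straight back to x through the skip connection.
        return torch.relu(x + out)

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)   # torch.Size([1, 64, 56, 56])
```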

So you can train the layers close to the image, all the way down to the first conv layer, right away.

These are the commonly used hyperparameters.
- Batch norm layers will allow you to get away with a bigger learning rate.

- Using \(1x1\) Convs in clever ways.

This is the whole architecture; Andrej skipped it in the interest of time.

This was on the cover of Nature (the AlphaGo paper).

This was a convolutional network!

The input is \(19x19x48\) because they are using 48 different features based on the specific rules of Go. You can understand what is going on when you read the paper.
Another deep learning Go player: CrazyStone.

- The trend is to get rid of pooling and fully connected layers.
- Smaller filters and deeper architectures.
Done with lecture 7!