5. Training Neural Networks Part 1
Part of CS231n Winter 2016
Lecture 5: Training Neural Networks, Part I¶
The lecture begins with a few administrative notes about the assignments.

In this lecture, we transition from the theoretical architecture of neural networks to the practical reality of training them.
We have defined the score function and the loss function, and we know how to compute gradients via backpropagation. Now we must navigate the optimization landscape.
Project Proposals and Advice¶
Before beginning the technical content, a few words on course projects.


One effective strategy is fine-tuning. You rarely need to train a network from scratch. Instead, you can take a pre-trained model (trained on a large dataset like ImageNet) and adapt it to your specific problem.

You can "chop off" the final classification layer and treat the rest of the network as a fixed feature extractor, training only a new linear classifier on top. Alternatively, you can fine-tune the entire network.

There are many pre-trained models available (Caffe Model Zoo, etc.) that you can leverage.

A word of caution regarding compute resources:

Hyperparameter optimization requires significant computational power. Be mindful of your resource usage, as compute is finite.

Training Overview¶
We are now at the stage where we loop through the training process:

- Sample a batch of data.
- Forward prop to compute loss.
- Backprop to compute gradients.
- Update parameters.
This is an optimization problem.
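As a concrete (if toy) illustration, here is a hedged numpy sketch of these four steps using a linear softmax classifier on random data; the setup is purely illustrative and not taken from the course code.

```python
import numpy as np

# Toy setup: random data and a linear softmax classifier, just to make the
# four-step loop concrete. (Illustrative sketch, not course code.)
np.random.seed(0)
N, D, C = 500, 20, 10                  # examples, feature dim, classes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.01 * np.random.randn(D, C)

learning_rate = 1e-1
for step in range(100):
    # 1. Sample a batch of data.
    idx = np.random.choice(N, 64)
    Xb, yb = X[idx], y[idx]

    # 2. Forward prop: compute class scores and the softmax loss.
    scores = Xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(yb)), yb]).mean()

    # 3. Backprop: gradient of the loss with respect to W.
    dscores = probs
    dscores[np.arange(len(yb)), yb] -= 1
    dscores /= len(yb)
    dW = Xb.T.dot(dscores)

    # 4. Update parameters (vanilla SGD).
    W -= learning_rate * dW
```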

Neural networks can be incredibly large and complex.

However, the complexity is managed by the chain rule. We simply need to implement the forward and backward API for each module.


For example, a simple multiplication gate:
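A minimal sketch of what such a module's forward/backward API could look like, with the class and method names chosen for illustration (not the course's actual layer API):

```python
class MultiplyGate:
    """Illustrative module with a forward/backward API: f(x, y) = x * y."""
    def forward(self, x, y):
        self.x, self.y = x, y      # cache inputs for the backward pass
        return x * y

    def backward(self, dout):
        # Chain rule: local gradients (y and x) times the upstream gradient dout.
        dx = self.y * dout
        dy = self.x * dout
        return dx, dy
```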

We can think of these as LEGO blocks that we stack together.

We have seen activation functions before, which introduce non-linearity.

And we have discussed the loose inspiration from biological neurons.

In a fully connected network, the layers with learnable weights (Fully Connected layers) are interleaved with activation functions.
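For instance, the forward pass of a two-layer fully connected network can be sketched as below (shapes and names are illustrative):

```python
import numpy as np

# Illustrative forward pass of a 2-layer fully connected net:
# FC layer -> ReLU non-linearity -> FC layer -> class scores.
def two_layer_forward(X, W1, b1, W2, b2):
    h = np.maximum(0, X.dot(W1) + b1)   # hidden layer with ReLU activation
    scores = h.dot(W2) + b2             # output layer (raw class scores)
    return scores
```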

History and Context¶
It is helpful to zoom out and look at the history of this field.

1957: The Perceptron (Rosenblatt): Early implementations were built with hardware circuits.

The activation function was a binary step function. Since this is not differentiable, backpropagation as we know it was not possible. They used simple update rules.
1960: Adaline/Madaline (Widrow & Hoff): Researchers started stacking these units (Multilayer Perceptron).

However, without a way to train the hidden layers effectively, progress stalled.
1986: Backpropagation (Rumelhart, Hinton, Williams): The field was reignited by the derivation of backpropagation, allowing training of multi-layer networks.

Despite the excitement, training deep networks proved difficult. Gradients would vanish or explode, and training would get stuck.
2006: Deep Learning & RBMs (Hinton, Salakhutdinov): A breakthrough came with Deep Learning. The key idea was unsupervised pre-training using Restricted Boltzmann Machines (RBMs).

You would train the first layer to reconstruct the input, then freeze it and train the second layer, and so on. Finally, you would fine-tune the whole network with backpropagation. This initialization allowed for deeper networks.
2010-2012: The Explosion: By 2010, acoustic modeling (speech recognition) saw huge gains by replacing GMMs with Deep Neural Networks. Then came 2012.

AlexNet crushed the ImageNet competition. The field exploded.
Why 2012?
- Better initialization (no longer needed complex pre-training).
- Better activation functions (ReLU).
- More data (ImageNet).
- Better compute (GPUs).
Activation Functions¶
We will now focus on the specific choices we make when designing and training these networks. First: Activation Functions.


There are many options available.

Sigmoid¶
Historically, the sigmoid function, \(\sigma(x) = 1/(1 + e^{-x})\), was very common. It squashes real-valued inputs into the range [0, 1].

However, it has severe problems:
- Vanishing Gradients: When the neuron is saturated (output close to 0 or 1), the gradient is nearly zero.

During backpropagation, this local gradient is multiplied by the upstream gradient. If the local gradient is zero, it "kills" the gradient flow to all previous layers.

- Not Zero-Centered: The output is always positive.

If the inputs \(x\) to a neuron are always positive, then the gradients on the weights \(w\) all have the same sign, either all positive or all negative (determined by the sign of the upstream gradient).

This constrains the updates to lie in certain directions, producing inefficient zig-zagging paths toward the optimum.

Empirically, non-zero-centered inputs lead to slower convergence, so we prefer activations (and data) that are zero-centered.
- Expensive: The exp() function is computationally expensive compared to simple arithmetic operations.

When training ConvNets, most of the compute time actually goes into convolutions and dot products, so we want to make sure these core operations are computed efficiently.
Yann LeCun recommended using tanh() instead of sigmoids.
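To make the saturation problem concrete, here is a small illustrative check of the sigmoid's local gradient, \(\sigma(x)(1 - \sigma(x))\), at a few inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The local gradient of the sigmoid is sigma(x) * (1 - sigma(x)).
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigmoid(x)
    print(f"x={x:6.1f}  sigmoid={s:.5f}  local grad={s * (1 - s):.5f}")
# At x = +/-10 the local gradient is about 4.5e-5, so almost no gradient
# flows back through a saturated sigmoid neuron.
```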
Tanh¶
The hyperbolic tangent squashes numbers to [-1, 1].

- Pros: It is zero-centered.
- Cons: It still suffers from the vanishing gradient problem when saturated.
ReLU (Rectified Linear Unit)¶
The modern standard: \(f(x) = \max(0, x)\).

- Pros:
- Does not saturate in the positive region.
- Computationally very efficient.
- Converges much faster (e.g., 6x faster for AlexNet).
- Cons:
- Not zero-centered.
- Dead ReLU Problem: When \(x < 0\), the gradient is exactly zero.

If a neuron falls into the negative region, its output is 0 and its gradient is 0. It effectively "dies" and may never recover.

In practice, you might find that 10-20% of your network is "dead" if you are not careful.

Tip: Initialize biases with a small positive number (e.g., 0.01) to ensure ReLUs start active.
Leaky ReLU¶
Attempts to fix the dead ReLU problem by having a small negative slope (e.g., 0.01) when \(x < 0\).

PReLU (Parametric ReLU)¶
The slope in the negative region is a learnable parameter \(\alpha\). Andrej is not completely sold on them.

ELU (Exponential Linear Unit)¶
A recent proposal (Clevert et al., 2015) that has benefits of ReLU but is closer to zero mean.

Maxout¶
Proposed by Ian Goodfellow et al. It generalizes ReLU and Leaky ReLU.
\(f(x) = \max(w_1^T x + b_1, w_2^T x + b_2)\)

It has no saturation and no dying ReLU problem, but it doubles the number of parameters per neuron.
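For reference, here are hedged numpy sketches of these activation functions; the \(\alpha\) defaults are common illustrative choices, and maxout is shown with two linear pieces:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # PReLU has the same form, but alpha is learned rather than fixed.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x, W1, b1, W2, b2):
    # Element-wise max over two linear functions of the input.
    return np.maximum(x.dot(W1) + b1, x.dot(W2) + b2)
```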

Summary of Activations¶

Recommendation:
- Use ReLU. Be careful with your learning rates.
- Try Leaky ReLU or Maxout.
- Try Tanh, but don't expect much.
- Never use Sigmoid.

Data Preprocessing¶
We generally want our input data to be well-behaved.

Standard practice in Machine Learning involves:
- Mean Subtraction: Center the data around zero.
- Normalization: Scale the data so each dimension has unit variance.

Other techniques like PCA and Whitening (decorrelating the data) are common in general ML but less common in image processing due to the high dimensionality.

For Images:
- Subtract the mean image (e.g., AlexNet).
- Or subtract the per-channel mean (e.g., VGGNet).
- Normalization is usually not strictly necessary because pixel values are already on the same scale (0-255).
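A minimal numpy sketch of these preprocessing steps, assuming a flattened data matrix of shape [N, D] and computing statistics on the training data only (the array here is just a random stand-in):

```python
import numpy as np

# Stand-in for a training set, e.g. flattened images with pixel values in [0, 255].
X = np.random.rand(1000, 3072).astype(np.float64) * 255.0

mean = X.mean(axis=0)            # per-dimension mean, computed on training data only
std = X.std(axis=0) + 1e-8       # per-dimension std (epsilon avoids divide-by-zero)

X_zero_centered = X - mean              # mean subtraction
X_normalized = X_zero_centered / std    # plus scaling to unit variance

# For images it is common to only zero-center: subtract the mean image
# (AlexNet-style) or a single per-channel mean (VGGNet-style), and skip scaling.
```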

Weight Initialization¶
How do we start the optimization? We cannot initialize all weights to zero.

If all weights are zero, every neuron computes the same output and gets the same gradient update. There is no symmetry breaking.

Small Random Numbers¶
A common first attempt is small random noise: W = 0.01 * np.random.randn(D, H).

This works for shallow networks, but fails for deep ones.
Let's look at an experiment with a 10-layer network using Tanh non-linearities.

As data flows through the layers, it is multiplied by small numbers (0.01). The activations quickly shrink to zero.

Why is this bad? During backpropagation, the gradient on the weights is proportional to the layer's input: \(\partial L/\partial W = X^T \,\partial L/\partial f\). If the input \(X\) (the activation from the previous layer) is tiny, the gradient will be tiny. The network will not learn.

Large Random Numbers¶
What if we use larger weights? W = 1.0 * np.random.randn(D, H).

Now the neurons saturate. Tanh outputs become -1 or +1. The gradients become zero. The network does not learn.
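Both failure modes can be reproduced with a short script in the spirit of the lecture's experiment (this is a reconstruction, not the exact lecture code): push random data through a 10-layer tanh network and track the standard deviation of the activations at each layer.

```python
import numpy as np

def activation_stats(weight_scale, num_layers=10, hidden=500):
    """Forward random data through a deep tanh net; return activation std per layer."""
    np.random.seed(0)
    x = np.random.randn(1000, hidden)          # random input data
    stds = []
    for _ in range(num_layers):
        W = weight_scale * np.random.randn(hidden, hidden)
        x = np.tanh(x.dot(W))
        stds.append(x.std())
    return stds

print("scale 0.01:", ["%.4f" % s for s in activation_stats(0.01)])  # collapses toward 0
print("scale 1.0 :", ["%.4f" % s for s in activation_stats(1.0)])   # saturates at +/-1
```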
Xavier Initialization¶
We want the variance of the input to be the same as the variance of the output.
Glorot and Bengio (2010) derived a formula for this:
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

This keeps the activations well-scaled across many layers.

However, this derivation assumes linear activations. If we use ReLU, it breaks. ReLU kills half the variance (sets negative values to 0).

He Initialization¶
He et al. (2015) corrected this for ReLU by adding a factor of 2.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)

This is the current standard for initializing ReLU networks.

Batch Normalization¶
PS: This is explained in more detail in assignment 2.
Batch Normalization (Ioffe & Szegedy, 2015) is a technique to explicitly force the activations to be unit gaussian throughout the network.

The Idea: For each feature dimension, compute the mean and variance over the current mini-batch, then normalize.

We typically insert this layer after the Fully Connected or Convolutional layer, and before the non-linearity.

However, we don't want to constrain the network too much. We add learnable parameters \(\gamma\) (scale) and \(\beta\) (shift) so the network can learn to undo the normalization if it needs to.
At Test Time: We don't use the batch mean/variance. Instead, we use a running average of mean/variance collected during training.
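A hedged sketch of the training-time forward pass, with gamma and beta as the learnable scale and shift (the running-average bookkeeping for test time is omitted):

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """x: [N, D] mini-batch of activations; gamma, beta: learnable [D] parameters."""
    mu = x.mean(axis=0)                       # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to roughly unit gaussian
    out = gamma * x_hat + beta                # learnable scale and shift
    return out
```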

Benefits:
- Reduces sensitivity to initialization.
- Allows higher learning rates.
- Acts as a regularizer.


It is a good thing to use, though it does add a runtime penalty.
Layer Normalization: A related technique is Layer Normalization, which normalizes across the features for a single example, rather than across the batch. This is useful for RNNs or when batch sizes are small.

Babysitting the Learning Process¶
Now we look at the practical steps of monitoring training.

Step 1: Preprocessing: Zero-center your data.

Step 2: Architecture: Choose your architecture (e.g., 2-layer net, 50 hidden neurons).

Step 3: Double Check the Loss: Disable regularization. The loss should be around \(-\log(1/C)\) where \(C\) is the number of classes. For CIFAR-10 (\(C=10\)), loss should be \(\approx 2.3\).

If you add regularization, the loss should go up.
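This expected value is easy to verify; for example:

```python
import numpy as np

num_classes = 10                         # CIFAR-10
expected_loss = -np.log(1.0 / num_classes)
print(expected_loss)                     # ~2.3026: roughly what a softmax classifier
                                         # with random weights and no regularization gives
```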

Step 4: Sanity Check (Overfit Small Data): Take a tiny subset of data (e.g., 20 examples). Turn off regularization. Train. You should be able to get 100% accuracy and loss of 0.


If you can't overfit a small dataset, your model is broken.
Step 5: Find Learning Rate: Now use the full dataset (with small regularization). Start with a small learning rate.

If the loss doesn't go down, the learning rate is too low.


Notice that the loss barely changes, yet the accuracy jumps? This is because the weights are shifting only slightly, just enough to make the correct class scores barely the highest.

If the learning rate is too high, the loss explodes (NaN).


You want to find a learning rate that is "just right" (typically somewhere in the range \(10^{-5}\) to \(10^{-3}\)).


Hyperparameter Optimization¶
We need to find the best hyperparameters (Learning Rate, Regularization, Dropout, etc.).

Strategy: Coarse to Fine. First, search a wide range for a few epochs.

Tip: Optimize in Log Space.
Learning rates and regularization strengths are multiplicative. Sample exponents uniformly from a range.
10 ** uniform(-3, -6)

Once you find a good region, narrow the search and run for longer.
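Putting the two ideas together, here is a hedged sketch of a coarse random search in log space; train_and_evaluate is a hypothetical placeholder for training briefly and returning validation accuracy:

```python
import numpy as np

results = []
for trial in range(100):
    # Sample exponents uniformly, so values are spread evenly across orders of magnitude.
    lr = 10 ** np.random.uniform(-6, -3)     # learning rate
    reg = 10 ** np.random.uniform(-5, 5)     # regularization strength
    # train_and_evaluate is a hypothetical placeholder: train for a few
    # epochs and return validation accuracy.
    val_acc = train_and_evaluate(lr=lr, reg=reg, num_epochs=5)
    results.append((val_acc, lr, reg))

# Inspect the best settings, then narrow the ranges and search again for longer.
for val_acc, lr, reg in sorted(results, reverse=True)[:10]:
    print(f"val_acc={val_acc:.3f}  lr={lr:.2e}  reg={reg:.2e}")
```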

Random Search vs. Grid Search: Always use Random Search.

Grid search is inefficient because some hyperparameters are more important than others. Random search explores more unique values for the important parameters.

Visualizing Results: Plot your results.

You cannot spray and pray :).

If your best results are on the edge of your search range, you need to shift the range!



Evaluation¶
Monitor your loss curves.

(Check out lossfunctions.tumblr.com for examples of loss curves).

Monitor the gap between training and validation accuracy.
- Big gap = Overfitting (increase regularization).
- No gap = Underfitting (increase model capacity).

Update/Weight Ratio: Track the ratio of the update magnitude to the weight magnitude. It should be around \(10^{-3}\).
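A quick illustrative sketch of computing that ratio for a single weight matrix (the names and numbers here are made up):

```python
import numpy as np

# W: a weight matrix; dW: its gradient. After forming the SGD update,
# compare the norm of the update to the norm of the weights.
W = 0.01 * np.random.randn(500, 500)
dW = 1e-3 * np.random.randn(500, 500)
learning_rate = 1e-2

update = -learning_rate * dW
ratio = np.linalg.norm(update) / np.linalg.norm(W)
print(ratio)   # rule of thumb: aim for roughly 1e-3; much higher or lower
               # suggests the learning rate is too high or too low
```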

Summary¶

We have covered:
- Activation Functions (use ReLU).
- Data Preprocessing (zero-center).
- Weight Initialization (use Xavier/He).
- Batch Normalization (use it).
- Hyperparameter Optimization (random search in log space).

In the next lecture, we will continue with parameter updates and more advanced training techniques.