Fast.ai: What I Learned from Lessons 1-3

Fast.ai is a great deep learning course for those who prefer to learn by doing. Unlike other courses, here you build a deep neural network that achieves good results on an image recognition problem within an hour. You start from working code and then dig deeper into the theory and the ways to improve your algorithm. The course was created by Jeremy Howard and Rachel Thomas.

I had prior knowledge of machine learning in general and neural networks in particular: I completed Andrew Ng’s Machine Learning course on Coursera, read a lot of articles, papers, and books (including parts of the Deep Learning book), and I’ve been building machine learning applications at work for more than a year. Nevertheless, I learned a lot from the fast.ai lessons. When you read books or papers, you learn a lot about the mathematics of deep learning and little about how to apply it in practice. Fast.ai is very helpful for learning the practical aspects: how to split the dataset, how to choose good hyperparameter values, and how to prevent overfitting.

So far, I have gone through the first three lessons. Lesson 1 is mostly about setting up a development environment: set up an AWS account, create a virtual machine on Amazon EC2, run Jupyter Notebook there, and execute the code that trains a convolutional neural network. The code sets up a Vgg16 CNN, loads weights pretrained on the ImageNet dataset, finetunes the model on the dataset from the Dogs vs. Cats Kaggle competition, and then makes predictions on the test data to submit to Kaggle. I had done something similar before, but it still surprises me how easy it is to do transfer learning with CNNs and get great results. Very inspiring.

They use Keras with the Theano backend (you can use the TensorFlow backend too; all the code still works).
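
From memory, the core of the lesson 1 notebook boils down to roughly this. It relies on the vgg16.py helper that ships with the course; `path` and `batch_size` are placeholders, and the exact helper signatures may differ slightly from what I show here:

```python
from vgg16 import Vgg16  # helper class shipped with the course notebooks

path = 'data/dogscats/'  # placeholder: directory with train/ and valid/ subfolders
batch_size = 64

vgg = Vgg16()  # builds the Vgg16 architecture and loads the ImageNet weights
batches = vgg.get_batches(path + 'train', batch_size=batch_size)
val_batches = vgg.get_batches(path + 'valid', batch_size=batch_size * 2)
vgg.finetune(batches)                      # swap the 1000-way ImageNet head for a 2-way one
vgg.fit(batches, val_batches, nb_epoch=1)  # fine-tune on dogs vs. cats
```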

In lesson 2, Jeremy goes over the details of his approach to entering the Dogs vs. Cats competition. He gets more into the inner workings of neural network training: how to initialize weights, the structure of a CNN architecture, how the loss is calculated and why it matters, and how stochastic gradient descent works.

After you have trained a model, it is useful to check what it has learned. Error analysis is the first step in improving the model: if you know what the typical errors are, you may get some insight into how to fix them. In lesson 2, the authors suggest evaluating the model on the validation set and visualizing the following (a sketch of how these examples might be selected comes after the list):

  1. A few examples of correct labels at random
  2. A few examples of incorrect labels at random
  3. Examples of the most accurate labels of each class (i.e., those with the highest probability that are correct)
  4. Examples of the most inaccurate labels in each class (i.e., those with the highest probability that are incorrect)
  5. Examples of the most uncertain labels (i.e., those with probability closest to 0.5)
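
Here is a minimal sketch of how those five groups might be selected with NumPy, assuming a binary dogs-vs-cats setup; `model`, `val_data`, and the 0/1 array `val_labels` are placeholders for objects that already exist in the notebook:

```python
import numpy as np

# Placeholders: `model` is the trained Keras model, `val_data` the validation
# images, and `val_labels` the true classes (0 = cat, 1 = dog).
probs = model.predict(val_data)[:, 1]     # predicted probability of "dog"
preds = (probs > 0.5).astype(int)
n = 4

correct = np.where(preds == val_labels)[0]
incorrect = np.where(preds != val_labels)[0]

# 1-2. a few correct and incorrect examples at random
rand_correct = np.random.permutation(correct)[:n]
rand_incorrect = np.random.permutation(incorrect)[:n]

# 3. most confident correct "dog" predictions (repeat with 1 - probs for "cat")
correct_dogs = correct[preds[correct] == 1]
most_correct_dogs = correct_dogs[np.argsort(probs[correct_dogs])[::-1][:n]]

# 4. most confident incorrect "dog" predictions (predicted dog, actually cat)
incorrect_dogs = incorrect[preds[incorrect] == 1]
most_incorrect_dogs = incorrect_dogs[np.argsort(probs[incorrect_dogs])[::-1][:n]]

# 5. most uncertain predictions (probability closest to 0.5)
most_uncertain = np.argsort(np.abs(probs - 0.5))[:n]
```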


When a log loss (negative log-likelihood) metric is used to evaluate the model in a Kaggle competition, it’s useful to clip predicted values away from 0 and 1, e.g., set the result to 0.05 for all examples where the predicted value is less than 0.05 and to 0.95 for all examples with values over 0.95. This helps to reduce the overall loss because the log loss approaches infinity when a prediction close to 0 or 1 turns out to be wrong.
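
A sketch of the clipping, with `preds` standing in for the predicted dog probabilities on the test set:

```python
import numpy as np

# `preds` is a placeholder for the array of predicted probabilities.
clipped = np.clip(preds, 0.05, 0.95)  # push values away from 0 and 1
```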


In lesson 3, Jeremy explains how convolutional and max-pooling layers work, and gives more insights into training a model.

To make experimentation faster, you can split the model in two: one part with frozen (untrainable) convolutional layers and another with trainable dense layers. Precompute the features with the convolutional part once, then train only the dense part on those features. This saves a lot of computation: the convolutional layers are where most of the computation goes, while the dense layers are where most of the memory goes.
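
A rough sketch of the idea, assuming the Keras 1.x API used in the course and a fine-tuned Vgg16-style model called `vgg_model`; the split index, data arrays, and layer sizes are placeholders, not the exact notebook code:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten

# Keep everything up to and including the last convolutional block;
# `last_conv_idx` is a placeholder for that layer's index.
conv_model = Sequential(vgg_model.layers[:last_conv_idx + 1])

# Precompute the convolutional features once; this is the expensive part.
trn_features = conv_model.predict(trn_data, batch_size=64)
val_features = conv_model.predict(val_data, batch_size=64)

# A small trainable top model that takes the precomputed features as input.
top_model = Sequential([
    Flatten(input_shape=conv_model.output_shape[1:]),
    Dense(4096, activation='relu'),
    Dropout(0.5),
    Dense(4096, activation='relu'),
    Dropout(0.5),
    Dense(2, activation='softmax'),
])
top_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                  metrics=['accuracy'])

# Training now only touches the cheap dense layers.
top_model.fit(trn_features, trn_labels,
              validation_data=(val_features, val_labels),
              nb_epoch=3, batch_size=64)
```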

Jeremy also defines the concepts of underfitting and overfitting and discusses ways to deal with each.

If the model underfits: remove dropout after finetuning by replacing the dropout layers with layers that have dropout probability 0, then finetune some more with a lower learning rate (e.g., RMSprop(lr=0.00001, rho=0.7)).
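
A sketch of that underfitting fix, in the spirit of the lesson rather than its exact code. It continues from the previous sketch (`conv_model`, `top_model`, and the feature/label arrays are the same placeholders), and the weight copy assumes the layer shapes match:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.optimizers import RMSprop

# Same architecture as the fine-tuned dense model, but with dropout set to 0.
no_dropout_model = Sequential([
    Flatten(input_shape=conv_model.output_shape[1:]),
    Dense(4096, activation='relu'),
    Dropout(0.0),
    Dense(4096, activation='relu'),
    Dropout(0.0),
    Dense(2, activation='softmax'),
])
no_dropout_model.set_weights(top_model.get_weights())  # reuse the trained weights

# Continue fine-tuning with a very low learning rate.
no_dropout_model.compile(optimizer=RMSprop(lr=0.00001, rho=0.7),
                         loss='categorical_crossentropy', metrics=['accuracy'])
no_dropout_model.fit(trn_features, trn_labels,
                     validation_data=(val_features, val_labels),
                     nb_epoch=3, batch_size=64)
```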

Steps for reducing overfitting:

  • Add more data
  • Use data augmentation

It’s easy to do data augmentation on images with Keras’s ImageDataGenerator class (keras.preprocessing.image.ImageDataGenerator). It can randomly zoom, shift, rotate, and flip images within the ranges you set. When we use data augmentation, we can’t precompute the convolutional features anymore, so training takes longer; see the sketch after this list.

  • Use architectures that generalize well
  • Add regularization (dropout, L1, L2)

Nowadays people use dropout in all layers, usually with smaller dropout rates in early layers and larger ones in later layers. The reason is that dropping information in an early layer makes it unavailable to all of the later layers.

  • Reduce architecture complexity
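
As promised above, a sketch of data augmentation with Keras’s ImageDataGenerator (Keras 1.x API). The parameter values and directory layout are illustrative, and `model` stands for the full network (convolutional plus dense layers), since augmented images have to go through the convolutional layers every time:

```python
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rotation_range=10,
                         width_shift_range=0.1,
                         height_shift_range=0.1,
                         zoom_range=0.1,
                         horizontal_flip=True)

# Batches are augmented on the fly, which is why features can no longer be
# precomputed: every epoch sees slightly different images.
batches = gen.flow_from_directory('data/train', target_size=(224, 224),
                                  batch_size=64, class_mode='categorical')
model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=3)
```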


Other tricks.

Always use batch normalization. It speeds up training by roughly ten times and reduces overfitting. Add batch normalization after all of your layers, e.g., after dropout. When using it after a convolutional layer, pass the axis=1 parameter.
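
A sketch of where batch normalization might go, assuming Keras 1.x with Theano’s channels-first image ordering (hence axis=1 after convolutional layers); the architecture itself is just an illustration:

```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.layers.normalization import BatchNormalization

bn_model = Sequential([
    Convolution2D(32, 3, 3, activation='relu', input_shape=(3, 224, 224)),
    BatchNormalization(axis=1),   # axis=1 after a convolutional layer
    MaxPooling2D(),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    BatchNormalization(),         # e.g., after dropout
    Dense(2, activation='softmax'),
])
```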

See examples of many fine-tuning tricks in the Lesson 3 notebook and the MNIST notebook.

Setting a good learning rate schedule matters. In the MNIST notebook, Jeremy starts by setting the learning rate to 0.0001, then raises it to a very high 0.1, and then gradually reduces it until the model starts to overfit.
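
A sketch of stepping the learning rate by hand between fit() calls. The optimizer choice, epoch counts, and variable names (`model`, `X_train`, `y_train`) are placeholders; the exact schedule in the notebook differs:

```python
from keras import backend as K
from keras.optimizers import Adam

# Start with a small learning rate to get training moving.
model.compile(optimizer=Adam(lr=0.0001), loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, nb_epoch=1, batch_size=64)

K.set_value(model.optimizer.lr, 0.1)    # crank the rate up once training is stable
model.fit(X_train, y_train, nb_epoch=2, batch_size=64)

K.set_value(model.optimizer.lr, 0.01)   # then step it back down
model.fit(X_train, y_train, nb_epoch=4, batch_size=64)
```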


Lessons 2 and 3 also link to other material worth reading.

The Neural Networks Part 1 lecture notes from Stanford’s CS231n (Convolutional Neural Networks for Visual Recognition) have a lot of interesting ideas.

Regarding activation functions:

“Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.”

Regarding depth of neural networks:

“As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark contrast to Convolutional Networks, where depth has been found to be an extremely important component for a good recognition system (e.g. on order of 10 learnable layers). One argument for this observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which are made up of edges, etc.), so several layers of processing make intuitive sense for this data domain.”

Why smaller networks are not the best way to prevent overfitting:

“Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect – there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons.

The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It’s clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks. In practice, what you find is that if you train a small network the final loss can display a good amount of variance – in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you’ll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization.”

Chapter 3 of the Neural Networks and Deep Learning textbook has interesting insights into the history of cost functions. If you use a mean squared error cost with a sigmoid activation in the output layer, the gradient of the cost with respect to the output layer’s weights contains the derivative of the sigmoid as a factor. This derivative takes very small values when the sigmoid’s input is far from 0, which leads to slow learning because gradient descent makes only tiny weight updates. The cross-entropy cost function was introduced to cancel the sigmoid’s derivative out of that gradient. It works well with softmax outputs too.
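
A compressed version of that argument for a single sigmoid output neuron, in my own notation rather than the book’s: with pre-activation z = wx + b and output a = σ(z),

```latex
% Quadratic (MSE) cost for one example, C = \tfrac{1}{2}(a - y)^2, with a = \sigma(z):
\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x
% \sigma'(z) is close to 0 when z is far from 0, so the weight updates become tiny.

% Cross-entropy cost, C = -\bigl[y \ln a + (1 - y)\ln(1 - a)\bigr]:
\frac{\partial C}{\partial w} = (a - y)\,x
% the \sigma'(z) factor cancels, so the gradient is driven by the error alone.
```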


The authors keep the content of the course up to date. Soon they will update the course to use PyTorch instead of Keras. This will be interesting to see.

I highly recommend the fast.ai course to any developer who wants to learn deep learning.

Besides the online version, the course is taught in person once a year at the University of San Francisco. I’m going to attend in person this year (it starts on October 30). If you are going too, please send me a message.