Trains a Convolutional Neural Network to classify the CIFAR10 dataset:
In this blog post we will provide a guide through for transfer learning with the main aspects to take into account in the process, some tips and an example implementation in Keras using ResNet50 as the trained model. The task is to transfer the learning of a ResNet50 trained with Imagenet to a model that identify images from CIFAR-10 dataset. Several methods were tested to achieve a greater accuracy which we provide to show the variety of options for a training. However with the final model of this blog we get an accuracy of 96% on test set.
Learning something new takes time and practice but we find it easy to do similar tasks. This is thanks to human association involved in learning. We have the capability to identify patterns from previous knowledge an apply it into new learning.
When we meet a person than is faster or better than us in something like a video game or coding it is almost certain that he has do it before or there is an association with a previous similar activity.
If we know how to ride a bike, we don’t need to learn from zero how to ride a motorbike. If we know how to play football, we don’t need to learn from zero how to play futsal. If we know how to play the piano, we don’t need to learn from zero how to play another instrument.
The same is applicable to machines, if we train a model with a database, it’s not necessary to retrain from zero all the model to adjust to a new similar dataset. Both Imagenet and CIFAR-10 have images that can train a model to classify images. Then, it is very promising if we can save time training a model (because it can really take long time) and start using the weights of a previously trained model. We are going through this concept of transfer learning with all what you need to also build a model on your own.
Materials and Methods
Setting our environment
We are going to use Keras which is an open source library written in Python for neural networks. We work over it with tensorflow in a Google Colab, a Jupyter notebook environment that runs in the cloud.
The first thing we do is importing the libraries needed with the line of code below. Running the version as 1.x is optional, without that first line it will run the last version of tensorflow for Colab. We also use numpy and a function of tensorflow but depending on how you build your own model is not necessary to import them.
%tensorflow_version 1.ximport tensorflow.keras as K
Training a model uses a lot of resources so we recommend using a GPU configuration in the Colab. This will speed up the process and allow more testing. We will talk about some other ways to improve computation soon.
CIFAR-10 is commonly used in Deep Learning for testing models of Image Classification. It has 60,000 color images comprising of 10 different categories. The image size is 32x32 and the dataset has 50,000 training images and 10,000 test images. One can find the CIFAR-10 dataset here.
The categories are airplane, automobile, beer, cat, deer, dog, frog, horse, ship, truck. We can take advantage of the fact that these categories and a lot more are into the Imagenet collection.
To load a database with Keras, we use:
Now that the data is loaded, we are going to build a preprocess function for the data. We have X as a numpy array of shape (m, 32, 32, 3) where m is the number of images, 32 and 32 the dimensions, and 3 is because we use color images (RGB). We have a set of X for training and a set of X for validation. Y is a numpy array of shape (m, ) that we want to be our labels. Since we work with 10 different categories, we make use of one-hot encoding with a function of Keras that makes our Y into a shape of (m, 10). That also applies for the validation.
As we said before, we are going to use ResNet50 but there are also many other models available with pre-trained weights such as VGG16, ResNet101, InceptionV3 and DenseNet121. Each one has its own preprocess function for the inputs.
Next, we are going to call our function with the parameters loaded from the CIFAR10 database. It’s important to get to know your data to monitor the steps and know how to build your model. Let’s print the shapes of our x_train and y_train before and after the preprocessing.
Here we unpack both train and test variables directly and bring them to the preprocess function.
((50000, 32, 32, 3), (50000, 1))
((50000, 32, 32, 3), (50000, 10))
Using weights of a trained neural network
A pretrained model from the Keras Applications has the advantage of allow you to use weights that are already calibrated to make predictions. In this case, we use the weights from Imagenet and the network is a ResNet50. The option include_top=False allows feature extraction by removing the last dense layers. This let us control the output and input of the model.
From this point it all comes to testing and a bit of creativity. The starting point is very advantageous since we have weights that already serve for image classification but since we are using it on a completely new dataset, there is a need for adjustments. Our objective is to build a model that has high accuracy in their classifications. In this case, if an image of a dog is presented, it successfully identifies it as a dog and not as a train, for example.
Let’s say we want to achieve an accuracy of more than 88% on training data but we also wish that it doesn’t have overfitting. How do we get this? Well at this point our models may diverge, this is where we test what tools we can use for that objective. The important here is to learn about transfer learning and making robust models. We follow an example but we can run with different approaches that we will discuss.
The two approaches you can take in transfer learning are:
- Feature extraction
- Fine tuning
This refers on how you use the layers of your pretrained model. We have already a very huge amount of parameters because of the number of layer of the ResNet50 but we have calibrated weights. We can choose to ‘freeze’ those layers (as many as you can) so those values doesn’t change, and by that way saving time and computational cost. However as the dataset is entirely different is not a bad idea to train all the model
In this case, we ‘freeze’ all layers except for the last block of the ResNet50. The way to do this in Keras is with:
We can check that we did it correctly with:
The output is something like this (the are more layer that we omit). False means that the layer is ‘freezed’ or is not trainable and True that when we run our model, the weights are going to be adjusted.
Later, we need to connect our pretrained model with the new layers of our model. We can use global pooling or a flatten layer to connect the dimensions of the previous layers with the new layers. With just a flatten layer and a dense layer with softmax we can perform close the model and start making classification.
model = K.models.Sequential()
The final layers are below, you can see the complete code here. However we explain some more aspects to improve the model and make a good classification. We present the main aspects taken into account to build the model.
We have regularizers to help us avoid overfitting and optimizers to get a faster result. Each of them can also affect our accuracy, so we present what to take into account. The most important are:
- Batch size: It is recommended to use a number of batch size with powers of 2 (8, 16, 32, 64, 128, …) because it fits with the memory of the computer.
- Learning rate: For transfer learning it is recommended a very low learning rate because we don’t want to change too much what is previously learned.
- Number of layers: This depends on how much you relay from the layers of the pretrained model. We found that if we leave all the model for training just a flatten layer and a dense with softmax is enough but since we incorporated the feature extraction it was required more layers at the end.
- Optimization methods: We tested with SGD and RMSprop. SGD with a very low learning required more epochs (30) to complete a razonable training. We used RMSprop with 5 epochs to get our result.
- Regularization methods: To avoid overfitting we used Batch normalization and dropout in-between the dense layers.
- Callbacks: In Keras, we can use callbacks in our model to perform certain actions in the training such as weight saving.
We obtained an accuracy of 96% on training set and 94% on validation with 10 epochs.
The summary of the model is below. We found that batch normalization and dropout greatly reduces overfitting and it helps get better accuracy on validation set. The method of ‘freezing layers’ allows a faster computation but hits the accuracy so it was necessary to add dense layers at the end. The shape of the layers holds part of the structure of the original ResNet50 like it was a continuation of it but with the features we mentioned.
For ResNet50 what helped more to achieve a high accuracy was to resize the input from 32 x 32 to 224 x 224. This is because of how the model was constructed which in this sense was not compatible with the dataset but it was easy to solve by fitting it to the original size of the architecture. There was the option of using UpSampling to do this task but we find that the use of Keras layers lambda was way faster.
We confirmed that ResNet50 works best with input images of 224 x 224. As CIFAR-10 have 32 x 32 images, it was necessary to perform a resize. With this adjustment alone, the model can achieve a high accuracy, I think it was the most important for ResNet50.
A good recommendation when building a model using transfer learning is to first test optimizers to get a low bias and good results in training set, then look for regularizers if you see overfitting over the validation set.
The discussion over using freezing on the pretrained model continues. It reduces computation time, reduces overffiting but lowers accuracy. When the new dataset is very different from the datased used for training it may be necessary to use more layer for adjustment.
On the selecting of hyperparameters, it is important for transfer learning to use a low learning rate to take advantage of the weights of the pretrained model. This choice as the optimizer choice (SGD, Adam, RMSprop) will impact the number of epochs needed to get a successfully trained model.