Keras Transfer Learning


There is no question that deep learning has gained massive popularity in recent years. But have you ever wondered how you can leverage the power of deep learning from the comfort of your home, without renting massive computing resources from Amazon’s AWS or Microsoft’s Azure? In this tutorial, I will go over everything you need to know to master Keras transfer learning.

At a high level, I will build two simple neural networks in Keras on top of ResNet50’s pre-trained weights. Both networks share the same goal: to train on a dataset as fast as possible while getting the prediction accuracy as high as possible. They differ in process. The first method extracts intermediate image features, called bottleneck features, and trains on those. The second trains with the pre-trained weights frozen.

Training via Bottleneck Features

Before we get to the code, let us take a moment to fully understand what bottleneck features are and why they are so important.

Taking ResNet50 as an example, the roughly 50 pre-trained convolutional layers contain weights that remain untouched and are used exactly as-is to run our dataset through the network. At the output of those layers, we get a set of encoded images which we call bottleneck features. Correspondingly, any layers that come after them are called bottleneck layers.

ResNet50 deep learning architecture
Figure 1. The bottleneck features are the output features of the last max-pool layer, marked by the blue line on the far right of the diagram. The bottleneck layers (aka the top) are the custom layers after the blue line, which are meant to be fine-tuned on your small dataset.

Bottleneck features are extremely powerful due to their fire-and-forget nature. Specifically, once we extract all of the bottleneck features for our dataset of interest, we can forget about “how” ResNet50 was built and only care about “what” network created these features. In a practical sense, we can simply save the bottleneck features to disk. The next time we load the features, we can train just the bottleneck layers in a matter of minutes and get great results!

A caveat in saving and reusing the bottleneck features is that each time new data is added to the input dataset, we must regenerate the bottleneck features for the updated dataset, because the bottleneck features are computed directly from the input images.
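The save-once, reuse-later workflow and its caveat can be sketched as a small cache keyed on the input data. This is only an illustration under stated assumptions: `dataset_key`, `cached_features`, and `fake_extractor` are hypothetical names, and the stand-in extractor takes the place of what would be `resnet50.predict` in the real pipeline.

```python
import hashlib
import os
import tempfile

import numpy as np


def dataset_key(x):
    """Fingerprint the dataset so any change in the inputs yields a new key."""
    x = np.ascontiguousarray(x)
    header = f"{x.shape}|{x.dtype}".encode()
    return hashlib.sha256(header + x.tobytes()).hexdigest()[:16]


def cached_features(x, extractor, cache_dir):
    """Return (features, cache_hit), reusing a saved copy when the same data was seen before."""
    path = os.path.join(cache_dir, f"bottleneck_{dataset_key(x)}.npy")
    if os.path.exists(path):
        return np.load(path), True   # cache hit: skip the expensive forward pass
    features = extractor(x)
    np.save(path, features)
    return features, False           # cache miss: features were just computed and saved


def fake_extractor(x):
    # Hypothetical stand-in for resnet50.predict: one mean value per image and channel.
    return x.mean(axis=(1, 2))


cache_dir = tempfile.mkdtemp()
images = np.random.RandomState(0).rand(8, 32, 32, 3).astype("float32")

_, hit_first = cached_features(images, fake_extractor, cache_dir)
_, hit_second = cached_features(images, fake_extractor, cache_dir)

# Adding one more image changes the fingerprint, so the features are regenerated.
bigger = np.concatenate([images, images[:1]])
feats_bigger, hit_bigger = cached_features(bigger, fake_extractor, cache_dir)

print(hit_first, hit_second, hit_bigger)
```

The second call is served from disk, while the enlarged dataset misses the cache and triggers a fresh extraction, which is exactly the regeneration cost described above.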

Please also note that the 1000 FC layer in the above figure is just an example. The bottleneck layer could be anything you wish to build, such as 512 -> 256 -> 2 FC layers or 1000 -> 5 FC layers.

Now the fun part begins!

Data Preprocessing

For our dataset, we are going to use the CIFAR10 dataset, which Keras conveniently provides through a simple API.

from keras import backend as K
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

For those who are unfamiliar with the CIFAR10 dataset, do not worry. As with any dataset we haven’t seen before, a good rule of thumb is to always take a visual peek at the first couple of images.

Using Python’s matplotlib library to visualize the first 4 training images:

import matplotlib.pyplot as plt

sample_num = 4
fig = plt.figure(figsize=(12, 12))
for i in range(sample_num):
  ax = fig.add_subplot(1, sample_num, i + 1)
  ax.imshow(x_train[i])  # draw the image; without this the subplots stay empty
  ax.title.set_text(f'Shape: {x_train[i].shape}')
plt.show()

Produces the output:

cifar10 train images
First 4 images in the train set

As you can see, the dataset contains many different things: a frog, two trucks, and a deer. All of them have the same shape, 32 x 32 pixels with 3 color channels.

Now that we have an idea of the input images, let’s take a peek at the expected output data:

print(y_train[:sample_num].tolist())

We get

[[6], [9], [9], [4]]

Looks cryptic, doesn’t it? Let’s dig a little bit deeper. Since we are building on ResNet50, a model that is typically used for image-to-label classification, we can guess that these numbers are encoded class labels. A quick Google search confirms that CIFAR10 is indeed an image-to-label dataset, so the dataset type matches the network type.

Now retrying with provided label mapping:

import numpy as np

# Taken from official cifar:
labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
print(np.take(labels, y_train[:sample_num]))

Indeed, we now see somewhat meaningful outputs:

[['frog'], ['truck'], ['truck'], ['deer']]

Now with a good idea of what kinds of data we’re dealing with, we must take care of another important pre-processing step in order to leverage the pre-trained weights from ResNet50.

Keras’ ResNet50 weights were trained on the ImageNet dataset, with an additional image normalization applied to the training data. For us to reuse the weights with our own dataset, we must apply the same normalization. Luckily, Keras generously supplies a pre-processing API that is as easy as making a few function calls:

from keras.applications.resnet50 import preprocess_input
from PIL import Image

def resize(arr, shape):
  return np.array(Image.fromarray(arr).resize(shape))

# Reducing train and test dataset to showcase efficiency in pre-trained net with small datasets
reduce_train = 1000
reduce_test = 100
x_train, x_test, y_train, y_test = x_train[:reduce_train], x_test[:reduce_test], y_train[:reduce_train], y_test[:reduce_test]

# Resize all images to 139x139 for faster training
x_train = np.array([resize(x_train[i], (139, 139)) for i in range(0, len(x_train))]).astype('float32')
x_test = np.array([resize(x_test[i], (139, 139)) for i in range(0, len(x_test))]).astype('float32')
x_train = preprocess_input(x_train)
x_test = preprocess_input(x_test)
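To make the normalization step less of a black box, here is a rough numpy sketch of what `preprocess_input` is understood to do for ResNet50: reorder the channels from RGB to BGR and subtract the per-channel ImageNet means, without rescaling. The function name `caffe_style_preprocess` is hypothetical, and the sketch assumes channels-last input with pixel values in [0, 255]. In real code, the Keras function remains the one to call.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by "caffe"-style preprocessing.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype="float32")


def caffe_style_preprocess(batch_rgb):
    """Approximate ResNet50 preprocessing for a channels-last RGB batch in [0, 255]."""
    batch = np.asarray(batch_rgb, dtype="float32")
    batch_bgr = batch[..., ::-1]            # reorder channels RGB -> BGR
    return batch_bgr - IMAGENET_BGR_MEAN    # zero-center each channel, no rescaling


# A single 2x2 "image" whose pixels are all pure red (R=255, G=0, B=0).
red = np.zeros((1, 2, 2, 3), dtype="float32")
red[..., 0] = 255.0

out = caffe_style_preprocess(red)
print(out.shape)
print(out[0, 0, 0])  # blue and green end up negative, red ends up positive
```

After the channel swap the red value sits in the last position, so the first two entries of each pixel are just the negated blue and green means.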

Extracting Bottleneck Features

With our processed data locked and loaded, now is the time to download the ResNet50 model and its pre-trained weights:

from keras.applications.resnet50 import ResNet50

resnet50 = ResNet50(weights='imagenet', include_top=False, input_shape=(139, 139, 3))
bottleneck_train_features = resnet50.predict(x_train)
bottleneck_test_features = resnet50.predict(x_test)

In just 3 lines of code, we have our train and test bottleneck features! To save these features for later use, we can rely on our nifty numpy:

import numpy as np

np.save('./resnet50_bottleneck_features.npy', {'train': bottleneck_train_features, 'test': bottleneck_test_features})
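Saving a dict through `np.save` stores it as a pickled object, which is why loading it back requires `allow_pickle=True`. As an alternative sketch (not what the original code does), `np.savez` keeps each array under its own name in a single `.npz` file and avoids pickling. The zero and one arrays below are stand-ins with the same per-sample shape as the real features, and the temporary path is only for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays with the same per-sample shape as the ResNet50 features (5, 5, 2048).
train_features = np.zeros((4, 5, 5, 2048), dtype="float32")
test_features = np.ones((2, 5, 5, 2048), dtype="float32")

path = os.path.join(tempfile.mkdtemp(), "resnet50_bottleneck_features.npz")
np.savez(path, train=train_features, test=test_features)

# np.load returns a dict-like NpzFile; plain arrays need no allow_pickle.
with np.load(path) as restored:
    restored_train = restored["train"]
    restored_test = restored["test"]

print(restored_train.shape, restored_test.shape)
```

For large feature sets, `np.savez_compressed` has the same interface and trades some CPU time for a smaller file.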

Training Bottleneck Layers

For the rest of the network, we won’t be building something as deep as the pre-trained ResNet50. Instead, a simple network of 2-3 layers consisting of convolutional and FC layers should be more than enough to complement the pre-trained net:

from keras.layers import Input, Conv2D, Dropout, GlobalAveragePooling2D, Dense
from keras.models import Model
from keras.optimizers import Adam

num_classes = 10  # CIFAR10 has 10 classes

# Restore the bottleneck features that were saved previously.
# np.save pickled the dict into a 0-d object array, so .item() unwraps it.
restored_features = np.load('./resnet50_bottleneck_features.npy', allow_pickle=True).item()
bottleneck_train_features = restored_features['train']
bottleneck_test_features = restored_features['test']

in_layer = Input(shape=(bottleneck_train_features.shape[1:]))
x = Conv2D(filters=100, kernel_size=2)(in_layer)
x = Dropout(0.4)(x)
x = GlobalAveragePooling2D()(x)
x = Dropout(0.3)(x)
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=in_layer, outputs=predictions)

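Let’s take a look at what we built. The network starts with an input layer that receives our bottleneck features, followed by a small Conv2D layer with 100 filters and a 2×2 kernel. The summary below then shows 2 Dense (fully-connected) layers: one with 256 units, and a final one with as many units as there are classes (10). Note that the code listing above omits the 256-unit layer and goes straight from pooling to the final softmax layer, so a model built from it has fewer parameters than the summary reports.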

Model: "model_1"
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 5, 5, 2048)        0         
conv2d_1 (Conv2D)            (None, 4, 4, 100)         819300    
global_average_pooling2d_1 ( (None, 100)               0         
dropout_1 (Dropout)          (None, 100)               0         
dense_1 (Dense)              (None, 256)               25856     
activation_50 (Activation)   (None, 256)               0         
dropout_2 (Dropout)          (None, 256)               0         
dense_2 (Dense)              (None, 10)                2570      
Total params: 847,726
Trainable params: 847,726
Non-trainable params: 0
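The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for layers with a bias term; the helper names are made up for this example, and the shapes are read off the summary above.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_units, out_units):
    """One weight per input-output pair plus one bias per output unit."""
    return in_units * out_units + out_units


conv = conv2d_params(2, 2, 2048, 100)   # conv2d_1 on the (5, 5, 2048) features
dense_1 = dense_params(100, 256)        # dense_1 after global average pooling
dense_2 = dense_params(256, 10)         # dense_2, one unit per class
total = conv + dense_1 + dense_2

# Without padding, a 2x2 kernel shrinks each 5x5 map to (5 - 2 + 1) = 4 per side.
conv_output_side = 5 - 2 + 1

print(conv, dense_1, dense_2, total, conv_output_side)
```

The 819,300 parameters of the convolution only add up with a 2×2 kernel, which also explains the (None, 4, 4, 100) output shape.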

Finally, we kick off our training with 50 epochs and some helper callback functions to gauge how well our model is training:

model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['acc'])
model.fit(bottleneck_train_features, y_train, batch_size=32, epochs=50, validation_data=(bottleneck_test_features, y_test), shuffle=True)
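One detail worth spelling out: the `categorical_crossentropy` loss expects one-hot encoded labels, whereas `cifar10.load_data()` returns integer class indices of shape (n, 1). The listings here do not show the conversion, so the following is a numpy sketch of that step under this assumption. The `one_hot` helper is hypothetical; `keras.utils.to_categorical` provides the same transformation.

```python
import numpy as np

num_classes = 10


def one_hot(labels, num_classes):
    """Turn integer labels of shape (n, 1) or (n,) into an (n, num_classes) float matrix."""
    indices = np.asarray(labels).reshape(-1)
    return np.eye(num_classes, dtype="float32")[indices]


# The first four CIFAR10 training labels shown earlier: frog, truck, truck, deer.
y_sample = np.array([[6], [9], [9], [4]])
y_encoded = one_hot(y_sample, num_classes)

print(y_encoded.shape)
print(y_encoded.argmax(axis=1))
```

Another option is to keep the integer labels and compile with `sparse_categorical_crossentropy` instead.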

The Verdict

Train on 1000 samples, validate on 100 samples
Epoch 1/50
1000/1000 [==============================] - 13s 13ms/step - loss: 4.5964 - acc: 0.4100 - val_loss: 2.8857 - val_acc: 0.5600
Epoch 2/50
1000/1000 [==============================] - 2s 2ms/step - loss: 2.2064 - acc: 0.6880 - val_loss: 2.8754 - val_acc: 0.5400
Epoch 3/50
1000/1000 [==============================] - 2s 2ms/step - loss: 1.5534 - acc: 0.7650 - val_loss: 2.5624 - val_acc: 0.6600
Epoch 4/50
1000/1000 [==============================] - 2s 2ms/step - loss: 1.2625 - acc: 0.7950 - val_loss: 2.1131 - val_acc: 0.6800
Epoch 5/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.8929 - acc: 0.8490 - val_loss: 2.7099 - val_acc: 0.6100
Epoch 6/50
1000/1000 [==============================] - 2s 2ms/step - loss: 1.0253 - acc: 0.8350 - val_loss: 2.1196 - val_acc: 0.7100
Epoch 7/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.8658 - acc: 0.8380 - val_loss: 2.1458 - val_acc: 0.7400
Epoch 8/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.4506 - acc: 0.8990 - val_loss: 2.1909 - val_acc: 0.7300
Epoch 9/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.3934 - acc: 0.9120 - val_loss: 2.3684 - val_acc: 0.7400

Not bad at all! We managed to get to 74% test accuracy in about 30 seconds. For such a small dataset (with a small original image resolution of 32×32), the model we built here barely costs a dime in terms of resources and can even be trained on your good ol’ CPU.

Training via Weight Freezing

Another method of utilizing pre-trained weights is weight freezing. Instead of extracting bottleneck features from ResNet50 and throwing the model away, this time we’re going to keep the pre-trained ResNet50 model in memory while throwing out the intermediate bottleneck features. To put it simply, we save the memory needed to store the features at the cost of slower training, which I will explain in more detail later.

To begin, we’re going to reuse the pre-processing steps we had earlier. All those steps are still important regardless of which training method we end up choosing. Instead of building the Keras Model only for our custom bottleneck layers, this time we’re going to connect the input of the model to the input of ResNet50 and the output of the model to the output of the bottleneck layers, all the while freezing the ResNet50 layers:

import keras
from packaging import version
from keras.layers import BatchNormalization, Conv2D, Dropout, GlobalAveragePooling2D, Dense
from keras.models import Model

# Freeze every pre-trained layer so that only the new top layers are trained
legacy_keras_with_bn_bug = version.parse(keras.__version__) < version.parse('2.1.3')
for layer in resnet50.layers:
  layer.trainable = False
  if legacy_keras_with_bn_bug and isinstance(layer, BatchNormalization):
    layer._per_input_updates = {}

x = Conv2D(filters=100, kernel_size=2)(resnet50.output)
x = Dropout(0.4)(x)
x = GlobalAveragePooling2D()(x)
x = Dropout(0.3)(x)
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=resnet50.input, outputs=predictions)

*Note: Due to a bug in Keras < 2.1.3, batch normalization layers may not freeze properly when setting layer.trainable = False, which requires a workaround.

The summary of the model (printing only the first and last few layers):

Model: "model_4"
Layer (type)                    Output Shape         Param #                        
input_4 (InputLayer)            (None, 139, 139, 3)  0                                            
conv1_pad (ZeroPadding2D)       (None, 145, 145, 3)  0                          
conv1 (Conv2D)                  (None, 70, 70, 64)   9472                        
bn_conv1 (BatchNormalization)   (None, 70, 70, 64)   256                           
activation_148 (Activation)     (None, 70, 70, 64)   0                            
pool1_pad (ZeroPadding2D)       (None, 72, 72, 64)   0                      
max_pooling2d_4 (MaxPooling2D)  (None, 35, 35, 64)   0                       
add_64 (Add)                    (None, 5, 5, 2048)   0                      
activation_196 (Activation)     (None, 5, 5, 2048)   0                 

conv2d_5 (Conv2D)               (None, 4, 4, 100)    819300                 
dropout_9 (Dropout)             (None, 4, 4, 100)    0                           
global_average_pooling2d_5 (Glo (None, 100)          0                         
dropout_10 (Dropout)            (None, 100)          0           
dense_4 (Dense)                 (None, 10)           1010                  
Total params: 24,408,022
Trainable params: 820,310
Non-trainable params: 23,587,712

Doing a quick comparison with the bottleneck extraction model above, the number of trainable params is roughly the same, but the number of non-trainable params is a whopping 28.75x the number of trainable params. This means that in every epoch of a training session, the data needs to pass through 28.75 times as many parameters as the ones we actually care about training. The computations within the pre-trained layers are largely repeated from epoch to epoch.
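The ratio quoted above follows directly from the last three lines of the model summary. A quick check, with the numbers copied from that summary:

```python
trainable = 820_310
non_trainable = 23_587_712

total = trainable + non_trainable
ratio = non_trainable / trainable      # frozen parameters per trainable parameter
trainable_share = trainable / total    # fraction of the model that actually learns

print(total)
print(round(ratio, 2))
print(f"{trainable_share:.1%}")
```

In other words, only a few percent of the parameters that the data flows through in each epoch are being updated.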

Lastly, we kick off the training loop once again:

model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['acc'])
model.fit(x_train, y_train, batch_size=32, epochs=50, validation_data=(x_test, y_test), shuffle=True)

The Verdict

Epoch 1/50
1000/1000 [==============================] - 131s 131ms/step - loss: 4.9892 - acc: 0.4420 - val_loss: 6.3795 - val_acc: 0.4400
Epoch 2/50
1000/1000 [==============================] - 128s 128ms/step - loss: 3.4056 - acc: 0.6670 - val_loss: 5.2158 - val_acc: 0.5067
Epoch 3/50
1000/1000 [==============================] - 127s 127ms/step - loss: 2.2991 - acc: 0.7610 - val_loss: 6.7120 - val_acc: 0.4333
Epoch 4/50
1000/1000 [==============================] - 128s 128ms/step - loss: 1.6426 - acc: 0.7990 - val_loss: 7.2225 - val_acc: 0.4467
Epoch 5/50
1000/1000 [==============================] - 128s 128ms/step - loss: 1.3637 - acc: 0.8370 - val_loss: 6.2196 - val_acc: 0.4800
Epoch 6/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.9071 - acc: 0.8770 - val_loss: 7.1397 - val_acc: 0.4433
Epoch 7/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.7794 - acc: 0.8950 - val_loss: 7.0817 - val_acc: 0.4500
Epoch 8/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.8058 - acc: 0.8920 - val_loss: 7.8264 - val_acc: 0.4200
Epoch 9/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.8408 - acc: 0.8970 - val_loss: 7.5383 - val_acc: 0.4500

As the model summary comparison suggested, the weight freezing method is far less efficient than its bottleneck feature counterpart. Each epoch takes about 128 seconds instead of 2, roughly 64x longer, and the validation accuracy is much lower. The training session was cut short after 9 epochs (~19 min) on a CPU without showing reasonable improvement.

The Final Verdict

Both methods of utilizing Keras’s pre-trained weights have their pros and cons. Bottleneck feature extraction takes up much more memory on a large dataset, since it has to store the bottleneck features of the entire dataset, but it makes up for that with a massive increase in training speed. The weight freezing method, on the other hand, requires much more computation for each epoch, but it makes up for that by being able to feed a large dataset through the network in small batches, which in turn saves a lot of memory.
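To put a rough number on the memory side of this trade-off, the sketch below estimates the per-image footprint from the shapes used in this tutorial. It assumes the features are stored as float32 (4 bytes per value), and the `nbytes` helper is a made-up name for this example.

```python
def nbytes(shape, bytes_per_value):
    """Size in bytes of a dense array with the given shape and element width."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value


raw_image = nbytes((32, 32, 3), 1)          # original CIFAR10 image, uint8
resized_image = nbytes((139, 139, 3), 4)    # resized input, float32
bottleneck = nbytes((5, 5, 2048), 4)        # ResNet50 output per image, float32

samples = 1000
features_mb = bottleneck * samples / 1e6

print(raw_image, resized_image, bottleneck)
print(features_mb, "MB of bottleneck features for", samples, "images")
```

Under these assumptions, each bottleneck feature map is dozens of times larger than the original 32×32 image it came from, which is why storing features for a very large dataset can become the limiting factor.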

The example in this article was trained on 1,000 32×32 images, which is considered puny in the competitive deep learning world. Therefore, the bottleneck feature extraction method prevails over its counterpart in this specific case. If our dataset were much larger in number and size, the choice might not have been so clear.

In conclusion, there isn’t a single best method that suits all. It all depends on your needs and resources. Do you have more storage than computing resources? Do you have a decent GPU but lack storage? How large is your dataset? 100 32×32 images or 10k 6MP images? If you are still unsure about which approach to go with, it might be worthwhile to try out both of them before settling on a conclusion.

The complete code for this tutorial is available on my GitHub repo. Let me know what you end up choosing for your particular situation down below or here. Until next time, happy experimenting!
