
There is no question that Deep Learning has gained massive popularity in recent years. But have you ever wondered how you can leverage its power from the comfort of your home, without renting massive computing resources from Amazon’s AWS or Microsoft’s Azure? In this tutorial, I will go over everything you need to know to master Keras transfer learning.
At a high level, I will build two simple neural networks in Keras using the power of ResNet50’s pre-trained weights. Both networks pursue the same goal: train on a dataset as fast as possible while pushing prediction accuracy as high as possible. They differ in process, however. The first method works by extracting intermediate image representations called bottleneck features; the second works by freezing the pre-trained weights.
Training via Bottleneck Features
Before we get to the code, let us take a moment to fully understand what bottleneck features are and why they are so important.
Taking ResNet50 as an example, the first 50 convolutional layers contain pre-trained weights that remain untouched and are used exactly as-is to run over our dataset. At the output of those 50 layers, we get a set of encoded images which we call bottleneck features. By extension, any layers we add after that point are called bottleneck layers.

Bottleneck features are extremely powerful due to their fire-and-forget nature. Once we extract the bottleneck features for our dataset of interest, we can forget about “how” ResNet50 was built and care only about “what” features it produced. In practical terms, we can simply save the bottleneck features to disk. The next time we load them, we can train just the bottleneck layers in a matter of minutes and still get great results!
A caveat of saving and reusing bottleneck features is that each time new data is introduced to the input dataset, we must regenerate bottleneck features for the updated dataset, because the features correspond one-to-one with the input images.
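Jumping ahead a little, here is a rough sketch of that bookkeeping. It assumes the frozen base model (resnet50) and saved features we create later in this tutorial, and new_images is a hypothetical batch that has already been resized and preprocessed the same way as the rest of the dataset:

import numpy as np

# Hypothetical incremental update: encode only the newly added images with the
# frozen base, then append them to the existing bottleneck features before
# retraining the bottleneck layers
new_features = resnet50.predict(new_images)
bottleneck_train_features = np.concatenate([bottleneck_train_features, new_features])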
Please also note that the 1000-unit FC layer in the above figure is just an example. The bottleneck layers could be anything you wish to build, such as 512 -> 256 -> 2 FC layers or 1000 -> 5 FC layers; a small sketch follows.
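For instance, a minimal sketch of a 512 -> 256 -> 2 bottleneck head in Keras (hypothetical layer sizes, attached to features shaped like the (5, 5, 2048) bottleneck output we will produce later in this tutorial) might look like this:

from keras.layers import Input, GlobalAveragePooling2D, Dense
from keras.models import Model

features_in = Input(shape=(5, 5, 2048))      # example bottleneck feature shape
x = GlobalAveragePooling2D()(features_in)    # collapse the spatial dimensions
x = Dense(512, activation='relu')(x)
x = Dense(256, activation='relu')(x)
out = Dense(2, activation='softmax')(x)      # e.g. a 2-class problem

head = Model(inputs=features_in, outputs=out)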
Now the fun part begins!
Data Preprocessing
For our dataset, we are going to use the CIFAR10 dataset, which Keras conveniently provides through a simple API.

from keras import backend as K
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
For those who are unfamiliar with the CIFAR10 dataset, do not worry. As with any dataset we haven’t seen before, a good rule of thumb is to take a visual peek at the first couple of images.
Using Python’s matplotlib library to visualize the first 4 training images:

import matplotlib.pyplot as plt

sample_num = 4
fig = plt.figure(figsize=(12, 12))
for i in range(sample_num):
    fig.add_subplot(1, sample_num, i+1).title.set_text(f'Shape: {x_train[i].shape}')
    plt.imshow(x_train[i])
Produces the output:

As you can see, the dataset contains many different things – a frog, two trucks, and a deer. All of them share the same shape: 32 x 32 pixels with 3 color channels.
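You can confirm this programmatically with a quick check (not part of the original pipeline):

# The CIFAR10 training set holds 50,000 RGB images of 32x32 pixels each,
# and the test set holds 10,000 more
print(x_train.shape)   # (50000, 32, 32, 3)
print(x_test.shape)    # (10000, 32, 32, 3)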
Now that we have an idea of the input images, let’s take a peek at the expected output data:
print(y_train[:sample_num])
We get
[[6], [9], [9], [4]]
Looks cryptic, doesn’t it? Let’s dig a little deeper. Since we are building on ResNet50, a model typically used for image-to-label classification, we can guess that these numbers are encoded labels. A quick Google search confirms that CIFAR10 is indeed an image-to-label dataset – a dataset type that matches the network type.
Now, retrying with the provided label mapping:

import numpy as np

# Taken from official cifar: https://www.cs.toronto.edu/~kriz/cifar.html
labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

print(np.take(labels, y_train[:sample_num]))
Indeed, we now see somewhat meaningful outputs:
[['frog'], ['truck'], ['truck'], ['deer']]
Now that we have a good idea of what kind of data we’re dealing with, we must take care of another important pre-processing step in order to leverage the pre-trained weights from ResNet50.
Because Keras’ ResNet50 was trained on the ImageNet dataset, a specific image normalization was applied to the training data during that training. To reuse the weights with our own dataset, we must abide by the same rules. Luckily, Keras generously supplies a pre-processing API that is as easy as a few function calls:

from keras.applications.resnet50 import preprocess_input
from keras.utils import to_categorical
from PIL import Image

def resize(arr, shape):
    return np.array(Image.fromarray(arr).resize(shape))

# Reducing train and test dataset to showcase the efficiency of a pre-trained net with small datasets
reduce_train = 1000
reduce_test = 100
x_train, x_test, y_train, y_test = x_train[:reduce_train], x_test[:reduce_test], y_train[:reduce_train], y_test[:reduce_test]

# Resize all images to 139x139 for faster training
x_train = np.array([resize(x_train[i], (139, 139)) for i in range(0, len(x_train))]).astype('float32')
x_test = np.array([resize(x_test[i], (139, 139)) for i in range(0, len(x_test))]).astype('float32')

# Apply the same channel-wise normalization used when ResNet50 was trained on ImageNet
x_train = preprocess_input(x_train)
x_test = preprocess_input(x_test)

# One-hot encode the integer labels so they match the categorical cross-entropy loss used later
num_classes = 10
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
Extracting Bottleneck Features
With our processed data locked and loaded, it is time to download the ResNet50 model and its pre-trained weights:

from keras.applications.resnet50 import ResNet50

resnet50 = ResNet50(weights='imagenet', include_top=False, input_shape=(139, 139, 3))

bottleneck_train_features = resnet50.predict(x_train)
bottleneck_test_features = resnet50.predict(x_test)
In just 3 lines of code, we have our train and test bottleneck features! To save these features for later use, we lean on our trusty numpy:

import numpy as np

np.save('./resnet50_bottleneck_features.npy', {'train': bottleneck_train_features, 'test': bottleneck_test_features})
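One thing to keep in mind: np.save pickles a Python dict into a 0-dimensional object array, so when reading it back you need allow_pickle=True plus .item() to recover the dict, as done in the next snippet. A quick check might look like this:

# Recover the saved dict and confirm the feature shape
restored = np.load('./resnet50_bottleneck_features.npy', allow_pickle=True).item()
print(restored['train'].shape)   # expected: (1000, 5, 5, 2048)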
Training Bottleneck Layers
For the rest of the network, we won’t be building anything as deep as the pre-trained ResNet50. Instead, a simple network of one convolutional layer and a couple of FC layers should be more than enough to complement the pre-trained net:

from keras.layers import Input, Conv2D, Dropout, GlobalAveragePooling2D, Dense, Activation
from keras.models import Model

# Restore the bottleneck features that were saved previously
# (.item() recovers the dict that np.save pickled into a 0-d object array)
restored_features = np.load('./resnet50_bottleneck_features.npy', allow_pickle=True).item()
bottleneck_train_features = restored_features['train']
bottleneck_test_features = restored_features['test']

in_layer = Input(shape=bottleneck_train_features.shape[1:])
x = Conv2D(filters=100, kernel_size=2)(in_layer)
x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
x = Dense(256)(x)
x = Activation('relu')(x)
x = Dropout(0.3)(x)
predictions = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=in_layer, outputs=predictions)
model.summary()
Let’s take a look at what we built. The network starts with an input layer into which we feed our bottleneck features, followed by a small Conv2D layer with 100 filters and a 2×2 kernel, then global average pooling and dropout. It finishes with 2 Dense (fully-connected) layers: one with 256 cells, and a final one with as many cells as there are classes (10).

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 5, 5, 2048)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 4, 4, 100)         819300
_________________________________________________________________
global_average_pooling2d_1 ( (None, 100)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0
_________________________________________________________________
dense_1 (Dense)              (None, 256)               25856
_________________________________________________________________
activation_50 (Activation)   (None, 256)               0
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2570
=================================================================
Total params: 847,726
Trainable params: 847,726
Non-trainable params: 0
Finally, we compile the model and kick off training for 50 epochs, validating against the test set to gauge how well the model is learning:

from keras.optimizers import Adam

model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['acc'])

model.fit(bottleneck_train_features, y_train,
          batch_size=32,
          epochs=50,
          validation_data=(bottleneck_test_features, y_test),
          shuffle=True)
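If you want extra guardrails during training, Keras callbacks such as EarlyStopping and ModelCheckpoint can be passed to fit. This is an optional sketch (the checkpoint filename is hypothetical), not part of the run whose logs are shown below:

from keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop early when validation loss stalls, and keep the best weights seen so far
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5),
    ModelCheckpoint('bottleneck_model.best.h5', monitor='val_loss', save_best_only=True),
]

model.fit(bottleneck_train_features, y_train,
          batch_size=32,
          epochs=50,
          validation_data=(bottleneck_test_features, y_test),
          shuffle=True,
          callbacks=callbacks)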
The Verdict
Train on 1000 samples, validate on 100 samples
Epoch 1/50
1000/1000 [==============================] - 13s 13ms/step - loss: 4.5964 - acc: 0.4100 - val_loss: 2.8857 - val_acc: 0.5600
Epoch 2/50
1000/1000 [==============================] - 2s 2ms/step - loss: 2.2064 - acc: 0.6880 - val_loss: 2.8754 - val_acc: 0.5400
Epoch 3/50
1000/1000 [==============================] - 2s 2ms/step - loss: 1.5534 - acc: 0.7650 - val_loss: 2.5624 - val_acc: 0.6600
Epoch 4/50
1000/1000 [==============================] - 2s 2ms/step - loss: 1.2625 - acc: 0.7950 - val_loss: 2.1131 - val_acc: 0.6800
Epoch 5/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.8929 - acc: 0.8490 - val_loss: 2.7099 - val_acc: 0.6100
Epoch 6/50
1000/1000 [==============================] - 2s 2ms/step - loss: 1.0253 - acc: 0.8350 - val_loss: 2.1196 - val_acc: 0.7100
Epoch 7/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.8658 - acc: 0.8380 - val_loss: 2.1458 - val_acc: 0.7400
Epoch 8/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.4506 - acc: 0.8990 - val_loss: 2.1909 - val_acc: 0.7300
Epoch 9/50
1000/1000 [==============================] - 2s 2ms/step - loss: 0.3934 - acc: 0.9120 - val_loss: 2.3684 - val_acc: 0.7400
Not bad at all! We managed to reach 74% test accuracy in about 31 seconds. For such a small dataset (with a small original image resolution of 32×32), the model we built here barely costs a dime in resources and can even be trained on your good ol’ CPU.
Training via Weight Freezing
Another method of utilizing pre-trained weights is weight freezing. Instead of extracting bottleneck features from ResNet50 and throwing the model away, this time we keep the pre-trained ResNet50 in memory while throwing out the intermediate bottleneck features. To put it simply, we are trading training performance for lower memory and storage requirements, which I will explain in more detail later.
To begin, we reuse the pre-processing steps from earlier; those steps matter regardless of which training method we end up choosing. But instead of building a Keras Model for only our custom bottleneck layers, this time we connect the model’s input to ResNet50’s input and its output to the output of our bottleneck layers, all while freezing the ResNet50 layers:

import keras
from packaging import version
from keras.layers import BatchNormalization

# Keras < 2.1.3 does not fully freeze BatchNormalization layers via trainable=False,
# so we clear their per-input updates as a workaround (see the note below)
legacy_keras_with_bn_bug = version.parse(keras.__version__) < version.parse('2.1.3')

for layer in resnet50.layers:
    layer.trainable = False
    if legacy_keras_with_bn_bug and isinstance(layer, BatchNormalization):
        layer._per_input_updates = {}

x = Conv2D(filters=100, kernel_size=2)(resnet50.output)
x = Dropout(0.4)(x)
x = GlobalAveragePooling2D()(x)
x = Dropout(0.3)(x)
predictions = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=resnet50.input, outputs=predictions)
model.summary()
*Note: Due to a bug in Keras < 2.1.3, batch normalization layers may not freeze properly when setting layer.trainable = False, which requires the workaround shown above.
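A quick sanity check (not in the original snippet) is to count trainable parameters and confirm the freeze took effect; the totals should match the 820,310 and 23,587,712 reported in the summary below:

import numpy as np
from keras import backend as K

# Sum the element counts of trainable vs. frozen weight tensors
trainable_count = int(np.sum([K.count_params(w) for w in model.trainable_weights]))
frozen_count = int(np.sum([K.count_params(w) for w in model.non_trainable_weights]))
print(trainable_count, frozen_count)   # expected: 820310 23587712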
The summary of the model (showing only the first and last few layers):

Model: "model_4"
__________________________________________________________________________
Layer (type)                    Output Shape            Param #
==========================================================================
input_4 (InputLayer)            (None, 139, 139, 3)     0
__________________________________________________________________________
conv1_pad (ZeroPadding2D)       (None, 145, 145, 3)     0
__________________________________________________________________________
conv1 (Conv2D)                  (None, 70, 70, 64)      9472
__________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 70, 70, 64)      256
__________________________________________________________________________
activation_148 (Activation)     (None, 70, 70, 64)      0
__________________________________________________________________________
pool1_pad (ZeroPadding2D)       (None, 72, 72, 64)      0
__________________________________________________________________________
max_pooling2d_4 (MaxPooling2D)  (None, 35, 35, 64)      0
__________________________________________________________________________
...
__________________________________________________________________________
add_64 (Add)                    (None, 5, 5, 2048)      0
__________________________________________________________________________
activation_196 (Activation)     (None, 5, 5, 2048)      0
__________________________________________________________________________
conv2d_5 (Conv2D)               (None, 4, 4, 100)       819300
__________________________________________________________________________
dropout_9 (Dropout)             (None, 4, 4, 100)       0
__________________________________________________________________________
global_average_pooling2d_5 (Glo (None, 100)             0
__________________________________________________________________________
dropout_10 (Dropout)            (None, 100)             0
__________________________________________________________________________
dense_4 (Dense)                 (None, 10)              1010
==========================================================================
Total params: 24,408,022
Trainable params: 820,310
Non-trainable params: 23,587,712
Doing a quick comparison with the bottleneck-extraction model above, the number of trainable params is roughly the same, but the non-trainable params are a whopping 28.75x the trainable ones. This means that in every epoch, each batch must still flow through all of the frozen pre-trained layers, repeating computations that the bottleneck-feature approach performed only once up front.
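The ratio comes straight from the parameter counts in the summary above:

# Frozen (non-trainable) parameters divided by trainable parameters
print(23587712 / 820310)   # ~28.75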
Lastly, we kick off the training loop once again:
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['acc'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=50,
          validation_data=(x_test, y_test),
          shuffle=True)
The Verdict
Epoch 1/50
1000/1000 [==============================] - 131s 131ms/step - loss: 4.9892 - acc: 0.4420 - val_loss: 6.3795 - val_acc: 0.4400
Epoch 2/50
1000/1000 [==============================] - 128s 128ms/step - loss: 3.4056 - acc: 0.6670 - val_loss: 5.2158 - val_acc: 0.5067
Epoch 3/50
1000/1000 [==============================] - 127s 127ms/step - loss: 2.2991 - acc: 0.7610 - val_loss: 6.7120 - val_acc: 0.4333
Epoch 4/50
1000/1000 [==============================] - 128s 128ms/step - loss: 1.6426 - acc: 0.7990 - val_loss: 7.2225 - val_acc: 0.4467
Epoch 5/50
1000/1000 [==============================] - 128s 128ms/step - loss: 1.3637 - acc: 0.8370 - val_loss: 6.2196 - val_acc: 0.4800
Epoch 6/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.9071 - acc: 0.8770 - val_loss: 7.1397 - val_acc: 0.4433
Epoch 7/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.7794 - acc: 0.8950 - val_loss: 7.0817 - val_acc: 0.4500
Epoch 8/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.8058 - acc: 0.8920 - val_loss: 7.8264 - val_acc: 0.4200
Epoch 9/50
1000/1000 [==============================] - 128s 128ms/step - loss: 0.8408 - acc: 0.8970 - val_loss: 7.5383 - val_acc: 0.4500
As the model summary comparison foreshadowed, the weight-freezing method is far less efficient than its bottleneck-feature counterpart. Each epoch takes roughly 60 times longer (about 128 s versus 2 s after the first epoch), and the validation accuracy ends up much lower. The training session was cut short after 9 epochs (~17 min) running on a CPU, without seeing reasonable improvement.
The Final Verdict
Both methods of utilizing Keras’s pre-trained weights have their pros and cons. Bottleneck-feature extraction uses much more memory on large datasets (it has to store the bottleneck features for the entire dataset), but it makes up for that with a massive boost in training speed. The weight-freezing method, on the other hand, demands far more computation per epoch, but it makes up for it by being able to stream a large dataset through in small batches, which saves a huge amount of memory.
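As a rough illustration of that batching idea (a hedged sketch, not from the original code, using Keras’s ImageDataGenerator to feed the frozen-weights model in small batches rather than materializing features for the whole dataset up front):

from keras.preprocessing.image import ImageDataGenerator

# Stream the training data to the frozen-weights model in batches of 32;
# for truly large datasets, flow_from_directory can stream images from disk instead
datagen = ImageDataGenerator()
train_flow = datagen.flow(x_train, y_train, batch_size=32)

model.fit_generator(train_flow,
                    steps_per_epoch=len(x_train) // 32,
                    epochs=50,
                    validation_data=(x_test, y_test))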
The example in this article is trained on only a thousand 32×32 images, which is puny by competitive deep learning standards. For this specific case, the bottleneck-feature extraction method clearly prevails over its counterpart. If our dataset were much larger in both count and image size, the choice would not be so clear.
In conclusion, there isn’t one best method that suits all situations. It all depends on your needs and resources. Do you have more storage than computing power? Do you have a decent GPU but lack storage? How large is your dataset – 100 32×32 images or 10k 6MP images? If you are still unsure which approach to take, it may be worthwhile to try both before settling on a conclusion.
The complete code for this tutorial is available on my GitHub repo. Let me know what you end up choosing for your particular situation down below or here. Until next time, happy experimenting!

Jayce is a technology enthusiast, career advocate, and machine learning engineer. In his leisure time, he enjoys adventuring into all things new, whether places or food.