
What is Transfer Learning? Transfer Learning in Keras | Fine Tuning Vs Feature Extraction

CampusX
5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Transfer learning reuses a pretrained CNN to avoid costly manual labeling and slow training from scratch.

Briefing

Transfer learning is presented as the practical fix for two bottlenecks in deep learning: collecting and labeling huge datasets, and waiting days for models to train from scratch. Instead of training a CNN on a brand-new dataset, a model pretrained on a large benchmark dataset (like ImageNet) is reused on a new task. The core payoff is immediate—less data is needed and training time drops—because the pretrained network already learned general visual features.

The discussion starts with why building your own model is hard. Training a CNN typically requires thousands of labeled images, and labeling is manual and costly. Even when data exists, training on a large dataset can take a long time, discouraging teams from starting from zero. Pretrained models solve both issues by transferring knowledge from a previously trained CNN to a new dataset.

A key example anchors the concept: ImageNet pretraining. The talk references well-known architectures trained on ImageNet—VGG16, ResNet, and Inception—highlighting that these models were trained on roughly 1,000 classes and millions of images. The pretrained CNN contains two major parts: a convolutional base that extracts features from images, and fully connected layers that perform classification for the original task. Transfer learning works by keeping the convolutional base (which captures reusable, general features like edges and textures) and replacing or adapting the classification layers to match the new problem.
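The split between convolutional base and classifier head is visible directly in Keras: loading VGG16 with `include_top=False` drops the fully connected classifier and keeps only the convolutional base. A quick sketch (using `weights=None` just to build the architecture without downloading the ImageNet weights; in practice you would pass `weights='imagenet'`, and the 150×150 input size is an illustrative choice, not taken from the transcript):

```python
from tensorflow.keras.applications import VGG16

# Full VGG16: convolutional base + Flatten + three Dense layers (the ImageNet classifier)
full = VGG16(weights=None, include_top=True)

# include_top=False keeps only the convolutional base (the reusable feature extractor)
base = VGG16(weights=None, include_top=False, input_shape=(150, 150, 3))

print(len(full.layers), len(base.layers))          # the base has fewer layers: no classifier head
print(full.layers[-1].name, base.layers[-1].name)  # 'predictions' (Dense) vs 'block5_pool'
```

The base ends at the last pooling layer, which is exactly the part transfer learning keeps.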

Two transfer-learning strategies are then contrasted: feature extraction and fine-tuning. In feature extraction, the convolutional base is frozen so its weights are not updated; only new classification layers are trained on the target dataset. This is framed as ideal when the target task is similar to the pretrained domain, because early layers learned “primitive” features that tend to generalize.
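A minimal feature-extraction sketch along these lines: freeze the conv base, then stack a new classifier head on top. Again `weights=None` avoids the ImageNet download here (use `weights='imagenet'` for real transfer learning), and the input size and Dense widths are illustrative assumptions:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights=None, include_top=False, input_shape=(150, 150, 3))
conv_base.trainable = False  # freeze: the base's weights are not updated during training

model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary output for the new task
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Only the two new Dense layers receive gradient updates; the frozen base acts purely as a fixed feature extractor.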

Fine-tuning is the more flexible approach. It still starts from the pretrained model, but it unfreezes some of the later convolutional layers (not necessarily the earliest ones) so the network can adapt higher-level features to the new task. The transcript uses a concrete scenario—phone versus tablet classification—arguing that if the target classes differ significantly from ImageNet’s categories, fine-tuning becomes more important. The tradeoff is cost: fine-tuning usually takes more time because more layers are trainable.

The second half moves into implementation details in Keras using VGG16. The workflow includes importing the VGG16 model with pretrained ImageNet weights, freezing the convolutional base, adding custom dense layers for the new binary classification, normalizing image pixel values, and training with binary cross-entropy loss (Adam for the feature-extraction run; RMSprop with a lower learning rate for fine-tuning). Results are reported for both approaches: feature extraction reaches about 91.4% test accuracy after applying data augmentation to reduce overfitting, while fine-tuning pushes accuracy higher, to around 95.2%, with a noted risk of overfitting visible in the training accuracy.
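The workflow, condensed into one runnable sketch. Dataset loading is replaced by random arrays so the sketch runs anywhere, and `weights=None` skips the ImageNet download; in the real pipeline you would pass `weights='imagenet'` and feed labeled images from a directory:

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights=None, include_top=False, input_shape=(150, 150, 3))
conv_base.trainable = False  # feature extraction: base stays frozen

model = models.Sequential([
    layers.Rescaling(1.0 / 255),          # normalize pixel values to [0, 1]
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stand-in data: 4 random "images" with binary labels
x = np.random.randint(0, 256, size=(4, 150, 150, 3)).astype("float32")
y = np.random.randint(0, 2, size=(4,)).astype("float32")
history = model.fit(x, y, epochs=1, verbose=0)
```

The same model structure serves both strategies; switching to fine-tuning only changes which base layers are trainable and which optimizer/learning rate is used.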

Overall, the transcript frames transfer learning as a “don’t reinvent the wheel” method: reuse pretrained feature extractors, then either train only the classifier head (feature extraction) or adapt deeper layers (fine-tuning) depending on how closely the new task matches the original training domain.

Cornell Notes

Transfer learning reuses a CNN pretrained on a large dataset (commonly ImageNet) to solve a new classification task with less labeled data and faster training. The pretrained model’s convolutional base learns general visual features, while the final classification layers are replaced to match the new labels. Feature extraction freezes the convolutional base and trains only the new dense layers; fine-tuning unfreezes some later convolutional layers so the model adapts to the target domain. In the Keras/VGG16 example, feature extraction with data augmentation improves test accuracy to about 91.4%, while fine-tuning (with a lower learning rate using RMSprop) reaches about 95.2%, at the cost of greater overfitting risk.

Why does transfer learning reduce both data requirements and training time?

Training from scratch needs large labeled datasets (the transcript cites thousands of images and manual labeling costs) and long compute time. Transfer learning starts from a CNN pretrained on a large benchmark dataset (ImageNet), so the model already learned reusable visual features. That means the target model needs fewer new labels and can converge faster because only the task-specific layers (or a subset of layers) are trained.

What are the two main parts of a pretrained CNN, and how are they reused?

The pretrained CNN is described as having (1) a convolutional base (feature extractor) and (2) fully connected layers (classifier). Transfer learning keeps the convolutional base because it captures general features like edges and textures, then replaces the classifier head with new dense layers and an output layer sized for the target task (binary in the example).

How does feature extraction differ from fine-tuning in practice?

Feature extraction freezes the convolutional base (weights are not updated) and trains only newly added dense layers for the new labels. Fine-tuning freezes fewer layers: it unfreezes the later convolutional blocks (e.g., the last one or two) so higher-level features can adapt to the new task. The transcript notes fine-tuning usually takes more time because more parameters are trainable.
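Comparing trainable-parameter counts makes the difference concrete: with every base layer frozen, only the new head is updated; unfreezing block5 adds millions more trainable parameters, which is why fine-tuning trains more slowly. A sketch (with `weights=None` standing in for `weights='imagenet'` and an illustrative 150×150 input):

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def n_trainable(m):
    """Total number of trainable scalar parameters in a model."""
    return int(sum(np.prod(w.shape) for w in m.trainable_weights))

base = VGG16(weights=None, include_top=False, input_shape=(150, 150, 3))
for layer in base.layers:           # feature extraction: freeze every base layer
    layer.trainable = False

model = models.Sequential([base, layers.Flatten(), layers.Dense(1, activation="sigmoid")])
frozen_count = n_trainable(model)   # just the new Dense layer's parameters

for layer in base.layers:           # fine-tuning: additionally unfreeze block5
    if layer.name.startswith("block5"):
        layer.trainable = True
fine_tune_count = n_trainable(model)

print(frozen_count, fine_tune_count)  # the fine-tuning count is millions larger
```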

When should fine-tuning be preferred over feature extraction?

Fine-tuning is recommended when the target task is significantly different from the pretrained domain. The transcript’s phone-vs-tablet example argues that if ImageNet doesn’t contain similar classes, later layers may need adaptation. Unfreezing some later convolutional layers helps the model learn task-specific patterns beyond the generic features learned earlier.

What training setup details were used in the Keras/VGG16 example?

The workflow includes importing VGG16 with ImageNet weights, adding custom dense layers on top, normalizing pixel values to speed training, and training with binary cross-entropy loss. Feature extraction uses Adam, while fine-tuning uses RMSprop with a lower learning rate (motivated by the need for smaller updates when unfreezing layers). Data augmentation is applied in the feature-extraction run to reduce overfitting.
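The two compile configurations can be sketched as below. The `1e-5` value is an illustrative "small" learning rate in the spirit of the transcript's advice, not a figure quoted from it, and the one-layer model is a stand-in:

```python
from tensorflow import keras

# Stand-in model so the compile calls have something to attach to
model = keras.Sequential([keras.Input(shape=(10,)),
                          keras.layers.Dense(1, activation="sigmoid")])

# Feature extraction: Adam at its default learning rate
model.compile(optimizer=keras.optimizers.Adam(),
              loss="binary_crossentropy", metrics=["accuracy"])

# Fine-tuning: RMSprop with a much smaller learning rate, so the
# unfrozen pretrained weights move only gently
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
```

A small learning rate matters here because large updates on unfrozen layers would quickly destroy the pretrained features the whole approach depends on.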

What accuracy results were reported for the two approaches?

Feature extraction with data augmentation is reported to reach about 91.4% test accuracy, with overfitting visible as a growing train/validation gap. Fine-tuning is reported to reach about 95.2% accuracy, with training accuracy around 99.8% and a noted overfitting risk that can be mitigated using techniques like data augmentation.
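One way to add the augmentation described is with Keras preprocessing layers (the video may instead use the older `ImageDataGenerator` API; these layers are the current equivalent, and the specific transforms and ranges here are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

augment = models.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),   # rotate by up to ±10% of a full turn
    layers.RandomZoom(0.2),
])

images = tf.random.uniform((2, 150, 150, 3))       # stand-in batch of images
augmented = augment(images, training=True)         # training=True activates the randomness
print(augmented.shape)                             # shape is preserved; pixels are transformed
```

Because each epoch sees randomly transformed variants of the same images, the model memorizes the training set less easily, which narrows the train/validation gap described above.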

Review Questions

  1. In transfer learning, which layers are typically frozen for feature extraction, and why?
  2. What changes when moving from feature extraction to fine-tuning, and how does that affect training time and overfitting risk?
  3. How do data augmentation and learning-rate choice influence the reported accuracy gap between training and validation?

Key Points

  1. Transfer learning reuses a pretrained CNN to avoid costly manual labeling and slow training from scratch.

  2. A pretrained model’s convolutional base captures general visual features that often transfer well to new tasks.

  3. Feature extraction freezes the convolutional base and trains only new classification layers on the target dataset.

  4. Fine-tuning unfreezes some later convolutional layers to adapt higher-level features when the target task differs from the pretrained domain.

  5. In the Keras/VGG16 example, data augmentation reduced overfitting and improved feature-extraction test accuracy to about 91.4%.

  6. Fine-tuning improved test accuracy further to about 95.2%, but training accuracy rose to near 99.8%, indicating overfitting risk.

  7. Normalization, binary cross-entropy loss, and optimizer choice (Adam vs RMSprop with a lower learning rate) are key practical details in the implementation.

Highlights

Transfer learning is framed as a direct solution to two real constraints: expensive labeling and long training times.
The convolutional base is treated as reusable “feature extraction” machinery, while the classifier head is swapped for the new labels.
Feature extraction freezes most of the network; fine-tuning unfreezes later blocks to adapt to domain differences.
Data augmentation is used to narrow the train/validation gap, improving generalization in the feature-extraction run.
Fine-tuning with a reduced learning rate (RMSprop) pushes accuracy higher but can amplify overfitting if not controlled.
