
Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, Andrew Zisserman
9 min read

Read the full paper via its DOI or on arXiv

TL;DR

The paper evaluates depth as a primary architectural variable for ConvNets on ImageNet, using a consistent design family with 3×3 convolutions and ReLU, and shows that pushing depth to 16–19 weight layers substantially improves accuracy and yields representations that transfer well beyond ImageNet.

Briefing

This paper asks a focused but high-impact question for large-scale computer vision: how does convolutional network depth affect accuracy when other architectural factors are held largely constant? The authors’ motivation is that, as ConvNets became the dominant approach for ImageNet-scale recognition, “depth” was increasingly suspected to be a key driver of performance, but the field lacked a careful, controlled evaluation of very deep plain convolutional architectures using a consistent design philosophy. Depth matters because it changes the representational capacity of the network and the hierarchy of features learned from pixels to semantics; it also affects optimization stability and computational trade-offs. In the broader context of the 2012–2014 ImageNet era, the work sits alongside other “go deeper” efforts, but distinguishes itself by using a simple, systematic architecture family and by demonstrating that substantial gains can be achieved without exotic modules.

Methodologically, the study is an empirical architecture comparison. The authors define a generic ConvNet pipeline for 224×224 RGB inputs: stacks of convolutional layers using very small 3×3 filters (stride 1 with padding to preserve spatial resolution), interleaved with max-pooling layers (2×2, stride 2). Rectified linear units (ReLU) are used after hidden layers. The network ends with three fully connected layers (4096, 4096, and 1000-way classification) and a softmax. They evaluate multiple configurations that differ primarily in depth: network A has 11 weight layers (8 conv + 3 FC), while network E has 19 weight layers (16 conv + 3 FC). Configuration C additionally inserts 1×1 convolutions to increase non-linearity without changing receptive field size; configuration A-LRN includes local response normalization, which is later shown not to help.
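
To make the configuration family concrete, the following is a minimal PyTorch sketch of configuration D (16 weight layers, all 3×3 convolutions). The function name make_vgg16 and the PyTorch framing are illustrative rather than the authors' implementation (they used a modified C++ Caffe), and the staged initialization described below is omitted.

```python
import torch.nn as nn

def make_vgg16(num_classes: int = 1000) -> nn.Sequential:
    # Channel plan for configuration D; 'M' marks a 2x2, stride-2 max-pool.
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']
    layers, in_ch = [], 3                       # 224x224 RGB input
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 conv, stride 1, padding 1 preserves spatial resolution; ReLU follows.
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    head = [nn.Flatten(),                       # 512 x 7 x 7 after five pools on a 224x224 input
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes)]       # 1000-way classifier (softmax applied in the loss)
    return nn.Sequential(*layers, *head)
```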

A key design principle is that a stack of 3×3 convolutions covers the same effective receptive field as a single larger filter (e.g., two 3×3 layers cover 5×5 and three cover 7×7) while introducing more non-linearities and using fewer parameters than the single larger filter. The authors also argue that 1×1 convolutions can add non-linearity and channel mixing without enlarging the receptive field.
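
As a quick check of the parameter argument, the sketch below compares three stacked 3×3 layers against a single 7×7 layer, assuming C channels both in and out of every layer as in the paper's comparison.

```python
# Weights (ignoring biases) for C input and C output channels per layer.
C = 512
three_3x3_stack = 3 * (3 * 3 * C * C)   # 27*C^2 = 7,077,888 weights, 3 ReLUs
single_7x7      = 7 * 7 * C * C         # 49*C^2 = 12,845,056 weights, 1 ReLU
# Both cover a 7x7 effective receptive field, but the stack is ~45% cheaper
# and interposes two extra non-linearities.
print(three_3x3_stack, single_7x7)
```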

Training is performed on ImageNet ILSVRC-2012 with standard splits: 1.3M training images, 50K validation images, and 100K test images. Optimization uses mini-batch gradient descent with momentum (batch size 256, momentum 0.9), multinomial logistic regression loss for classification, weight decay with an L2 penalty multiplier of 5×10^-4, and dropout (0.5) on the first two fully connected layers. Learning rate starts at 10^-2 and is reduced by a factor of 10 three times when validation accuracy plateaus; training stops after 370K iterations (74 epochs). For deeper networks, they use a staged initialization strategy: train the shallowest model A from random initialization, then initialize the first four convolutional layers and the last three fully connected layers of deeper models from A, while randomly initializing intermediate layers. Data augmentation includes random cropping from rescaled images, random horizontal flips, and random RGB color shifts. They compare fixed-scale training (smallest side S=256 or 384) versus multi-scale training where S is jittered in [256, 512].
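
A hedged PyTorch sketch of this training recipe follows (the original models were trained with a modified Caffe implementation): torchvision's vgg16 stands in for configuration D, the AlexNet-style RGB colour shift is omitted for brevity, and the learning-rate drops are left to a plateau scheduler rather than manual intervention.

```python
import random
import torch
from torchvision import transforms
from torchvision.models import vgg16
from torchvision.transforms import functional as TF

class RandomScaleJitter:
    """Rescale so the smallest image side is a random S in [s_min, s_max]."""
    def __init__(self, s_min: int = 256, s_max: int = 512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        return TF.resize(img, random.randint(self.s_min, self.s_max))

train_transform = transforms.Compose([
    RandomScaleJitter(256, 512),        # multi-scale training; use transforms.Resize(256) for fixed S
    transforms.RandomCrop(224),         # random 224x224 crop from the rescaled image
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = vgg16()                          # torchvision's VGG-16 (configuration D), randomly initialised
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when validation accuracy plateaus (done three times in the paper).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)
criterion = torch.nn.CrossEntropyLoss()  # multinomial logistic regression loss over 1000 classes
```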

Evaluation is done with several inference protocols. For classification, they convert fully connected layers to convolutional layers to enable dense evaluation over the whole image, then spatially average class score maps. They also test multi-scale evaluation by running the model at multiple test scales Q and averaging predictions, and multi-crop evaluation using a grid of crops with flips. Finally, they study model ensembling by averaging softmax outputs across multiple trained networks.
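
A hedged sketch of the FC-to-convolution conversion behind dense evaluation, assuming PyTorch and the head shapes above (25088→4096, 4096→4096, 4096→1000); the helper name fc_to_conv is illustrative.

```python
import torch.nn as nn

def fc_to_conv(fc1: nn.Linear, fc2: nn.Linear, fc3: nn.Linear) -> nn.Sequential:
    """Reinterpret the trained FC head as convolutions so the net accepts whole images."""
    conv1 = nn.Conv2d(512, 4096, kernel_size=7)     # first FC -> 7x7 conv
    conv1.weight.data.copy_(fc1.weight.data.view(4096, 512, 7, 7))
    conv1.bias.data.copy_(fc1.bias.data)
    conv2 = nn.Conv2d(4096, 4096, kernel_size=1)    # remaining FCs -> 1x1 convs
    conv2.weight.data.copy_(fc2.weight.data.view(4096, 4096, 1, 1))
    conv2.bias.data.copy_(fc2.bias.data)
    conv3 = nn.Conv2d(4096, 1000, kernel_size=1)
    conv3.weight.data.copy_(fc3.weight.data.view(1000, 4096, 1, 1))
    conv3.bias.data.copy_(fc3.bias.data)
    return nn.Sequential(conv1, nn.ReLU(inplace=True),
                         conv2, nn.ReLU(inplace=True), conv3)

# Applying the convolutional trunk followed by fc_to_conv(...) to an uncropped image
# yields a class score map of shape [N, 1000, H', W']; spatially averaging it
# (scores.mean(dim=(2, 3))) gives the fixed-size class scores used for dense evaluation.
```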

The key findings are that increasing depth (up to 16–19 weight layers) yields consistent accuracy improvements, and that the best results come from combining depth with scale jittering and multi-scale/dense inference. On single-scale validation evaluation, top-1 error decreases as depth increases: network A (11 layers) achieves 29.6% top-1 error (top-5 10.4%), network B (13 layers) improves to 28.7% (9.9%), network C (16 layers with 1×1 convs) reaches 28.1% (9.4%), network D (16 layers with 3×3 convs throughout) reaches 27.0% (8.8%), and network E (19 layers) reaches 27.3% (9.0%) at S=256/Q=256. When training uses scale jittering (S in [256, 512]) and testing uses Q=384, the deepest models improve further: network D reaches 25.6% top-1 error (8.1% top-5) and network E reaches 25.5% top-1 error (8.0% top-5). The comparison between configurations C and D is also instructive: the extra 1×1 non-linearity helps (C beats B), but capturing spatial context with non-trivial 3×3 receptive fields matters more (D beats C).

Multi-scale evaluation amplifies these gains. With scale jittering during training and testing over Q ∈ {256, 384, 512}, the best single-network validation performance is 24.8% top-1 error and 7.5% top-5 error (for both D and E under the reported settings). On the test set, configuration E achieves 7.3% top-5 error. Comparing dense evaluation to multi-crop evaluation shows multi-crop is slightly better, and combining them is better still: for network D, dense gives 24.8%/7.5% (top-1/top-5), multi-crop gives 24.6%/7.5%, and their combination yields 24.4%/7.2%. For network E, the combined approach yields 24.4%/7.1%.

Ensembling further improves performance. The authors describe an ILSVRC-2014 submission ensemble of 7 networks achieving 7.3% top-5 test error. After the submission, they reduce to an ensemble of two best multi-scale models (D and E), achieving 7.0% top-5 error with dense evaluation and 6.8% when combining dense and multi-crop evaluation. Their best single model achieves 7.1% test error (network E).
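
The ensembling step itself amounts to averaging class posteriors; a minimal sketch (assuming an already-trained list of models and a preprocessed batch of images) is shown below.

```python
import torch

def ensemble_predict(models, images):
    """Average softmax posteriors over several trained networks (soft voting)."""
    probs = [torch.softmax(m(images), dim=1) for m in models]   # each [N, 1000]
    return torch.stack(probs).mean(dim=0)                       # averaged [N, 1000]
```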

In comparison to prior art on ILSVRC-2014 classification, their approach (denoted “VGG” in the results tables) substantially outperforms earlier systems. The paper reports that GoogLeNet (1 net) has 7.9% top-5 test error, while VGG (1 net) achieves 7.0% and VGG (2 nets) achieves 6.8%. They also compare against Clarifai (multiple nets) with 11.7% top-5 test error and OverFeat (7 nets) with 13.6% top-5 test error (numbers shown in their table without outside training data). The contribution is thus both architectural (depth) and practical (a training/evaluation recipe that yields top-tier results with relatively simple components).

The paper also extends the depth-based representation to localization and transfer learning. For ILSVRC-2014 localization, they adapt a deep ConvNet where the final layer regresses bounding box parameters (center coordinates, width, height). They compare class-agnostic single-class regression (SCR) versus per-class regression (PCR) and find PCR is better. Under their simplified validation protocol, fine-tuning only the first two fully connected layers with SCR yields 36.4% localization error, switching to PCR yields 34.3%, and fine-tuning all layers with PCR yields 33.1%. In the fully-fledged evaluation, using dense application and merging predictions, their best system achieves 25.3% top-5 localization error on the test set, winning the localization challenge.
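
A minimal sketch of such a regression head is given below, assuming it replaces the 1000-way classifier on top of the 4096-D penultimate layer; the helper name make_bbox_head is illustrative, and the paper additionally merges and scores the resulting boxes with its own heuristics.

```python
import torch.nn as nn

def make_bbox_head(per_class: bool = True, num_classes: int = 1000) -> nn.Linear:
    """Bounding-box regressor: SCR predicts one box, PCR predicts one box per class."""
    out_dim = 4 * num_classes if per_class else 4   # (cx, cy, w, h) per predicted box
    return nn.Linear(4096, out_dim)
```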

Finally, the authors test generalization by using the trained networks as fixed feature extractors on other datasets with a linear SVM. They remove the last classification layer and use 4096-D penultimate activations, aggregated across multiple locations and scales, then L2-normalized. On PASCAL VOC 2007 and 2012, their deeper Net-E and Net-D features achieve mean average precision around 89.3 on VOC-2007 and 89.0 on VOC-2012 validation, with the combined Net-D & Net-E improving slightly (89.7 on VOC-2007 and 89.3 on VOC-2012). On Caltech-101 and Caltech-256, Net-E improves over Net-D and their combination further improves performance; on Caltech-256 they report an 8.6% improvement over the prior best result in their comparison table. They also report state-of-the-art action classification on VOC-2012 using image-only features (79.2 mean AP) and improved performance when stacking image and provided person bounding box features (84.0 mean AP).
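
A hedged sketch of this off-the-shelf transfer recipe using scikit-learn follows; the 4096-D features are assumed to have been extracted and aggregated over locations and scales already, and LinearSVC with C=1.0 is an assumed stand-in for the paper's SVM training setup.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def fit_linear_svm(train_feats: np.ndarray, train_labels: np.ndarray,
                   test_feats: np.ndarray) -> np.ndarray:
    """Train a linear SVM on L2-normalised 4096-D ConvNet descriptors and predict labels."""
    Xtr = normalize(train_feats, norm='l2')   # L2-normalise image descriptors
    Xte = normalize(test_feats, norm='l2')
    clf = LinearSVC(C=1.0)                    # C=1.0 is an assumed value, not from the paper
    clf.fit(Xtr, train_labels)
    return clf.predict(Xte)
```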

The paper does not frame its limitations in terms of formal statistical uncertainty (no confidence intervals or hypothesis tests are reported), and the comparisons are controlled by architecture and training recipe rather than by randomized trials. The evaluation depends on substantial compute (training for weeks on multi-GPU systems) and on careful initialization and augmentation; thus, the results may not directly transfer to settings with different optimization budgets. Additionally, the paper notes that performance saturates at 19 weight layers on ImageNet classification, suggesting diminishing returns on that dataset, and it conjectures that deeper models might help on larger datasets. For localization, the method is tied to the specific ILSVRC-2014 evaluation protocol and uses particular merging heuristics.

Practically, the implications are clear for practitioners designing vision models: (1) depth is a reliable lever for accuracy in large-scale recognition when paired with small convolution kernels and ReLU non-linearities; (2) scale jittering during training and multi-scale/dense inference at test time materially improve accuracy; (3) ensembling a small number of strong deep models can yield large gains; and (4) the resulting representations transfer well to other tasks even without fine-tuning, enabling strong “off-the-shelf” feature pipelines. This should matter to anyone building image recognition systems, to researchers studying representation learning and transfer, and to engineers seeking a principled baseline architecture that is both simple and state-of-the-art for its time. The public release of the two best-performing models is intended to facilitate exactly this broader adoption.

Overall, the paper’s core contribution is the demonstration that a conventional ConvNet architecture can be made dramatically more accurate by systematically increasing depth to 16–19 weight layers using 3×3 convolutions, and that these deep representations generalize strongly beyond ImageNet.

Cornell Notes

The paper systematically evaluates how increasing ConvNet depth affects large-scale image recognition accuracy, using a consistent architecture with small 3×3 filters. It shows that pushing depth to 16–19 weight layers yields major improvements on ImageNet classification and localization, and that the learned representations transfer well to other datasets as fixed features.

What is the paper’s main research question and why does it matter?

How does convolutional network depth influence accuracy in large-scale image recognition when other architectural choices are held mostly constant? This matters because depth changes representational capacity and feature hierarchies, and it was unclear whether simply going deeper would reliably improve ImageNet-scale performance.

What study design is used to isolate the effect of depth?

The authors define a family of ConvNet architectures that differ primarily in depth (11 to 19 weight layers) while keeping the overall design principles fixed: 3×3 convolutions, ReLU, max-pooling schedule, and the same fully connected head.

What architectural strategy enables very deep networks without large receptive fields per layer?

They use small 3×3 convolution filters throughout. Stacking multiple 3×3 layers increases effective receptive field while adding more non-linearities and reducing parameters versus using a single larger filter.

How is the network trained for ImageNet classification (key hyperparameters)?

Mini-batch SGD with momentum (batch size 256, momentum 0.9), weight decay 5×10^-4, dropout 0.5 on the first two FC layers, learning rate starting at 10^-2 and reduced by 10× three times, stopping after 370K iterations (74 epochs). Data augmentation includes random crops, horizontal flips, and random RGB color shifts.

How do training scale strategies differ, and what is the effect?

They compare fixed smallest-side training (S=256 or 384) versus multi-scale training where S is jittered in [256, 512]. Scale jittering improves results substantially (e.g., for D: 27.0% top-1 at fixed Q=S=256 vs 25.6% top-1 when trained with [256,512] and tested at Q=384).

What is the primary classification result showing depth helps?

Top-1 validation error decreases with depth: A (11 layers) 29.6% → B (13) 28.7% → C (16 with 1×1) 28.1% → D (16 with 3×3) 27.0% → E (19) 27.3% at the single-scale setting. With scale jittering and multi-scale testing, the best models reach 24.8% top-1 and 7.5% top-5 error on validation.

What is the best single-model ImageNet performance reported?

Configuration E achieves 7.1% top-5 error on the ILSVRC test set when using multi-crop & dense evaluation (and 7.3% top-5 error for the 7-model submission ensemble).

How much do ensembling and combining dense + multi-crop help?

An ensemble of two models (D and E) reduces test top-5 error to 7.0% with dense evaluation and 6.8% with dense + multi-crop. Combining dense and multi-crop also improves single-model validation top-5 error (e.g., D: 7.5% dense vs 7.2% combined).

How do the learned representations generalize to other datasets?

Using fixed deep features (4096-D penultimate activations with multi-scale aggregation) and a linear SVM, the authors report strong results on VOC and Caltech. For example, on Caltech-256 their approach outperforms the prior best by 8.6% in their comparison table, and on VOC-2012 action classification they reach 79.2 mean AP using image-only features and 84.0 when stacking image and person bounding-box features.

Review Questions

  1. Which parts of the architecture were held constant across depth variants, and which were allowed to change? Why is that important for causal interpretation?

  2. Explain why stacking multiple 3×3 convolutions can outperform a single larger convolution in both parameter efficiency and non-linearity. What does the paper claim about this trade-off?

  3. Summarize the evidence that scale jittering and multi-scale/dense evaluation are synergistic with depth. Use at least two numeric results from the tables.

  4. What does the paper conclude about the role of 1×1 convolutions (configuration C) compared to using 3×3 convolutions throughout (configuration D)?

  5. How do the transfer-learning experiments use the network (feature extraction, aggregation, classifier), and what datasets/metrics demonstrate generalization?

Key Points

  1. The paper evaluates depth as a primary architectural variable for ConvNets on ImageNet, using a consistent design family with 3×3 convolutions and ReLU.

  2. Increasing depth from 11 to 19 weight layers improves ImageNet classification accuracy, with best single-model validation performance of 24.8% top-1 and 7.5% top-5 error.

  3. Scale jittering during training (S in [256,512]) and multi-scale testing (Q ∈ {256,384,512}) provide large additional gains beyond depth alone.

  4. The best single-model test top-5 error is 7.1% (VGG E), while a two-model ensemble reduces it to 6.8% with dense + multi-crop evaluation.

  5. Using very deep features as fixed representations transfers well: strong results on PASCAL VOC, Caltech-101/256, and VOC-2012 action classification with linear SVMs and no fine-tuning.

  6. For localization, per-class bounding box regression (PCR) and fine-tuning all layers improve localization error, leading to a 25.3% top-5 localization error and a win in ILSVRC-2014 localization.

  7. The authors argue that performance saturates around 19 weight layers on ImageNet classification, but may improve further on larger datasets.

Highlights

“a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.”
Best single-network validation performance: “24.8%/7.5% top-1/top-5 error” (configuration D/E with multi-scale evaluation).
Best test performance: “configuration E achieves 7.3% top-5 error” and after the submission “reduced the test error to 6.8%” using a 2-model ensemble with dense + multi-crop evaluation.
Localization: “With 25.3% test error, our ‘VGG’ team won the localisation challenge of ILSVRC-2014.”
Transfer learning: on Caltech-256, their method “outperform[s] the state of the art … by a large margin (8.6%).”

Topics

  • Computer vision
  • Deep learning
  • Convolutional neural networks
  • Image classification
  • Object localization
  • Transfer learning
  • Representation learning
  • Large-scale benchmarks (ImageNet, ILSVRC, PASCAL VOC, Caltech)

Mentioned

  • Caffe (C++ toolbox)
  • ImageNet/ILSVRC evaluation server
  • NVIDIA Titan Black GPUs
  • ReLU
  • Dropout
  • Linear SVM
  • Karen Simonyan
  • Andrew Zisserman
  • Alex Krizhevsky
  • Ilya Sutskever
  • Geoffrey Hinton
  • Christian Szegedy
  • Sergey Ioffe
  • Shaoqing Ren
  • Jian Sun
  • Kaiming He
  • Ross Girshick
  • Jeff Donahue
  • Trevor Darrell
  • Jitendra Malik
  • Matthew Zeiler
  • Rob Fergus
  • ILSVRC - ImageNet Large-Scale Visual Recognition Challenge
  • ImageNet - Large-scale image dataset used for recognition benchmarks
  • ConvNet - Convolutional neural network
  • FC - Fully connected layer
  • LRN - Local Response Normalization
  • SGD - Stochastic Gradient Descent
  • mAP - mean Average Precision
  • SVM - Support Vector Machine
  • PCR - Per-class regression (bounding box regression)
  • SCR - Single-class regression (bounding box regression)
  • AP - Average Precision
  • VOC - PASCAL Visual Object Classes
  • RGB - Red Green Blue
  • ReLU - Rectified Linear Unit
  • top-1 error - fraction of samples whose correct class is not the top prediction
  • top-5 error - fraction of samples whose correct class is not in the top five predictions