
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
7 min read

Read the full paper via its DOI or on arXiv

TL;DR

The paper proposes an attention-based image captioning model that learns word-by-word alignment between generated text and spatial CNN features.

Briefing

This paper asks whether adding an explicit visual attention mechanism to neural image caption generation improves both caption quality and interpretability. The problem matters because captioning requires more than mapping an image to a single fixed embedding: it must decide which parts of the image are relevant at each word position and express relationships in natural language. Prior “show and tell” style models typically encode an image once (e.g., using the top layer of a CNN) and then decode a sentence with an RNN, which can force the model to compress all visual information into a static vector. That compression can lose fine-grained details needed for richer captions, especially in cluttered scenes.

The authors’ contribution is an encoder–decoder captioning framework with attention over spatial CNN features. They propose two attention variants under a shared architecture: (1) a deterministic “soft” attention mechanism that computes a weighted average of image annotation vectors using attention weights $\alpha_{t,i}$ at each decoding step, enabling end-to-end training with standard backpropagation; and (2) a stochastic “hard” attention mechanism that samples a single spatial location $s_t$ at each time step (a latent variable), trained by maximizing a variational lower bound using Monte Carlo gradient estimates, which is equivalent to REINFORCE-style learning.

Methodologically, the encoder extracts a set of spatial “annotation vectors” from a convolutional feature map. Concretely, they use Oxford VGG features pretrained on ImageNet without fine-tuning, taking the 14×14×512 feature map before max pooling. Flattening yields $L = 196$ spatial locations, each represented by a $D = 512$-dimensional annotation vector $a_i$. The decoder is an LSTM that generates a caption word by word. At each time step $t$, the previous LSTM hidden state $h_{t-1}$ conditions an attention network to score each spatial location, $e_{t,i} = f_{att}(a_i, h_{t-1})$, followed by a softmax to obtain normalized attention weights $\alpha_{t,i}$. The context vector $\hat{z}_t$ is then computed either as a weighted sum (soft attention) or from a sampled feature vector (hard attention). The output layer combines the previous word embedding, the LSTM hidden state, and the context vector to produce next-word probabilities.
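To make the attention step concrete, here is a minimal NumPy sketch of scoring and weighting the annotation vectors. The MLP form of $f_{att}$, the hidden size, and the parameter names (W_a, W_h, v) are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Sizes from the paper: L = 196 locations, D = 512 features; H is a hypothetical LSTM size.
L, D, H = 196, 512, 1024
rng = np.random.default_rng(0)

a = rng.standard_normal((L, D))      # annotation vectors a_i from the CNN feature map
h_prev = rng.standard_normal(H)      # previous LSTM hidden state h_{t-1}

# Hypothetical parameters of the attention network f_att (a small MLP).
W_a = rng.standard_normal((D, 256)) * 0.01
W_h = rng.standard_normal((H, 256)) * 0.01
v = rng.standard_normal(256) * 0.01

# e_{t,i} = f_att(a_i, h_{t-1}), then a softmax over locations.
e = np.tanh(a @ W_a + h_prev @ W_h) @ v   # shape (L,)
alpha = softmax(e)                        # attention weights alpha_{t,i}

# Soft attention: deterministic context vector as the attention-weighted average.
z_soft = alpha @ a                        # shape (D,)
```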

For soft attention, the context vector is deterministic: $\hat{z}_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$. Training minimizes a penalized negative log-likelihood, $L_d = -\log p(y \mid a) + \lambda \sum_{i}^{L} \big(1 - \sum_{t}^{C} \alpha_{t,i}\big)^2$. The additional term encourages “doubly stochastic” behavior so that attention mass covers all image regions across the generated sequence (i.e., $\sum_t \alpha_{t,i} \approx 1$ for each location $i$). They also introduce a gating scalar $\beta_t = \sigma(f_\beta(h_{t-1}))$ that scales the context vector, $\hat{z}_t = \beta_t \sum_i \alpha_{t,i} a_i$, to emphasize salient objects.
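A short sketch of these two training-time pieces, assuming alphas is a (C, L) array holding one attention distribution per generated word; the regularization weight and the gate parameters (w_beta, b_beta) are illustrative.

```python
import numpy as np

def doubly_stochastic_penalty(alphas, lam=1.0):
    """Penalty encouraging sum_t alpha_{t,i} ~= 1 for every location i.

    alphas: array of shape (C, L), one attention distribution per word.
    Returns lam * sum_i (1 - sum_t alpha_{t,i})**2, added to the negative log-likelihood.
    """
    coverage = alphas.sum(axis=0)          # total attention received by each location
    return lam * np.sum((1.0 - coverage) ** 2)

def gated_context(alpha, a, h_prev, w_beta, b_beta):
    """Gating scalar beta_t = sigmoid(f_beta(h_{t-1})) scaling the soft context vector."""
    beta = 1.0 / (1.0 + np.exp(-(h_prev @ w_beta + b_beta)))
    return beta * (alpha @ a)
```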

For hard attention, they treat the attended location $s_t$ as a one-hot latent variable drawn from a multinoulli distribution parameterized by the attention weights $\{\alpha_{t,i}\}$. The context becomes a random variable, $\hat{z}_t = \sum_i s_{t,i} a_i$. They optimize a variational lower bound on $\log p(y \mid a)$, approximating gradients via Monte Carlo sampling of $s_t$. To reduce variance, they use a moving-average baseline $b$, add an entropy regularizer on the attention distribution, and use a hybrid strategy where with probability 0.5 they replace the sampled location with its expected value $\sum_i \alpha_{t,i} a_i$. They report that this hard attention formulation is equivalent to REINFORCE, where the reward is proportional to the log-likelihood of the target sentence under the sampled attention trajectory.
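The sampling and variance-reduction logic can be sketched as below. This is a simplified illustration of the described procedure, not the authors' code; the function names, the baseline decay, and the 0.5 switch probability follow the paper's description but the details are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_attention_step(alpha, a, use_expectation_prob=0.5):
    """One hard-attention step: sample a location or fall back to the expectation.

    alpha: (L,) attention distribution; a: (L, D) annotation vectors.
    Returns the context vector and the sampled index (None if the expectation is used).
    """
    if rng.random() < use_expectation_prob:
        return alpha @ a, None               # expected context, as in soft attention
    s = rng.choice(len(alpha), p=alpha)      # sample s_t ~ Multinoulli({alpha_{t,i}})
    return a[s], s

def reinforce_weight(log_likelihood, baseline, decay=0.9):
    """REINFORCE-style scaling for the score-function gradient d log p(s|a)/d params,
    with a moving-average baseline b for variance reduction: weight = log p(y|s,a) - b."""
    new_baseline = decay * baseline + (1 - decay) * log_likelihood
    return log_likelihood - new_baseline, new_baseline
```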

The experiments evaluate caption quality on Flickr8k (8,000 images), Flickr30k (30,000 images), and MS COCO (82,783 images). The datasets provide multiple reference captions per image (5 for the Flickr datasets; some COCO images have more than 5, and the extra references are discarded for consistency). They use a fixed vocabulary size of 10,000. They report BLEU-1 through BLEU-4 (without brevity penalty) and METEOR, and compare against prior methods while noting differences in feature extraction settings (some baselines use AlexNet or GoogLeNet features, whereas their own models use Oxford VGG features).

The key quantitative results are summarized in their Table 1. On Flickr8k, the soft-attention model achieves BLEU-1 = 67, BLEU-2 = 44.8, BLEU-3 = 29.9, BLEU-4 = 19.5, and METEOR = 18.93, while hard attention achieves BLEU-1 = 67, BLEU-2 = 45.7, BLEU-3 = 31.4, BLEU-4 = 21.3, and METEOR = 20.30. On Flickr30k, soft attention yields BLEU-1 = 66.7, BLEU-2 = 43.4, BLEU-3 = 28.8, BLEU-4 = 19.1, and METEOR = 18.49, while hard attention yields BLEU-1 = 66.9, BLEU-2 = 43.9, BLEU-3 = 29.6, BLEU-4 = 19.9, and METEOR = 18.46. On MS COCO, soft attention reports BLEU-1 = 70.7, BLEU-2 = 49.2, BLEU-3 = 34.4, BLEU-4 = 24.3, and METEOR = 23.90, while hard attention reports BLEU-1 = 71.8, BLEU-2 = 50.4, BLEU-3 = 35.7, BLEU-4 = 25.0, and METEOR = 23.04 (the table also includes several baseline rows with some metrics missing).

Overall, the authors claim state-of-the-art performance on all three datasets and emphasize that their approach uses a single model (not an ensemble) while still outperforming prior systems. They also note that they improve METEOR on MS COCO, attributing this partly to their regularization and the use of lower-level convolutional features that preserve more information than a single high-level vector.

Beyond numbers, the paper provides qualitative evidence by visualizing attention weights over the 14×14 spatial grid (upsampled and smoothed). They argue that the learned alignments correspond well to human intuition and that the hard/soft attention mechanisms can attend to non-object salient regions, unlike some prior attention methods that rely on object detectors to define candidate alignment targets. They further suggest that attention visualizations can help diagnose errors by revealing where the model “looked” when generating a word.

Limitations are not exhaustively quantified in the provided text, but several constraints are apparent from the methodology. First, the encoder is fixed to pretrained VGG features without fine-tuning, which may limit ceiling performance relative to fully end-to-end vision-language training. Second, hard attention relies on Monte Carlo sampling and variance-reduction tricks, which can be less stable and more expensive than soft attention; the paper does not provide a detailed compute comparison beyond noting training time for soft attention on COCO (less than 3 days on an NVIDIA Titan Black GPU). Third, evaluation relies on BLEU and METEOR, which are known to have shortcomings for semantic faithfulness; the authors acknowledge BLEU criticism by including METEOR.

Practically, the results matter for anyone building image captioning systems or other vision-language models that require word-by-word grounding. The work provides a clear recipe for implementing attention over spatial CNN features, demonstrates that attention improves caption metrics on major benchmarks, and offers interpretability through attention maps. Researchers and practitioners should care because the approach is modular (encoder-decoder with attention) and can be adapted to other tasks where dynamic selection of visual evidence is needed, such as visual question answering, referring expression generation, and multimodal translation.

In sum, the paper shows that learning to attend over spatial features—via differentiable soft attention or variationally trained hard attention—substantially improves neural image caption generation quality and yields interpretable alignments, establishing attention as a core component of modern captioning systems.

Cornell Notes

The paper introduces an attention-based neural image captioning model that learns to focus on different spatial regions of an image while generating each word. It presents both deterministic soft attention (trainable with backprop) and stochastic hard attention (trainable via a variational lower bound/REINFORCE), and validates the approach on Flickr8k, Flickr30k, and MS COCO with state-of-the-art BLEU and METEOR, plus interpretable attention visualizations.

What is the main research question of the paper?

Can adding a learned visual attention mechanism to neural image caption generation improve caption quality and provide interpretability compared with using a single static image representation?

Why does attention matter for caption generation?

Captioning requires selecting relevant visual evidence at each word position; attention allows the model to dynamically focus on salient image regions rather than compressing everything into one fixed vector.

What is the shared encoder–decoder framework used in both attention variants?

A CNN encoder extracts spatial annotation vectors from a convolutional feature map, and an LSTM decoder generates words sequentially, conditioning each step on an attention-derived context vector.

How does the encoder represent the image for attention?

It uses a pretrained Oxford VGG feature map of size 14×14×512, flattened to L = 196 spatial locations, each represented by a 512-dimensional annotation vector $a_i$.

How is soft (deterministic) attention computed and trained?

Soft attention computes the context $\hat{z}_t = \sum_i \alpha_{t,i} a_i$ using weights $\alpha_{t,i}$ from a softmax over attention scores $e_{t,i} = f_{att}(a_i, h_{t-1})$; training is end-to-end via standard backpropagation on a penalized negative log-likelihood with a doubly-stochastic regularizer.

How is hard (stochastic) attention computed and trained?

Hard attention samples a single location $s_t$ from a multinoulli distribution parameterized by the attention weights $\{\alpha_{t,i}\}$, making the context $\hat{z}_t$ a random variable; parameters are learned by maximizing a variational lower bound on $\log p(y \mid a)$ using Monte Carlo gradient estimates (equivalent to REINFORCE), with variance reduction via a moving-average baseline, entropy regularization, and a 0.5-probability switch to use the expected attention context.

What datasets and evaluation metrics are used?

Flickr8k (8,000 images), Flickr30k (30,000), and MS COCO (82,783); evaluation uses BLEU-1..4 (no brevity penalty) and METEOR.

What are the reported results on Flickr8k?

Soft attention: BLEU-1 = 67, BLEU-2 = 44.8, BLEU-3 = 29.9, BLEU-4 = 19.5, METEOR = 18.93; hard attention: BLEU-1 = 67, BLEU-2 = 45.7, BLEU-3 = 31.4, BLEU-4 = 21.3, METEOR = 20.30 (as shown in their Table 1).

What interpretability evidence do the authors provide?

They visualize attention weights over the 14×14 grid (upsampled and smoothed) and show that the model’s focus often matches human intuition about which regions matter for each generated word.

Review Questions

  1. How do the authors formulate attention as a latent variable in hard attention, and what objective is optimized to train it?

  2. What role does the doubly-stochastic regularization term play in soft attention training?

  3. Compare the computational/training implications of soft vs hard attention (differentiability vs sampling/variance reduction).

  4. What specific architectural choice enables spatial attention (which CNN layer and how are features reshaped)?

  5. How do the reported BLEU and METEOR improvements support the claim that attention improves caption generation beyond a static image embedding?

Key Points

  1. The paper proposes an attention-based image captioning model that learns word-by-word alignment between generated text and spatial CNN features.

  2. It introduces two attention mechanisms: deterministic soft attention (end-to-end differentiable) and stochastic hard attention (variational lower bound / REINFORCE).

  3. The encoder uses lower-level convolutional features (the VGG 14×14×512 feature map, flattened to 196 annotation vectors) so the decoder can selectively focus on image regions.

  4. Soft attention training includes a doubly-stochastic regularizer encouraging attention coverage across the image over the whole caption sequence.

  5. Hard attention uses Monte Carlo sampling of attended locations with variance reduction (moving-average baseline), entropy regularization, and a hybrid sampling/expected-value strategy.

  6. On Flickr8k, Flickr30k, and MS COCO, the attention models achieve state-of-the-art captioning performance in BLEU and METEOR using a single model (no ensemble).

  7. The learned attention weights can be visualized to provide interpretability, and the model can attend to non-object salient regions without relying on external object detectors.

Highlights

“We introduce an attention based model that automatically learns to describe the content of images.”
Soft attention on Flickr8k reports BLEU-4 = 19.5 and METEOR = 18.93 (hard attention: BLEU-4 = 21.3 and METEOR = 20.30).
Hard attention is “equivalent to REINFORCE” where the reward is proportional to the log likelihood of the target sentence under the sampled attention trajectory.
The model learns alignments that “correspond very strongly with human intuition,” and can attend to “non object” salient regions.

Topics

  • Computer Vision
  • Image Captioning
  • Multimodal Machine Learning
  • Neural Sequence Models
  • Attention Mechanisms
  • Reinforcement Learning for Neural Networks
  • Natural Language Generation
  • Encoder–Decoder Architectures

Mentioned

  • Oxford VGGNet
  • ImageNet
  • Theano
  • Whetlab
  • RMSProp
  • Adam
  • NVIDIA Titan Black
  • REINFORCE
  • Kelvin Xu
  • Jimmy Lei Ba
  • Ryan Kiros
  • Kyunghyun Cho
  • Aaron Courville
  • Ruslan Salakhutdinov
  • Richard S. Zemel
  • Yoshua Bengio
  • Dzmitry Bahdanau
  • Diederik P. Kingma
  • Ronald J. Williams
  • Nitish Srivastava
  • Relu Patrascu
  • LSTM - Long Short-Term Memory
  • CNN - Convolutional Neural Network
  • BLEU - Bilingual Evaluation Understudy
  • METEOR - Metric for Evaluation of Translation with Explicit ORdering
  • RNN - Recurrent Neural Network
  • VGG - Visual Geometry Group
  • COCO - Common Objects in Context
  • REINFORCE - A policy-gradient method for reinforcement learning
  • MLP - Multi-Layer Perceptron
  • SGD - Stochastic Gradient Descent
  • GPU - Graphics Processing Unit
  • VAE - Variational Autoencoder (related concept; paper uses variational lower bound rather than a full VAE)