Lecture 2A: Convolutional Neural Networks (Full Stack Deep Learning - Spring 2021)

TL;DR

Fully connected image models flatten pixels into a large vector, causing weight counts to grow rapidly with image resolution.

Briefing

Convolutional neural networks gained their edge in computer vision by replacing the “flatten an image and learn a giant matrix” approach with a sliding, weight-sharing operation that scales far better as images get larger. Fully connected networks treat every pixel as a distinct input feature, so the number of learnable weights grows rapidly with image size: a 32×32 grayscale image flattens to 1024 values, but moving to 64×64 multiplies the input dimensionality by four, and 128×128 multiplies it by sixteen—ballooning the parameter count. Convnets also address a second weakness: fully connected models are not naturally invariant to translations. If an object shifts a few pixels, the fully connected layer effectively “looks at” different pixel positions, so the model’s outputs can change unless translation robustness is engineered via augmentation.
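The scaling argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the hidden width of 100 units and the bank of ten 5×5 filters are assumed values, not figures from the lecture.

```python
def dense_weights(height, width, channels=1, hidden_units=100):
    """Weights in one fully connected layer fed by a flattened image.

    hidden_units=100 is an assumed layer width chosen for illustration.
    """
    return height * width * channels * hidden_units


def conv_weights(kernel=5, in_channels=1, num_filters=10):
    """Weights in one convolutional layer (biases ignored).

    The count does not depend on image height or width.
    """
    return kernel * kernel * in_channels * num_filters


for side in (32, 64, 128):
    # The dense count grows with image area; the conv count stays fixed.
    print(side, dense_weights(side, side), conv_weights())
```

Under these assumptions the dense layer's weight count quadruples each time the side length doubles, while the convolutional layer's count does not change.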

A convolutional filter fixes both issues by operating on local patches and reusing the same learned weights across the image. Instead of multiplying a whole-image vector by a massive matrix, a conv layer extracts a small window—commonly 5×5—flattens it, and computes a dot product with a learned weight vector to produce one output value for that patch. Sliding the window across height and width yields an output map with one value per patch location. Historically, carefully chosen convolution kernels could perform interpretable image processing like blurring; modern convnets learn those weights directly from data.
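The patch-extraction view described above can be sketched in plain Python. This is a minimal illustration, assuming a single-channel image stored as a list of rows, stride 1, and no padding; the function name `conv2d_single` is hypothetical. The fixed box-blur kernel stands in for the hand-chosen kernels of classical image processing, where a convnet would learn the weights instead.

```python
def conv2d_single(image, kernel):
    """Slide `kernel` over `image` (lists of rows), stride 1, no padding.

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    flat_kernel = [w for row in kernel for w in row]
    output = []
    for top in range(img_h - k_h + 1):
        out_row = []
        for left in range(img_w - k_w + 1):
            # Extract the patch and flatten it, row by row.
            patch = [image[top + i][left + j]
                     for i in range(k_h) for j in range(k_w)]
            # One dot product gives one output value for this location.
            out_row.append(sum(p * w for p, w in zip(patch, flat_kernel)))
        output.append(out_row)
    return output


# A hand-chosen 5x5 box-blur kernel: every weight is 1/25.
blur = [[1 / 25] * 5 for _ in range(5)]
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
blurred = conv2d_single(image, blur)
print(len(blurred), len(blurred[0]))  # 4 4
```

An 8×8 input with a 5×5 window has 4 valid window positions along each axis, which is why the output map is 4×4.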

The core operation extends naturally to color images and multiple feature maps. For RGB inputs shaped 32×32×3, a 5×5×3 patch contains 75 values, so each filter learns 75 weights. Applying multiple filters produces multiple channels in the output tensor: for example, with stride 1 and no padding, 10 filters turn a 32×32×3 input into a 28×28×10 output. Crucially, the output has the same three-dimensional tensor structure as the input, enabling stacking: one convolutional layer’s output can feed the next. Because each convolution is linear, and a stack of linear maps is still linear, convnets typically insert a nonlinearity (often ReLU) after each convolution to increase expressiveness.
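A multi-channel, multi-filter layer can be sketched the same way. The code below is a slow, pure-Python illustration rather than how any library implements convolution; it assumes stride 1, no padding, no bias, and random placeholder weights, and the names `conv_layer` and `relu` are hypothetical.

```python
import random


def relu(x):
    """Rectified linear unit applied to a single value."""
    return x if x > 0.0 else 0.0


def conv_layer(image, filters):
    """Apply each filter to an H x W x C image (nested lists), then ReLU.

    `filters` is a list of K filters, each shaped F x F x C.
    Stride 1, no padding, no bias (all assumptions of this sketch).
    Returns an (H-F+1) x (W-F+1) x K nested list.
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    size = len(filters[0])
    out_h, out_w = height - size + 1, width - size + 1
    output = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for k, filt in enumerate(filters):
        for top in range(out_h):
            for left in range(out_w):
                total = 0.0
                for i in range(size):
                    for j in range(size):
                        for c in range(channels):
                            total += image[top + i][left + j][c] * filt[i][j][c]
                output[top][left][k] = relu(total)
    return output


rng = random.Random(0)
# A random 32x32x3 "image" and ten random 5x5x3 filters (placeholder values).
image = [[[rng.random() for _ in range(3)] for _ in range(32)] for _ in range(32)]
filters = [[[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
            for _ in range(5)] for _ in range(10)]

weights_per_filter = 5 * 5 * 3
features = conv_layer(image, filters)
print(weights_per_filter)                                     # 75
print(len(features), len(features[0]), len(features[0][0]))   # 28 28 10
```

Because the result is again a height × width × channels tensor, it could be passed straight into another call to `conv_layer` with filters of matching depth.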

Two practical knobs control how the filter moves and how tensor sizes change. Stride determines the step size of the sliding window; larger strides skip positions and downsample the feature map. Padding adds a border around the input (often zeros) so the filter can still be applied near edges; “same” padding is chosen so that, at stride 1, output spatial dimensions equal the input’s, while “valid” uses no padding. For input size W, filter size F, padding P on each side, and stride S, each output spatial dimension is ⌊(W − F + 2P) / S⌋ + 1.
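The size arithmetic for stride and padding can be written as a small helper. This is a sketch with hypothetical function names; it assumes symmetric padding and, for `same_padding`, an odd filter size at stride 1.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis of a convolution.

    `padding` is the number of pixels added on each side.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size):
    """Per-side padding that preserves size at stride 1 (odd kernels only)."""
    return (kernel_size - 1) // 2


print(conv_output_size(32, 5))                            # "valid": 28
print(conv_output_size(32, 5, padding=same_padding(5)))   # "same": 32
print(conv_output_size(32, 5, stride=2, padding=2))       # strided: 16
```

The same function also covers pooling: a 2×2 window with stride 2 on a 28-pixel axis gives `conv_output_size(28, 2, stride=2)`, which is 14.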

Beyond basic convolution, the lecture highlights operations that shape what the network can “see” and how it compresses information. Stacking convolutions increases the receptive field: two 3×3 layers can match the spatial coverage of a single 5×5, often with better empirical performance due to added nonlinearity. Dilated convolutions expand receptive field without increasing parameters by skipping pixels inside the kernel footprint. To reduce spatial dimensions, pooling—especially 2×2 max pooling—summarizes local regions by taking max (or sometimes average). A 1×1 convolution reduces channel depth while mixing information across channels at each spatial location.
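The receptive-field claims can be checked with the usual recurrence, in which each layer widens the field by (kernel − 1) × dilation × the product of all earlier strides. The helper below is an illustrative sketch along one spatial axis; its name and the tuple layout are assumptions.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from input to output.
    """
    field, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        # Each layer extends the field by its (dilated) kernel extent,
        # scaled by how far apart neighbouring outputs are in the input.
        field += (kernel_size - 1) * dilation * jump
        jump *= stride
    return field


print(receptive_field([(3, 1, 1), (3, 1, 1)]))             # two 3x3 convs: 5
print(receptive_field([(5, 1, 1)]))                        # one 5x5 conv: 5
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))  # dilations 1, 2, 4: 15
```

The last line shows why stacked dilated layers grow coverage quickly: three 3×3 layers with doubling dilation see 15 pixels per axis while still using only nine weights per filter channel each.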

As a baseline architecture, the lecture describes the classic LeNet-style pattern: repeated blocks of convolution + nonlinearity, followed by pooling, then fully connected layers and a softmax output. The discussion also clarifies training pragmatics, such as placing activations between fully connected layers but avoiding a nonlinearity after the final fully connected layer when the loss function expects raw logits.
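The point about raw logits can be illustrated numerically. The sketch below assumes a loss that folds the softmax into the cross-entropy computation, as many frameworks do; the function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted for numerical stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target_index):
    """Negative log-probability of the target class, via log-sum-exp.

    Takes raw logits: applying softmax or ReLU first would change the result.
    """
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target_index]


logits = [2.0, 1.0, 0.1]          # output of the final fully connected layer
probs = softmax(logits)           # only needed when probabilities are reported
loss = cross_entropy_from_logits(logits, target_index=0)
print(probs, loss)
```

Here `loss` equals the negative log of `probs[0]`, so feeding already-normalized probabilities into such a loss would apply the softmax twice.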

Cornell Notes

Convolutional neural networks replace fully connected image processing with local, sliding filters that reuse the same weights across the image. This avoids the rapid parameter growth of fully connected layers and improves translation robustness because the same detector responds wherever the pattern appears. Convnets stack convolution layers (with nonlinearities) to build richer features, while stride and padding control downsampling and output sizes. Receptive field expands through stacking or dilated convolutions; pooling reduces spatial size, and 1×1 convolutions reduce channel count. A LeNet-style architecture (conv + activation + pooling repeated, then fully connected layers and softmax) serves as a classic baseline.

Why do fully connected networks scale poorly for images, and how does convolution address that?

Fully connected image models flatten an image into a long vector and multiply by a large weight matrix. As image resolution increases, the input dimensionality—and thus the number of weights—grows quickly (e.g., 32×32 → 1024 inputs; 64×64 increases inputs by 4×; 128×128 increases by 16×). Convolution instead applies a small filter to local patches and reuses the same learned weights across all spatial locations, so parameter count depends on filter size and number of filters rather than the full image area. It also improves translation robustness because the same pattern detector runs across the image rather than binding each output to fixed pixel positions.

How does a convolutional filter produce an output map from an image?

A filter (e.g., 5×5) is flattened and dotted with each extracted 5×5 patch as the window slides across height and width. Each patch yields one output value, so sliding across the image produces a 2D output map (one value per patch location). With multiple filters, the output gains a channel dimension: for RGB inputs, a 5×5×3 patch becomes 75 values, and each filter learns 75 weights; 10 filters produce 10 output channels.

What roles do stride and padding play in convolutional layers?

Stride controls how far the filter moves each step. Stride 1 moves one pixel at a time; larger strides skip positions and downsample the feature map. Padding adds a border around the input (commonly zeros) so the filter can still be applied at edges. “Same” padding is chosen so output spatial dimensions match input; “valid” uses no padding, shrinking the output.

How can receptive field grow without using larger kernels?

Receptive field expands when layers stack. Two 3×3 convolutions can cover the same 5×5 area as a single 5×5 convolution, and the extra nonlinearity often improves performance. Dilated convolutions increase receptive field by skipping pixels inside the kernel footprint, allowing larger coverage without increasing parameter count. Stacking dilated layers can make receptive field grow quickly.
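What "skipping pixels inside the kernel footprint" means is easiest to see in one dimension. The following is a minimal sketch, assuming stride 1 and no padding; `dilated_conv1d` is a hypothetical helper name.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1-D convolution (no kernel flip, stride 1, no padding).

    Kernel taps are spaced `dilation` samples apart, so a kernel with
    k weights spans (k - 1) * dilation + 1 input samples.
    """
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 1, 1]   # three weights in both cases

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 adjacent samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output spans 5 samples
print(dense)    # [3, 6, 9, 12, 15, 18, 21, 24]
print(dilated)  # [6, 9, 12, 15, 18, 21]
```

With dilation 2 the first output reads samples 0, 2, and 4, so it covers a 5-sample window while the kernel still holds only three weights.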

What are pooling and 1×1 convolutions used for?

Pooling reduces spatial resolution. The common choice is 2×2 max pooling, which replaces each 2×2 region with its maximum value (average pooling exists too). 1×1 convolutions reduce channel depth: they apply a learned linear combination across channels at each spatial location while using a receptive field of 1×1, effectively mixing channel information independently per pixel.
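Both operations are short enough to sketch directly. The code assumes nested-list tensors, even spatial sizes for pooling, and no bias in the 1×1 convolution; the function names and example values are illustrative.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a single-channel map (even sizes assumed)."""
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, len(feature_map[0]), 2)]
        for r in range(0, len(feature_map), 2)
    ]


def conv_1x1(features, weights):
    """1x1 convolution on an H x W x C_in nested list.

    `weights` has shape C_out x C_in; each pixel's channel vector is
    mapped to C_out values using the same weights at every location.
    """
    return [
        [[sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
         for pixel in row]
        for row in features
    ]


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]]

features = [[[1, 2, 3], [4, 5, 6]],
            [[7, 8, 9], [1, 0, 1]]]       # 2 x 2 spatial, 3 channels
weights = [[1, 0, 0], [0.5, 0.5, 0]]      # 3 channels in, 2 channels out
reduced = conv_1x1(features, weights)
print(reduced)  # [[[1, 1.5], [4, 4.5]], [[7, 7.5], [1, 0.5]]]
```

Pooling halves height and width and leaves the channel count alone, whereas the 1×1 convolution keeps the 2×2 spatial grid and changes only the channel count, here from 3 to 2.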

What is the LeNet-style convolutional architecture pattern described here?

The baseline pattern repeats convolution + nonlinearity (often ReLU) and pooling multiple times to shrink the spatial dimensions. After enough conv/pool blocks, the network uses fully connected layers to produce class scores, then applies softmax for the final output. The lecture’s example uses small 32×32 inputs (MNIST digits are natively 28×28; LeNet places them in a 32×32 field), small 5×5 convolutions, a small number of filters, pooling, and two fully connected layers.
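Tracing tensor shapes through such a stack shows how the spatial dimensions shrink before the fully connected layers. The sketch below assumes unpadded stride-1 convolutions and 2×2 pooling; the filter counts (6 and 16) are borrowed from the original LeNet-5 as an assumption, since the summary above only says "a small number of filters".

```python
def conv_shape(shape, kernel, num_filters):
    """Shape after a stride-1, unpadded convolution."""
    height, width, _ = shape
    return (height - kernel + 1, width - kernel + 1, num_filters)


def pool_shape(shape, window=2):
    """Shape after non-overlapping pooling."""
    height, width, channels = shape
    return (height // window, width // window, channels)


shape = (32, 32, 1)   # one grayscale 32x32 input
trace = [("input", shape)]
# Filter counts (6, 16) are assumed, following the original LeNet-5.
for name, step in [
    ("conv 5x5 + ReLU", lambda s: conv_shape(s, 5, 6)),
    ("max pool 2x2", pool_shape),
    ("conv 5x5 + ReLU", lambda s: conv_shape(s, 5, 16)),
    ("max pool 2x2", pool_shape),
]:
    shape = step(shape)
    trace.append((name, shape))

# Input size of the first fully connected layer.
flattened = shape[0] * shape[1] * shape[2]
for name, s in trace:
    print(f"{name:16s} {s}")
print("flatten", flattened)  # flatten 400
```

Under these assumptions the 32×32 input is reduced to a 5×5×16 tensor, so the first fully connected layer sees 400 inputs rather than the 1024 raw pixels.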

Review Questions

How does parameter count in a fully connected image model change as image resolution increases, and why does weight sharing in convolution prevent the same scaling problem?
Given a convolution with a certain filter size, stride, and padding, what determines the output spatial dimensions, and how do “same” and “valid” differ?
Why might two stacked 3×3 convolutions outperform a single 5×5 convolution even when they cover the same receptive field?

Key Points

1. Fully connected image models flatten pixels into a large vector, causing weight counts to grow rapidly with image resolution.
2. Convolutional filters operate on local patches and reuse the same weights across spatial locations, improving both scalability and translation robustness.
3. Stacking convolution layers (with nonlinearities) builds increasingly complex features because each layer’s output retains a tensor structure suitable for further convolutions.
4. Stride down-samples by skipping filter positions, while padding (often zeros) prevents edge information from being discarded; “same” padding targets equal input/output spatial sizes.
5. Receptive field expands through stacking or dilated convolutions; dilations enlarge coverage without adding parameters by skipping pixels within the kernel footprint.
6. Pooling (commonly 2×2 max pooling) reduces spatial dimensions, while 1×1 convolutions reduce channel depth by mixing information at each pixel location.
7. A LeNet-style baseline alternates conv + activation and pooling blocks, then uses fully connected layers and softmax for classification.

Highlights

Fully connected networks bind each output to fixed pixel positions, so they lack natural translation invariance without extra measures like augmentation.

A 5×5×3 convolution on RGB inputs learns 75 weights per filter and produces multiple output channels when multiple filters are used.

“Same” padding is chosen so that output spatial dimensions match the input (at stride 1), while “valid” applies no padding, so the output shrinks.

Two 3×3 convolutions can match the receptive field of a 5×5 convolution, often performing better due to added nonlinearity.

Dilated convolutions expand receptive field by skipping pixels, increasing coverage without increasing parameter count.

Topics

Convolutional Filters
Stride And Padding
Receptive Field
Dilated Convolutions
LeNet Architecture

Lecture 2A: Convolutional Neural Networks (Full Stack Deep Learning - Spring 2021)