Lecture 1: Introduction to Deep Learning - Full Stack Deep Learning - March 2019
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Deep learning’s breakthrough in 2012 wasn’t just a better model—it replaced hand-crafted image features with learned representations, turning “what to look for” into something the system figures out from data. The core shift was moving from engineered pipelines (like edge histograms fed into support vector machines) to training large neural networks end-to-end, where millions of parameters encode the decision logic. Instead of designing invariances such as “edges stay put under lighting changes,” researchers learned those invariances directly by optimizing weights using labeled examples.
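To make the contrast concrete, here is a minimal sketch of both pipelines on dummy data. The specific feature extractor, model sizes, and training details are illustrative assumptions, not specifics from the lecture.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
images = rng.random((100, 32, 32)).astype(np.float32)  # dummy grayscale images
labels = rng.integers(0, 2, size=100)                  # dummy binary labels

# --- Engineered pipeline: hand-crafted edge histograms fed into an SVM ---
# A human decides what to look for (here, a crude histogram of edge orientations).
def edge_histogram(im, bins=9):
    gy, gx = np.gradient(im)
    mag, ang = np.hypot(gx, gy), np.arctan2(gy, gx)
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

feats = np.stack([edge_histogram(im) for im in images])
svm = LinearSVC().fit(feats, labels)

# --- End-to-end pipeline: a small CNN learns its own features ---
# The convolutional filters replace the hand-designed features; their values
# are set by gradient descent on the labeled examples.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 16 * 16, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.from_numpy(images).unsqueeze(1)  # (N, 1, 32, 32)
y = torch.from_numpy(labels).long()
for _ in range(5):                         # a few illustrative steps
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                        # backpropagation computes gradients
    opt.step()                             # update the learned representation
```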
The lecture uses the ImageNet era to show why this mattered. In 2010 and 2011, traditional computer vision approaches improved slowly, with error rates around 30% for the winning system. In 2012, a neural network with roughly 60 million parameters produced a major jump and then accelerated progress in subsequent years. The explanation offered is practical as much as theoretical: neural networks can approximate complex functions, but they need large datasets to determine which function to learn. With enough labeled images (ImageNet’s scale is described as over a million), training becomes feasible, and the learned weights act like a program discovered through data rather than written by hand.
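The parameter figure is easy to picture by counting weights directly. The network below is a toy stand-in of roughly similar scale, not the actual 2012 architecture; it simply shows how a few convolutional and dense layers already add up to tens of millions of learned values.

```python
import torch
import torch.nn as nn

# Toy convolutional network of roughly similar scale to the one described
# (not the actual 2012 model). Every entry of every weight tensor is one
# parameter learned from the labeled data.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(192 * 12 * 12, 2048), nn.ReLU(),   # dense layers dominate the count
    nn.Linear(2048, 1000),                       # 1000 ImageNet classes
)
print(sum(p.numel() for p in net.parameters()))  # about 59 million parameters
print(net(torch.zeros(1, 3, 224, 224)).shape)    # sanity check: torch.Size([1, 1000])
```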
That same “learn input-to-output mappings” idea extends beyond image classification. Neural networks can be trained to generate captions from images using datasets that pair pictures with text, and multiple research groups reportedly achieved similar results quickly once the data and expertise were in place. The lecture then pushes the argument further: captioning can be seen as pattern matching, but question answering about an image is closer to understanding. A dataset pairing images, questions, and answers demonstrates this capability with examples like identifying vegetables and counting school buses.
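As a hypothetical sketch of how the same recipe carries over once the dataset pairs images with text: the target becomes a sequence of word tokens instead of a single label, but the loss and gradient machinery are unchanged. All module names and sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative captioning setup: the only structural change from classification
# is that the target is a token sequence rather than a single class label.
vocab_size, embed_dim = 1000, 128

encoder = nn.Sequential(                      # image -> feature vector
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
)
embed = nn.Embedding(vocab_size, embed_dim)   # caption tokens -> vectors
decoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
to_vocab = nn.Linear(embed_dim, vocab_size)

image = torch.randn(4, 3, 64, 64)             # dummy (image, caption) pairs
caption = torch.randint(0, vocab_size, (4, 12))

h0 = encoder(image).unsqueeze(0)              # image feature seeds the decoder state
out, _ = decoder(embed(caption[:, :-1]), h0)  # predict the next token at each step
logits = to_vocab(out)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), caption[:, 1:].reshape(-1))
loss.backward()                               # same gradient machinery as classification
```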
Applications also moved from lab benchmarks into domains where humans historically required extensive training. In medicine, the lecture points to neural networks matching long-established diagnostic workflows in areas like radiology and cancer screening, and notes that by April 2018 an FDA-approved neural network could support decisions for diabetes-related retinal referrals. Autonomous driving is framed as another major frontier, with many companies pursuing the market, though outcomes remain uncertain.
Why did deep learning suddenly surge after decades of earlier neural network work? Three drivers are highlighted: abundant data, more compute, and training methods that made large models practical. Backpropagation and convolutional architectures existed earlier, but the lecture argues that scaling data and compute changed what was achievable. Training runs are described as expensive—one strong example took about six days on state-of-the-art GPUs at the time—and progress depends on repeated experiments across hyperparameters and architectures.
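Backpropagation itself is just the chain rule applied layer by layer. The toy example below, with random data and a single hidden layer, shows the mechanics that scale up to the multi-day training runs described; all numbers are illustrative.

```python
import numpy as np

# Bare-bones backpropagation for a one-hidden-layer network. Scaled to millions
# of parameters and millions of examples, this is the loop behind the expensive
# training runs the lecture describes.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))                 # 64 examples, 10 features
y = rng.normal(size=(64, 1))                  # regression targets

W1 = rng.normal(scale=0.1, size=(10, 32))
W2 = rng.normal(scale=0.1, size=(32, 1))
lr = 0.01

for step in range(100):
    # forward pass
    h = np.maximum(0, X @ W1)                 # ReLU hidden layer
    pred = h @ W2
    loss = np.mean((pred - y) ** 2)

    # backward pass: chain rule, layer by layer (this is backpropagation)
    d_pred = 2 * (pred - y) / len(X)          # dL/dpred
    dW2 = h.T @ d_pred                        # dL/dW2
    d_h = d_pred @ W2.T                       # dL/dh
    d_h[h <= 0] = 0                           # gradient through the ReLU
    dW1 = X.T @ d_h                           # dL/dW1

    # gradient descent update
    W1 -= lr * dW1
    W2 -= lr * dW2
```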
The session closes by addressing skepticism around robustness, especially adversarial examples. The lecture’s framing is that random noise rarely breaks models, but carefully optimized small perturbations can flip predictions because high-dimensional spaces contain directions that change outputs with minimal pixel changes. Defenses include adversarial training (adding adversarially generated examples into the training set) and other robustness techniques, with progress ongoing rather than “solved.” It also notes that adversarial perturbations can be generated automatically by optimizing pixels to maximize a target class, effectively running the same gradient-based machinery used in training—just applied to the input instead of the weights.
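As one concrete instance of that idea, the sketch below takes a single fast-gradient-sign-style step on the input pixels toward a chosen target class. The classifier, image, and step size are placeholders; the lecture does not prescribe this particular attack.

```python
import torch
import torch.nn as nn

# The same backprop machinery used to train the weights, applied to the input
# image instead: nudge the pixels to make a chosen target class more likely.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier
model.eval()

image = torch.rand(1, 1, 28, 28, requires_grad=True)  # placeholder input image
target = torch.tensor([3])                             # class the attacker wants

loss = nn.functional.cross_entropy(model(image), target)
loss.backward()                                         # gradient w.r.t. the pixels

epsilon = 0.03                                          # small, barely visible change
adversarial = (image - epsilon * image.grad.sign()).clamp(0, 1).detach()

# Adversarial training, one defense mentioned in the lecture, folds examples
# like `adversarial` back into the training set so the model learns to resist them.
```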
Cornell Notes
The lecture argues that deep learning’s 2012 leap came from learning representations directly from data instead of hand-crafting features for images. Large neural networks (tens of millions of parameters) can approximate complex functions, but they require massive labeled datasets to learn the right mapping from inputs to outputs. The same approach generalizes beyond classification to captioning and image question answering, and it has moved into high-stakes areas like medical decision support, including an FDA-approved neural network for retinal referral decisions. Scaling data and compute—along with practical training via backpropagation—helped make these models trainable and improved performance rapidly. Robustness remains an active challenge, with adversarial examples showing that carefully targeted small perturbations can fool models even when random noise usually won’t.
- What changed in 2012 that made deep learning outperform traditional computer vision pipelines?
- Why does the lecture connect neural network success to data scale?
- How does the lecture extend the argument beyond image classification?
- What real-world domains are highlighted as benefiting from deep learning?
- What are adversarial examples, and why do they work even when random noise doesn’t?
- How are adversarial examples generated and how do defenses respond?
Review Questions
- What role do learned weights play in replacing hand-crafted image features, and how does training determine those weights?
- Why does the lecture claim that scaling data and compute after 2012 was essential rather than optional?
- How do adversarial examples differ from random noise, and what defense strategy is described to improve robustness?
Key Points
1. Deep learning’s 2012 shift replaced hand-crafted image features with learned representations trained end-to-end from labeled data.
2. Neural network weights act like a discovered program; backpropagation adjusts millions of parameters to reduce prediction errors across large datasets.
3. ImageNet results illustrate a major performance jump in 2012 followed by faster progress, attributed to representation learning at scale.
4. The input-to-output learning paradigm generalizes from classification to captioning and image question answering using paired datasets.
5. Medical applications are moving from benchmarks to regulated decision support, including an FDA-approved neural network for retinal referral decisions (as of April 2018).
6. Deep learning’s surge is linked to three scaling factors: more data, more compute, and training methods that made large models practical.
7. Adversarial examples exploit carefully optimized small perturbations; defenses such as adversarial training incorporate those perturbations into training to improve robustness.