
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)

The Full Stack · 7 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

arXiv’s machine learning paper volume is high enough that individuals need curation strategies; reading everything is not feasible.

Briefing

Deep learning research is shifting from “interesting ideas” to “rapidly deployable tools,” and the lecture’s through-line is that the fastest progress now comes from learning setups that reduce human labeling and from scaling compute and data—often in ways that let one model transfer across many tasks. The talk starts by quantifying the problem: arXiv posts thousands of machine learning and AI papers per month, with the curve still rising, making it impossible for any individual to read everything. That flood forces a strategy: sample research directions, look for shared themes, and use high-bandwidth learning methods like tutorials and curated paper lists.

The first major research direction is unsupervised learning—once mostly a research-only lane, now increasingly practical. The lecture contrasts supervised learning’s dependence on labeled data with two ways to loosen that constraint. In semi-supervised learning, labels propagate from a small set of annotated examples into nearby unlabeled points (the intuition is “close points share labels”), and the noisy student approach improves results by training a teacher on labeled data, generating pseudo-labels for unlabeled data, then retraining a student on those pseudo-labels while injecting additional noise via dropout, data augmentation, and stochastic depth. A key limitation is distribution mismatch: pseudo-labeling assumes unlabeled data comes from roughly the same distribution as the labeled set.

To remove that assumption, the talk highlights unsupervised pretraining with a shared trunk and two heads: one head learns an auxiliary task without human annotation (e.g., next-token prediction, denoising, or grayscale-to-color), while the other head is fine-tuned on the task that actually matters. This framing is used to explain why large language models became a default for NLP. GPT-2 is presented as a landmark: trained on massive text to predict the next token, then fine-tuned on supervised benchmarks, it achieved new state-of-the-art results across many tasks with a single general model. Scaling up parameters consistently improved performance, and the lecture links this to a broader industry push for more compute, noting that heavy research investment made large-scale training runs possible and that GPT-3, trained at far larger scale, outperformed GPT-2.
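The shared-trunk pattern can be sketched in miniature. Below is a pure-Python toy (not from the lecture): a hypothetical one-parameter "trunk" is pretrained on next-token prediction, where the labels come for free from the sequence itself; a supervised head would then be fine-tuned on top of the learned trunk with a small labeled set.

```python
def trunk(x, w):
    """Shared representation: a scaled input, standing in for a deep network."""
    return [w * xi for xi in x]

def next_token_loss(seq, w):
    """Auxiliary head: predict each next token from the current one (free labels)."""
    preds = trunk(seq[:-1], w)
    return sum((p - t) ** 2 for p, t in zip(preds, seq[1:])) / (len(seq) - 1)

def pretrain(seq, steps=200, lr=0.01):
    """Fit the trunk weight by finite-difference gradient descent on the auxiliary loss."""
    w, eps = 0.0, 1e-4
    for _ in range(steps):
        g = (next_token_loss(seq, w + eps) - next_token_loss(seq, w - eps)) / (2 * eps)
        w -= lr * g
    return w

# Toy corpus where each token is 2x the previous: pretraining should recover w close to 2.
w = pretrain([1, 2, 4, 8, 16])
# A supervised head would now train on top of trunk(x, w) using few labeled examples.
```

The point of the sketch is only the division of labor: the auxiliary loss shapes the shared trunk without any human annotation, and the downstream task reuses it.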

The same “pretrain then adapt” logic is extended to vision. Because predicting raw pixels is combinatorially hard, vision researchers use proxy tasks: jigsaw puzzles, rotation prediction, and—most prominently—contrastive learning (SimCLR and MoCo). In contrastive learning, two augmented views of the same image are treated as positives while other images act as negatives; fine-tuning can then use only a linear classifier. The lecture emphasizes that, with enough scale, unsupervised vision features can match fully supervised performance and often improve with larger models.

Next comes reinforcement learning, where an agent acts in an environment and learns from delayed reward, raising two central challenges: credit assignment (which actions caused success) and stability during trial-and-error exploration. The lecture surveys major milestones (Atari with DeepMind's DQN, AlphaGo, AlphaGo Zero/AlphaZero) and robotics successes, including learned locomotion and manipulation. It then tackles a practical bottleneck: learning from pixels is far slower than learning from the underlying state. A proposed bridge uses contrastive unsupervised learning inside RL pipelines to recover state-like representations from images, enabling image-based RL to match state-based learning in many settings.

Finally, the lecture broadens beyond standard ML tasks into robotics and science. In robotics, it highlights meta-learning for faster adaptation, imitation learning for stronger supervision signals, and—especially—simulation-to-real transfer via domain randomization and domain adaptation. The lecture argues that training in simulation with many randomized variations can generalize surprisingly well, even when the simulator is not realistic. In science and engineering, it points to DeepMind’s AlphaFold and AlphaFold 2 as headline examples of deep learning predicting protein structure from sequence, plus related ideas like speeding design with learned surrogates and using generative models for physics and differential equations.
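The domain-randomization idea can be sketched as sampling a fresh simulator variation for every training episode; the parameter names and ranges below are purely illustrative, not taken from the lecture.

```python
import random

def randomized_sim_params(rng):
    """Sample one simulator variation (hypothetical parameters and ranges).
    Training a policy across many such draws makes the real world look like
    'just another variation' at deployment time."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "mass": rng.uniform(0.8, 1.2),
        "camera_jitter": rng.uniform(0.0, 0.05),
    }

rng = random.Random(0)
# One randomized configuration per training episode.
episodes = [randomized_sim_params(rng) for _ in range(1000)]
```

The design choice is that realism is traded for coverage: no single draw matches reality, but the distribution of draws is wide enough to contain it.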

The closing advice is pragmatic: keep up without reading thousands of papers by relying on conference tutorials, graduate courses, and curated explainers (YouTube channels, newsletters, and tools like arXiv Sanity). When reading papers, prioritize those referenced in tutorials or widely saved and discussed, skim first to decide what deserves a close read, and consider forming reading groups. The lecture ends with a career perspective: a PhD is no longer required to do impactful AI work; it is mainly justified for those aiming to become deep technical experts who build new tools rather than apply existing ones.

Cornell Notes

The lecture argues that AI progress increasingly depends on scalable learning strategies that reduce human labeling and on using more data and compute to make general pretrained models transferable. Unsupervised and self-supervised learning are highlighted as the fastest-moving research-to-practice area: semi-supervised label propagation and “noisy student” improve accuracy, while modern unsupervised pretraining (e.g., next-token prediction in GPT-2) enables one model to be fine-tuned across many supervised tasks. Vision adapts the same idea using proxy objectives and contrastive learning (SimCLR, MoCo), where fine-tuning can approach fully supervised results. Reinforcement learning is framed around delayed reward and credit assignment, with a key bottleneck that pixel-based learning is much slower than state-based learning; contrastive representation learning can close much of that gap. The talk concludes with practical ways to keep up—tutorials, courses, curated explainers—and with career guidance on when a PhD is worth it.

Why did unsupervised learning shift from “pure research” to something that quickly becomes practice?

The lecture ties the shift to methods that make unlabeled data useful at scale. Semi-supervised learning propagates labels from a small labeled set into nearby unlabeled examples, then retrains with noise injection (the noisy student approach). Unsupervised learning goes further by training a shared network trunk on an auxiliary task that needs no human annotation—like next-token prediction, denoising, or grayscale-to-color—and then fine-tuning a supervised head with a small labeled dataset. This reduces reliance on expensive annotation and leverages the massive volume of raw data available on the internet.

How does the noisy student method work, and what intuition connects it to label propagation?

A teacher model is trained on labeled data, then used to generate pseudo-labels for unlabeled data. Those pseudo-labels aren’t perfect, so the method relies on confidence: the model is more reliable near labeled examples and less reliable elsewhere. The student model is retrained using the pseudo-labeled data, with additional regularization noise such as dropout, data augmentation, and stochastic depth. The lecture’s toy example mirrors this: labels “grow out” from labeled points through dense regions while avoiding propagation across sparse or ambiguous gaps.
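A toy version of this teacher-student loop can be written in pure Python, with a 1-nearest-neighbour rule standing in for the teacher network and Gaussian input noise standing in for dropout and augmentation; the names and data here are illustrative, not from the lecture.

```python
import random

def nearest_label(x, labeled):
    """Teacher: 1-nearest-neighbour pseudo-labeler over (point, label) pairs."""
    return min(labeled, key=lambda pl: abs(pl[0] - x))[1]

def noisy_student_round(labeled, unlabeled, noise=0.1, seed=0):
    """One teacher-to-student round: pseudo-label the unlabeled points, then
    inject input noise (a stand-in for dropout / data augmentation / stochastic
    depth) and return the enlarged training set for the student."""
    rng = random.Random(seed)
    pseudo = [(x, nearest_label(x, labeled)) for x in unlabeled]
    return [(x + rng.gauss(0, noise), y) for x, y in labeled + pseudo]

# Two labeled clusters around 0 and 10; labels "grow" into the dense
# unlabeled regions near each cluster, mirroring the lecture's toy example.
labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [0.5, 1.0, 9.0, 9.5]
student_data = noisy_student_round(labeled, unlabeled)
```

In the real method the teacher and student are deep networks and the round is iterated, but the data flow is the same: labeled set in, pseudo-labeled plus noised set out.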

What is the key assumption behind semi-supervised label propagation, and why does unsupervised pretraining avoid it?

Semi-supervised pseudo-labeling assumes the unlabeled data comes from roughly the same distribution as the labeled data; otherwise, pseudo-labels can become meaningless (e.g., mixing animal images with unrelated categories like people or street scenes). Unsupervised pretraining avoids this specific assumption by learning general representations from the structure of the data itself (predicting next tokens, reconstructing masked content, etc.) and then adapting to the target task via fine-tuning.

Why does contrastive learning matter for vision, and what does it train the model to do?

Vision proxy tasks like predicting rotations or jigsaw order can teach useful features, but contrastive learning is emphasized as a widely used approach. Using SimCLR-style or MoCo-style setups, the training pipeline creates two augmented views of the same image (e.g., grayscale vs. color, crops) and treats them as positives that should map close together in embedding space. Different images act as negatives and are pushed apart. After this unsupervised training, fine-tuning can be as simple as training a linear classifier on top, often matching fully supervised performance when scaled.
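The "pull positives together, push negatives apart" objective can be written as an InfoNCE-style loss over cosine similarities; the 2-D embeddings below are a minimal illustration, assuming the encoder has already mapped images to vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: the negative log-softmax of the positive pair's
    similarity. Low when the anchor sits near its augmented view and far
    from the negatives in embedding space."""
    logits = [cosine(anchor, positive) / temperature] + [
        cosine(anchor, n) / temperature for n in negatives
    ]
    m = max(logits)  # stabilized log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

# Two views of the "same image" point the same way; a mismatched pair costs more.
good = info_nce([1, 0], [0.9, 0.1], [[-1, 0], [0, 1]])
bad = info_nce([1, 0], [0, 1], [[0.9, 0.1], [-1, 0]])
```

Minimizing this loss over many batches is what makes a plain linear classifier sufficient at fine-tuning time: the embedding already separates the classes.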

What makes reinforcement learning harder than supervised learning?

In supervised learning, the correct output is provided for each input, so learning is immediate. In reinforcement learning, the agent takes actions and only receives reward later, which creates the credit assignment problem: it’s difficult to determine which earlier actions caused eventual success or failure. Reinforcement learning also faces stability issues because learning from trial and error can destabilize training, and it must include exploration—trying actions it hasn’t tried before—to discover better strategies.
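One standard way to spread a delayed reward back over the actions that preceded it, and thus a crude handle on credit assignment, is the discounted return; the sketch below is generic RL bookkeeping, not a method specific to the lecture.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_{t+1} for each step, walking backwards.
    Earlier actions receive exponentially discounted credit for later reward."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Reward arrives only at the final step, yet every earlier step gets credit.
rets = discounted_returns([0, 0, 0, 1.0])
```

Discounting answers "which actions mattered" only statistically, which is part of why RL needs many trials where supervised learning needs one pass.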

How does contrastive unsupervised learning speed up reinforcement learning from pixels?

The lecture describes a gap: learning from images is often about 100× slower than learning from underlying state in simulators. The proposed fix adds a contrastive head to the RL encoder, training it so that augmented observations from the same underlying state are close in representation space while different states are far apart. Random crop is highlighted as an effective augmentation for generating variance. With this representation learning, image-based RL can match state-based learning curves in many environments and outperform prior pixel-based RL methods (though not universally).
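The random-crop augmentation can be sketched on a toy 2-D observation: two independent crops of the same frame form the positive pair fed to the contrastive head. The frame and sizes below are illustrative.

```python
import random

def random_crop(frame, size, rng):
    """Crop a size-by-size window at a random offset from a 2-D 'frame'
    (a list of rows) -- the augmentation highlighted for contrastive RL."""
    h, w = len(frame), len(frame[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [row[left:left + size] for row in frame[top:top + size]]

def two_views(frame, size, seed=0):
    """Two independent random crops of the same observation: both come from
    the same underlying state, so they form the contrastive positive pair."""
    rng = random.Random(seed)
    return random_crop(frame, size, rng), random_crop(frame, size, rng)

# A toy 4x4 "observation"; each 3x3 view sees most, but not all, of it.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
view_a, view_b = two_views(frame, 3)
```

Because the two views differ only in framing, an encoder that maps them to nearby embeddings is forced to encode the state, not the pixel layout.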

Review Questions

  1. Which parts of semi-supervised learning rely on distribution alignment, and how does unsupervised pretraining change the assumptions?
  2. Explain the credit assignment problem in reinforcement learning and give an example of why delayed reward makes it difficult.
  3. What role does contrastive learning play in bridging the pixel-vs-state learning gap for reinforcement learning?

Key Points

  1. arXiv’s machine learning paper volume is high enough that individuals need curation strategies; reading everything is not feasible.

  2. Semi-supervised learning can improve accuracy by propagating labels from a small labeled set into dense regions of unlabeled data, then retraining with noise (noisy student).

  3. Noisy student depends on pseudo-label confidence and on the unlabeled data matching the labeled distribution; distribution mismatch limits its reliability.

  4. Unsupervised pretraining with a shared trunk plus a supervised fine-tuning head enables one general model to transfer across many supervised tasks, with scaling often improving results.

  5. In vision, contrastive learning (SimCLR/MoCo) uses augmented views as positives and other images as negatives, producing reusable features that can approach supervised performance after fine-tuning.

  6. Reinforcement learning’s core difficulties are delayed reward (credit assignment) and training stability during exploration; pixel-based RL is often much slower than state-based RL.

  7. Simulation-to-real progress often comes from domain randomization and domain adaptation rather than perfect realism, and science applications like AlphaFold show deep learning’s reach beyond classic ML benchmarks.

Highlights

Semi-supervised learning can work by “growing” labels from a few annotated points into nearby dense regions, then retraining a model with injected noise to improve robustness.
GPT-2 is framed as a turning point: a single next-token-pretrained model, fine-tuned per task, beat prior specialized approaches across many NLP benchmarks.
Contrastive learning trains vision models by pulling together embeddings of two augmented views of the same image and pushing apart embeddings from different images, enabling linear-probe fine-tuning.
In reinforcement learning, learning from pixels can be dramatically slower than learning from state; contrastive representation learning can largely close that gap in many environments.
Domain randomization in robotics can produce real-world competence even when training never sees real images, by training on many randomized simulator variations.

Topics

Mentioned

  • Yannic Kilcher
  • Andrej Karpathy
  • Jack Clark
  • Sergey Brin
  • Josh Tenenbaum
  • Geoff Hinton
  • Kaiming He
  • Ashley (OpenAI robotics example)
  • Jason Pang
  • Ken Hao
  • Jang
  • Rocky Duan
  • Peter Chen
  • Brett (Berkeley robot example)
  • DeepMind
  • The Batch (Andrew Ng’s newsletter)
  • AI
  • NLP
  • RL
  • DQN
  • TRPO
  • A3C
  • DDPG
  • PPO
  • DDQN
  • SimCLR
  • MoCo
  • GPT-2
  • GPT-3
  • BERT
  • CASP