
Jeremy Howard on Platform.ai and Fast.ai (Full Stack Deep Learning - March 2019)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Augmented machine learning—pairing human perception and judgment with computer speed—can outperform fully automated AutoML for most near-term tasks.

Briefing

Jeremy Howard argues that “augmented machine learning”—tight human–computer collaboration—beats fully automated ML pipelines for most practical problems, and that the fastest path to strong results often comes from combining what humans do well (rapid perception, similarity/difference judgment, and targeted labeling) with what computers do well (speed, memory, and large-scale optimization). He frames this as a direct challenge to AutoML’s goal of minimizing human involvement, saying that until computers become better than humans at everything (a far-off scenario), the best systems will keep humans in the loop.

Howard illustrates the point with Platform.ai, a workflow designed to make labeling and dataset building dramatically more efficient. The system starts with unlabeled or minimally labeled images (cars, faces, and other categories) and uses interactive projections to let humans quickly find clusters and outliers. Humans rapidly identify where examples look similar, then zoom in to spot differences—turning perception research into a practical labeling strategy. When the model struggles (for example, distinguishing car fronts from backs), the workflow adapts: humans provide a few examples, and the system generates “find similar” results using a pretrained ImageNet model. The interaction becomes a kind of visual dialogue—humans indicate what they’re trying to separate, while the model proposes projections that maximize the difference between chosen groups. With surprisingly few labels (on the order of hundreds), the system trains a classifier that reaches high accuracy (he cites 92% in the car example) and produces better embeddings after each round.
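
The talk shows the interface rather than code, but the core "find similar" step can be sketched as: embed every image with a pretrained ImageNet backbone, then rank the unlabeled pool by cosine similarity to the human-chosen examples. A minimal PyTorch sketch under those assumptions (the resnet34 backbone and both helper functions are illustrative, not Platform.ai's actual implementation):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Pretrained ImageNet backbone with the classifier head removed, so the
# forward pass yields penultimate-layer embeddings. resnet34 is an
# illustrative choice; the talk does not name Platform.ai's backbone.
backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def embed(batch: torch.Tensor) -> torch.Tensor:
    """(N, 3, 224, 224) normalized images -> (N, 512) unit-norm embeddings."""
    return F.normalize(backbone(batch), dim=1)

def rank_similar(chosen: torch.Tensor, pool: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Indices of the k pool images closest to the mean of the chosen examples."""
    center = F.normalize(chosen.mean(dim=0, keepdim=True), dim=1)
    sims = (pool @ center.T).squeeze(1)  # cosine similarity: embeddings are unit norm
    return sims.topk(k).indices
```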

He extends the same idea beyond labeling. In a face dataset, the first projection already separates men and women well enough that bulk selection becomes feasible; later projections reveal additional structure (like sunglasses as an emergent category). The payoff is iterative: improved models yield better projections, which speed up the next labeling cycle. Howard also connects this approach to broader research on “human-augmented” training, emphasizing that studying human strengths—especially perception—can lead to better ML systems than trying to remove humans entirely.

From there, the conversation shifts to how data scientists can get strong results quickly without massive compute or exhaustive hyperparameter searches. Howard highlights research from Fast.ai’s ecosystem and its fellowship program, where the central theme is “spend a little human time instead of a lot of GPU time.” Examples include using a learning-rate finder (a quick procedure that identifies a good learning rate by running short experiments) and findings that, for transfer learning, default hyperparameters often work nearly as well as elaborate tuning. He describes classroom outcomes where beginners frequently reach near-perfect validation accuracy with only 100–200 images after using transfer learning and sensible defaults.
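
As a concrete sketch of that "human time instead of GPU time" loop, here is roughly what the learning-rate finder looks like in fastai (the v2-style API shown postdates the talk; the dataset path and epoch count are placeholders):

```python
from fastai.vision.all import *

# Small labeled image dataset, one folder per class; the path is a placeholder.
dls = ImageDataLoaders.from_folder(Path("data/cars"), valid_pct=0.2,
                                   item_tfms=Resize(224))

# Transfer learning from an ImageNet-pretrained backbone with fastai defaults.
learn = vision_learner(dls, resnet34, metrics=accuracy)

# Short sweep over learning rates on a few mini-batches; pick a value where
# the loss is still dropping steeply, then train with it.
lr = learn.lr_find().valley
learn.fine_tune(3, base_lr=lr)
```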

He then stacks additional practical accelerators: test-time augmentation (TTA) to average predictions over multiple inference-time transforms; progressive resizing to train first on smaller images and later on larger ones (speeding training while improving generalization); and “one-cycle” learning rate schedules paired with momentum changes to train faster and more reliably. Howard also discusses training reliability tricks for transfer learning (training newly initialized layers more aggressively), and optimizer details such as correct decoupled weight decay (AdamW) and ways to prevent Adam instability via gradient clipping or adjusting epsilon. The overall message is consistent: strong ML results come from combining smart defaults, efficient training heuristics, and deliberate human–computer interaction rather than brute-force automation or compute-heavy search.
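
The reliability trick for transfer learning maps onto the standard fastai pattern: train only the newly initialized head with the body frozen, then unfreeze and continue with lower, layer-dependent learning rates. A hedged sketch, with paths and learning rates chosen for illustration:

```python
from fastai.vision.all import *

dls = ImageDataLoaders.from_folder(Path("data/cars"), valid_pct=0.2,
                                   item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=accuracy)

# Phase 1: body frozen, so only the randomly initialized head trains.
learn.freeze()
learn.fit_one_cycle(1, lr_max=1e-3)

# Phase 2: unfreeze and use discriminative learning rates: the pretrained
# early layers take small steps while the new head keeps a larger one.
learn.unfreeze()
learn.fit_one_cycle(3, lr_max=slice(1e-5, 1e-3))
```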

Cornell Notes

Jeremy Howard argues that “augmented machine learning” outperforms fully automated ML because humans excel at perception tasks like spotting similarities/differences and making targeted judgments, while computers excel at speed and optimization. He demonstrates this with Platform.ai, an interactive labeling and training workflow that uses visual projections, “find similar,” and difference-maximizing views to separate classes (cars, faces) with only a few hundred labels, reaching high accuracy quickly. He then broadens the theme to practical training: transfer learning with strong defaults often beats massive hyperparameter grids, and quick techniques like learning-rate finding, test-time augmentation (TTA), progressive resizing, and one-cycle schedules can yield state-of-the-art results on a single GPU. The key takeaway is that small, human-guided steps plus well-chosen training heuristics can replace expensive trial-and-error compute.

Why does Howard say AutoML is the wrong target, and what alternative does he propose?

He criticizes AutoML’s push to automate ML with little or no human involvement, arguing that humans remain valuable because they do certain tasks better than machines—especially perception and judgment. The proposed alternative is “augmented machine learning,” where humans and computers work together. Howard claims the combined approach can outperform AutoML and will keep winning until computers become better than humans at everything, which he treats as a long-term prospect.

How does Platform.ai speed up labeling compared with traditional annotation workflows?

Platform.ai uses interactive visual projections so humans can quickly identify regions where images look similar, then zoom in to find differences. After initial labels, the system supports actions like “find similar” using a pretrained ImageNet model, which helps humans rapidly expand a labeled set. When separation is hard (e.g., car fronts vs backs), humans provide examples and the system generates projections that maximize the difference between chosen groups, turning labeling into an iterative human–computer dialogue.
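
The talk does not specify how these projections are computed. One simple way to get a "maximize the difference between chosen groups" axis, shown purely as an illustrative assumption rather than Platform.ai's actual method, is to project embeddings onto the direction between the two groups' mean embeddings:

```python
import torch
import torch.nn.functional as F

def difference_axis(group_a: torch.Tensor, group_b: torch.Tensor) -> torch.Tensor:
    """Unit vector from the mean embedding of group A toward that of group B."""
    return F.normalize(group_b.mean(dim=0) - group_a.mean(dim=0), dim=0)

def project_for_display(pool: torch.Tensor, axis: torch.Tensor) -> torch.Tensor:
    """1-D coordinate of every pool embedding along the difference axis;
    'front-like' and 'back-like' images land at opposite ends of the view."""
    return pool @ axis
```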

What role do pretrained ImageNet models play in the car and face examples?

The workflow starts with a pretrained ImageNet model to generate embeddings and projections. Howard notes that ImageNet models are often trained to be position invariant, which can make certain distinctions difficult at first (like fronts vs backs). The system compensates by using a small number of human-provided examples to steer the next rounds of projections and training, improving separability over iterations.

What does Howard claim about hyperparameter search for transfer learning?

He argues that massive grid searches are often unnecessary for transfer learning. He cites a study that fine-tuned pretrained ImageNet networks on a range of target datasets (comparing penultimate-layer activations to the target data via cosine proximity) and ran extensive hyperparameter searches; fastai's default hyperparameters plus a learning-rate finder came close to the searched optimum on most datasets. Extensive tuning yielded only small gains (about 1.7% on one dataset), suggesting that defaults plus a quick learning-rate check frequently work nearly as well as exhaustive search.

Which training heuristics does Howard highlight for speed and reliability, and what do they do?

He highlights several: (1) the learning-rate finder, a short sweep that identifies a good learning rate quickly; (2) test-time augmentation (TTA), which applies multiple inference-time augmentations and averages the predictions; (3) progressive resizing, which trains on smaller images first (e.g., 64×64) and then moves to larger sizes (e.g., 128, then 224) to speed training and reduce overfitting; and (4) one-cycle scheduling, which raises the learning rate over roughly the first 30% of training and lowers it over the remaining 70%, while momentum moves in the opposite direction, to train faster and more stably.
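
A hedged fastai-style sketch of (2) and (3), rebuilding the DataLoaders at increasing image sizes and finishing with averaged test-time augmentation (the dataset path, sizes, and epoch counts are illustrative):

```python
from fastai.vision.all import *

path = Path("data/cars")  # placeholder dataset, one folder per class

def make_dls(size):
    return ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(size))

# Progressive resizing: most epochs at small, cheap sizes, then a few at full size.
learn = vision_learner(make_dls(64), resnet34, metrics=accuracy)
learn.fine_tune(3)
for size in (128, 224):
    learn.dls = make_dls(size)  # same weights, larger inputs
    learn.fine_tune(2)

# Test-time augmentation: average predictions over several augmented views.
preds, targets = learn.tta()
```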

What optimizer-related issues does Howard mention for Adam/AdamW, and how can they be mitigated?

He emphasizes that AdamW (decoupled weight decay) can work extremely well when weight decay is implemented correctly, contrasting it with common mis-implementations that fold the decay into the gradient. He also warns about instability late in long Adam training runs and suggests fixes such as gradient clipping or increasing epsilon (eps), which bounds the step size when the moving average of the squared gradient becomes very small. He ties this epsilon behavior to numerical-stability notes in library documentation and to prior training experience.
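
In PyTorch terms, these mitigations correspond to using the decoupled optimizer directly, clipping gradients inside the training loop, and raising Adam's eps; a minimal sketch with illustrative values:

```python
import torch

model = torch.nn.Linear(512, 10)  # stand-in model

# AdamW decays the weights directly (decoupled weight decay) instead of
# folding an L2 penalty into the gradient, the common mis-implementation.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2,
                        eps=1e-5)  # raising eps above the 1e-8 default bounds the
                                   # step when the squared-gradient average is tiny

def training_step(x, y, loss_fn):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Gradient clipping: cap the update even when one batch yields a huge gradient.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    return loss.item()
```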

Review Questions

  1. In Howard’s Platform.ai workflow, what specific human actions trigger the system to generate new projections or retrieve similar images?
  2. Why does progressive resizing both speed up training and often improve results, according to Howard?
  3. What evidence does Howard cite that default hyperparameters plus a learning-rate finder can outperform or match expensive hyperparameter grids in transfer learning?

Key Points

  1. Augmented machine learning—pairing human perception and judgment with computer speed—can outperform fully automated AutoML for most near-term tasks.
  2. Platform.ai accelerates dataset creation by using interactive visual projections that let humans rapidly find similarity regions, zoom into differences, and iteratively refine separation.
  3. A small number of human labels (hundreds) can drive large improvements when the system uses pretrained ImageNet embeddings and then retrains after each interaction cycle.
  4. For transfer learning, large hyperparameter grid searches are often unnecessary because defaults plus a learning-rate finder frequently land near optimal performance.
  5. Practical speedups include test-time augmentation (TTA), progressive resizing (train small then scale up), and one-cycle learning rate schedules with coordinated momentum changes.
  6. Training reliability improves when newly initialized layers are trained more aggressively (via freezing or layer-specific learning rates) and when optimizers like AdamW are implemented correctly.
  7. Adam-related instability can be mitigated using gradient clipping or adjusting epsilon (eps), especially during long training runs.

Highlights

  • Platform.ai turns labeling into an iterative visual dialogue: humans select clusters or examples, and the system generates projections that maximize the difference between the groups humans want to separate.
  • Howard claims transfer learning often needs little tuning: default fastai settings plus a learning-rate finder can match or nearly match the gains from extensive hyperparameter searches.
  • Progressive resizing can make training several times faster while improving generalization by introducing a form of augmentation that’s hard to overfit.
  • One-cycle scheduling pairs learning-rate warmup/cooldown with momentum reversal, enabling faster training (sometimes 3–4×) without sacrificing reliability.
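
PyTorch's built-in scheduler implements exactly this pairing: pct_start=0.3 gives the roughly 30% warmup / 70% cooldown split described above, and momentum is cycled inversely to the learning rate. A minimal sketch with illustrative hyperparameters:

```python
import torch

model = torch.nn.Linear(512, 10)  # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

steps_per_epoch, epochs = 100, 10  # illustrative sizes
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.1,
    total_steps=steps_per_epoch * epochs,
    pct_start=0.3,                          # ~30% of steps ramp the LR up...
    cycle_momentum=True,                    # ...while momentum ramps down (0.95 -> 0.85);
    base_momentum=0.85, max_momentum=0.95,  # both then reverse for the cooldown phase
)

# In the training loop, step the scheduler after each optimizer step:
#     opt.step(); sched.step()
```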

Topics

  • Augmented Machine Learning
  • Human-in-the-Loop Labeling
  • Transfer Learning Defaults
  • Learning Rate Finder
  • Progressive Resizing
  • One-Cycle Training
  • Test-Time Augmentation
  • AdamW Training
