
Lecture 06: Continual Learning (FSDL 2022)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Continual learning should be implemented as a structured retraining strategy with explicit rules for logging, curation, triggers, dataset formation, offline testing, and online testing.

Briefing

Continual learning in production is less about “retraining whenever something feels off” and more about running a structured retraining strategy that can adapt to a continuous stream of real-world data. The core problem is that production models rarely have reliable, end-to-end measurement: teams often rely on spot checks until user complaints or business metrics dip, at which point fixes become ad hoc—SQL queries, notebooks, emergency retraining, and another round of limited validation. The lecture argues that this is why continual learning remains one of the least mature parts of the production ML lifecycle, and it proposes a clearer way to design it: treat retraining as an outer loop with explicit rules for logging, data curation, retraining triggers, dataset formation, offline testing, and online testing.

At the center is the idea of a “retraining strategy,” a bundle of decisions that governs every stage of the loop. Logging determines what data from an infinite stream gets stored for later analysis. Curation decides which unlabeled (or weakly labeled) production data gets prioritized for labeling and potential retraining, producing a finite reservoir of candidate training points. Retraining triggers specify when the system should retrain. Dataset formation selects which subset of the reservoir feeds a particular training job. Offline testing defines what “good enough” means for stakeholders, typically via a sign-off report comparing the candidate model to the previous one across key metrics and slices. Online testing then validates that the deployment succeeded—often via shadow mode, A/B tests, gradual rollout, and rollback if needed.
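
To make this concrete, here is a minimal sketch of how the six stages might be bundled into one explicit, tunable object. The class and field names (RetrainingStrategy, log_sample_rate, and so on) are illustrative assumptions, not an API from the lecture.

```python
from dataclasses import dataclass


@dataclass
class RetrainingStrategy:
    """Illustrative bundle of rules governing the continual learning loop."""
    # Logging: what fraction of the production stream gets persisted.
    log_sample_rate: float = 1.0
    # Curation: how unlabeled data is prioritized for labeling.
    curation_policy: str = "uniform"          # e.g. "uniform", "uncertainty", "cohort"
    labeling_budget_per_week: int = 10_000
    # Retraining trigger: when to kick off a training job.
    retrain_cadence_days: int = 7
    # Dataset formation: which slice of the labeled reservoir to train on.
    training_window_days: int = 30
    # Offline testing: sign-off thresholds vs. the previous model.
    min_overall_accuracy: float = 0.90
    per_slice_regression_tolerance: float = 0.02
    # Online testing: how a candidate model reaches full traffic.
    rollout_stages: tuple = ("shadow", "a_b_test", "gradual", "full")
```

Framed this way, improving continual learning means adjusting these values and policies rather than rewriting the pipeline each time something breaks.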

Once the first model ships, the lecture reframes the ML engineer’s job: not constant model retraining, but “babysitting” and improving the retraining strategy itself using monitoring and observability. That means tuning the rules that decide what to log, what to sample, what to test, and when to retrain—based on signals that indicate whether the system is improving over time despite changes in the world.

A baseline strategy is periodic retraining: log everything (or as much as feasible), sample uniformly for labeling up to capacity, retrain on a fixed cadence (e.g., weekly using the last month of data), and validate via offline accuracy thresholds plus manual spot checks. But periodic retraining can fail when data volume exceeds logging/labeling capacity (especially with long-tail edge cases or expensive human-in-the-loop labeling), when retraining costs are high, or when the cost of bad predictions makes frequent retraining too risky.
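
As a sketch of that baseline, a periodic trigger only needs a clock and a data window; no performance or drift signal is involved. The helper names and dates below are illustrative.

```python
from datetime import datetime, timedelta

RETRAIN_CADENCE = timedelta(days=7)   # retrain weekly
TRAINING_WINDOW = timedelta(days=30)  # train on the last month of labeled data


def should_retrain(last_trained_at: datetime, now: datetime) -> bool:
    """Purely time-based trigger: retrain whenever the cadence has elapsed."""
    return now - last_trained_at >= RETRAIN_CADENCE


def training_window(now: datetime) -> tuple:
    """Dataset formation rule for the baseline: a fixed sliding window."""
    return now - TRAINING_WINDOW, now


if __name__ == "__main__":
    now = datetime(2022, 9, 1)
    last = datetime(2022, 8, 20)
    if should_retrain(last, now):
        start, end = training_window(now)
        print(f"Retrain on data from {start.date()} to {end.date()}")
```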

To iterate beyond the baseline, the lecture emphasizes monitoring signals ranked by usefulness: user outcome/feedback first, then model performance metrics, then proxy metrics (domain-specific indicators correlated with failure), followed by data quality and distribution drift. Distribution drift is important for debugging but ranked lower because models can be robust to shifts in input distributions; drift alerts don’t necessarily indicate degraded performance. The monitoring mindset borrows from observability: measure “known unknowns” with key metrics and enable “unknown unknowns” by retaining raw context for deep dives.

Finally, the lecture connects monitoring to data curation and retraining: the same tools—projections, uncertainty, cohort analysis, and feedback loops—help decide what data to label and what to test. It closes with a concrete workflow: an alert (e.g., user feedback worsens), subgroup investigation (e.g., new users), error analysis (e.g., emojis), strategy updates (new cohorts, new projections, new test cases), retraining, and a new model sign-off—turning continual learning into a repeatable improvement loop rather than a cycle of surprises.

Cornell Notes

Continual learning is framed as an outer loop around a production model: log production data, curate it into a labeled reservoir, decide when to retrain, form datasets for each training run, validate candidates offline, and confirm success online before rollout. The lecture’s key move is defining a “retraining strategy” as explicit rules for each stage, then using monitoring and observability to tune those rules over time. User feedback is treated as the most valuable signal, with model metrics and proxy metrics following when direct outcomes aren’t available. Distribution drift is useful for debugging but not sufficient on its own because models can remain accurate even when input distributions shift. The practical recommendation is to start simple (periodic retraining) and add automation and smarter sampling only after measurement and evaluation become reliable.

What does a “retraining strategy” include, and why is it central to continual learning?

A retraining strategy is the set of rules that governs the continual learning loop end-to-end. It specifies: (1) what production data gets logged for later use, (2) how unlabeled data is curated and prioritized for labeling to build a finite labeled reservoir, (3) what retraining triggers decide when to start a training job, (4) which subset of the reservoir forms the dataset for that job, (5) what offline testing and sign-off criteria define “good enough” across metrics and slices, and (6) what online testing signals confirm deployment success (e.g., shadow mode, A/B tests, gradual rollout, rollback). Treating these as tunable rules turns continual learning from ad hoc retraining into a controllable system.
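
As an illustration of the offline-testing rule (item 5), a sign-off check could compare candidate and previous metrics slice by slice. The data layout and regression tolerance below are assumptions made for the sketch, not the lecture's exact report format.

```python
def sign_off(candidate: dict, previous: dict, max_regression: float = 0.02) -> dict:
    """Compare per-slice metrics of a candidate model against the previous one.

    `candidate` and `previous` map slice names (e.g. "all", "new_users") to a
    metric such as accuracy. The candidate passes only if no slice regresses
    by more than `max_regression`.
    """
    report = {}
    for slice_name, prev_score in previous.items():
        cand_score = candidate.get(slice_name, 0.0)
        report[slice_name] = {
            "previous": prev_score,
            "candidate": cand_score,
            "passed": cand_score >= prev_score - max_regression,
        }
    overall = all(row["passed"] for row in report.values())
    report["overall_passed"] = overall
    return report


print(sign_off(
    candidate={"all": 0.91, "new_users": 0.84},
    previous={"all": 0.90, "new_users": 0.88},
))  # fails sign-off: the "new_users" slice regressed by more than 0.02
```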

Why is user feedback ranked above offline accuracy for monitoring?

User outcome/feedback is ranked highest because it directly reflects what matters to the product. Offline metrics like accuracy can improve while user behavior stays the same or even worsens due to loss mismatch—users may not care about the specific error types that accuracy captures. The lecture gives examples of product-dependent feedback: recommenders might use click-through, while self-driving cars might use whether users intervene (e.g., take over autopilot). When outcome feedback isn’t feasible, proxy metrics (like repetitive or toxic outputs for text generation, or reduced personalized responses for recommenders) become the next best option.
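
For instance, a repetition-rate proxy for a text generation model could be computed as below. The exact definition is an assumption; the lecture only names repetitive output as an example of a proxy signal.

```python
def repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of duplicate n-grams: a crude proxy for degenerate, repetitive
    generations when no direct user-outcome signal is available."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


print(repetition_rate("the cat sat on the mat"))                       # 0.0
print(repetition_rate("I am sorry I am sorry I am sorry I am sorry"))  # 0.7
```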

When does periodic retraining work well, and what are its main failure modes?

Periodic retraining works as a pragmatic baseline: log data, sample uniformly up to labeling capacity, label with automated tools, retrain on a cadence (e.g., weekly using the last month), and validate via offline thresholds or manual spot checks plus online spot evaluations. It fails when (1) data volume exceeds logging/labeling capacity—especially with long-tail edge cases or expensive human-in-the-loop labeling where uniform sampling misses rare events; (2) retraining is too costly to run frequently; or (3) the cost of bad predictions is high, making frequent retraining risky because training data may be corrupted, attacked, or no longer representative.
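
One standard way to implement "sample uniformly up to labeling capacity" over an unbounded stream is reservoir sampling, sketched below; this is a generic technique rather than something the lecture prescribes.

```python
import random


def reservoir_sample(stream, capacity: int, seed: int = 0) -> list:
    """Keep a uniform random sample of `capacity` items from a stream of
    unknown length, using O(capacity) memory (classic reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < capacity:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < capacity:
                reservoir[j] = item
    return reservoir


# E.g. keep 1,000 requests per day out of millions for human labeling.
sampled = reservoir_sample(range(1_000_000), capacity=1_000)
print(len(sampled))
```

Uniform sampling like this is exactly what misses long-tail events, which is why the curation policy eventually needs to move beyond it.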

Why is distribution drift monitoring ranked lower than other signals?

Distribution drift is ranked lower because a changed input distribution doesn’t guarantee worse model performance. The lecture uses a toy example where distributions shift across the classifier boundary yet the model performs similarly, illustrating that drift can be irrelevant to outcomes. It also notes practical issues: drift detection requires choosing reference windows, storing data, selecting distance metrics, and defining projections—so drift alerts may be noisy or misleading. Drift remains valuable for debugging when performance degrades, but it’s not the primary “should we intervene?” signal.
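
A minimal drift check along these lines, assuming scalar projections and a two-sample Kolmogorov-Smirnov test as the distance metric (one of several reasonable choices), might look like this:

```python
import numpy as np
from scipy import stats


def drift_alert(reference: np.ndarray, production: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    """Flag drift when a two-sample KS test rejects 'same distribution'.

    `reference` is a stored window of a scalar projection from a period when
    the model was known to perform well; `production` is the same projection
    over a recent window. A low p-value flags drift, but as noted above,
    drift alone does not imply degraded performance.
    """
    statistic, p_value = stats.ks_2samp(reference, production)
    return p_value < p_threshold


rng = np.random.default_rng(0)
ref = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod = rng.normal(loc=0.3, scale=1.0, size=5_000)  # shifted inputs
print(drift_alert(ref, prod))  # True: distributions differ; the model may still be fine
```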

How do projections help with monitoring and drift detection for high-dimensional data?

Projections reduce high-dimensional inputs (images, text, large feature vectors) into lower-dimensional representations where drift detection becomes tractable. The lecture recommends this approach for unstructured/high-dimensional data: define analytical projections using domain knowledge (e.g., mean pixel value, sentence length) for interpretability, or use generic projections like random projections or auto-encoder embeddings. Projections also reappear in other parts of continual learning, such as curation and debugging, because they help turn complex data into measurable distributions.
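
A sketch of the analytical projections the lecture mentions (mean pixel value for images, sentence length for text): each function reduces a raw input to a scalar whose distribution can then be monitored or fed into a drift check like the one above.

```python
import numpy as np


def mean_pixel_value(image: np.ndarray) -> float:
    """Analytical projection for images: a single interpretable scalar."""
    return float(image.mean())


def sentence_length(text: str) -> int:
    """Analytical projection for text: number of whitespace-separated tokens."""
    return len(text.split())


# Each production input becomes a point in a low-dimensional, monitorable space.
images = [np.random.default_rng(i).integers(0, 256, size=(32, 32)) for i in range(100)]
pixel_projection = [mean_pixel_value(img) for img in images]
length_projection = [sentence_length(t) for t in
                     ["short request", "a much longer user request with many more tokens"]]
print(np.mean(pixel_projection), length_projection)
```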

What are the main options for selecting data during retraining (dataset formation)?

Dataset formation can be done by (1) training on all available curated data, requiring version control of data and curation rules; (2) using a sliding window to bias toward recent data, with checks for distribution changes between old and new windows; (3) sampling from the reservoir when full training is infeasible, including online batch selection where a larger candidate batch is ranked by a label-aware selection function and only top items are used; and (4) continual fine-tuning (training only on new data), which can be cost-effective but risks catastrophic forgetting and needs mature evaluation. The lecture highlights online batch selection as a promising practical technique, while discouraging continual fine-tuning today unless evaluation is strong.
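
A sketch of online batch selection under these assumptions: score a larger candidate batch with a label-aware selection function (here, per-example loss under a stand-in model) and keep only the highest-loss items for training. The scoring function, model, and batch sizes are illustrative.

```python
import numpy as np


def select_top_k(examples: np.ndarray, labels: np.ndarray,
                 per_example_loss, keep: int) -> np.ndarray:
    """Online batch selection: rank a candidate batch by a label-aware score
    (here, current-model loss) and return the indices of the `keep` hardest items."""
    losses = per_example_loss(examples, labels)
    return np.argsort(losses)[::-1][:keep]  # highest-loss examples first


def toy_loss(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Stand-in model: logistic score from the first feature, squared error."""
    preds = 1.0 / (1.0 + np.exp(-x[:, 0]))
    return (preds - y) ** 2


rng = np.random.default_rng(0)
candidate_x = rng.normal(size=(512, 8))      # candidate batch of 512 examples
candidate_y = rng.integers(0, 2, size=512)   # labels
chosen = select_top_k(candidate_x, candidate_y, toy_loss, keep=128)
print(chosen.shape)  # only the 128 highest-loss examples feed the training job
```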

Review Questions

  1. Which components of a retraining strategy determine what data is stored, what data is labeled, when retraining happens, and how “good enough” is decided?
  2. Give an example of a proxy metric for a specific ML task and explain what failure it would catch.
  3. Why might distribution drift alerts fail to predict user-facing degradation, and what signal would be more reliable?

Key Points

  1. Continual learning should be implemented as a structured retraining strategy with explicit rules for logging, curation, triggers, dataset formation, offline testing, and online testing.

  2. User feedback/outcome signals are the most valuable monitoring metrics because they reflect product impact and avoid loss mismatch.

  3. Periodic retraining is a reasonable starting point, but it breaks down with long-tail/rare events, high retraining costs, or high risk from bad predictions.

  4. Monitoring should follow an observability mindset: retain raw context for deep debugging, measure known unknowns with key metrics, and support unknown unknowns with flexible analysis.

  5. Distribution drift is useful for debugging but not sufficient as a primary intervention signal because models can remain accurate despite input distribution changes.

  6. Data curation and monitoring are two sides of the same coin: cohort analysis, projections, uncertainty, and feedback loops can drive both what to label and what to investigate.

  7. Dataset formation choices (all-data, sliding windows, sampling/online batch selection, or continual fine-tuning) trade off recency, coverage, cost, and forgetting risk.

Highlights

Continual learning is framed as an outer loop: logging → curation → retraining triggers → dataset formation → offline testing → online testing, all governed by a tunable retraining strategy.
The lecture ranks monitoring signals by usefulness: user outcomes first, then model metrics, then proxy metrics, with distribution drift treated as a debugging aid rather than a direct “fix now” signal.
Projections are recommended for drift detection in high-dimensional, unstructured data because they make monitoring feasible and interpretable.
A practical baseline is weekly retraining on the last month of data, but it can fail with rare edge cases, labeling constraints, or expensive retraining.
The recommended workflow for improvement is alert → subgroup analysis → error analysis → strategy updates (cohorts/projections/test cases) → retraining → sign-off and rollout.

Topics

  • Continual Learning
  • Retraining Strategy
  • Monitoring & Observability
  • Data Curation
  • Distribution Drift
