4. Archetypes - ML Projects - Full Stack Deep Learning

TL;DR

Machine learning projects cluster into three archetypes—improving, augmenting, and automating—and each archetype changes what “success” must prove.

Briefing

Machine learning projects tend to fall into three archetypes—improving an existing process, augmenting a manual workflow, or automating a manual workflow—and each archetype forces different success criteria. The biggest practical takeaway is that “better model metrics” only matter if they translate into measurable downstream outcomes and real business value; otherwise teams risk building technically impressive systems that don’t move the needle.

For projects that improve an existing process—such as upgrading IDE code completion, moving from coarse category-based recommendations to highly personalized ones, or replacing hand-tuned game AI rules with reinforcement learning—the central challenge is proving the improvement where it counts. That requires strong instrumentation and principled production deployment so teams can verify that model changes actually improve downstream metrics tied to user behavior. Even then, performance gains must map to business value: if recommendations increase clicks but the business doesn’t benefit, the work may not be worth doing. A further question is whether gains create a data flywheel: better predictions can lead to a better product, which attracts more users, which generates more (and ideally better) data for training.
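As a rough illustration of what proving the improvement "where it counts" can look like, the sketch below compares a downstream conversion rate between a control group and a group served by the new model, using a standard two-proportion z-test. The function name, the choice of conversion as the downstream metric, and the counts are illustrative assumptions, not details from the lecture.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rate of variant B (new model) against A (control).

    Returns (absolute_lift, z_score, two_sided_p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: 20,000 users per arm
lift, z, p = two_proportion_z_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p:.4f}")
```

A statistically significant lift in a proxy metric such as clicks is only the first half of the argument; whether that metric maps to business value still has to be established separately.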

For projects that augment manual processes—turning sketches into slides, offering email auto-completion, or speeding up radiology work—the bar shifts from “replace the workflow” to “make the human faster without breaking trust.” Teams need to determine how good the system must be to be genuinely useful and how to collect enough data to reach that level. The key difference from pure improvement projects is that augmentation often lacks built-in data collection, so teams must design mechanisms to gather feedback and training signals.

For projects that automate manual processes—fully automated customer support, fully automated design, or self-driving—feasibility hinges on failure rates and coverage. Even a model that achieves 99.999% accuracy on validation data may still fail more often than desired in the real world unless the dataset is comprehensive enough to represent what the system will encounter. That leads to two hard operational questions: how to guarantee real-world failure-rate targets, and how to label new data cheaply when the system makes mistakes. If predictions can’t be automatically verified, teams must rely on costly manual labeling, which can stall iteration.
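The arithmetic behind the failure-rate concern can be sketched in a few lines. The first helper shows how many failures a given accuracy still implies at volume; the second estimates how many failure-free trials are needed before a target failure rate can be claimed with a given confidence (the basis of the "rule of three" approximation, roughly 3 divided by the target rate at 95% confidence). Both function names and the traffic figure are assumptions for illustration, and the second calculation assumes independent trials drawn from the same distribution the system will see in deployment, which is exactly the coverage condition described above.

```python
import math


def expected_failures(accuracy: float, n_predictions: int) -> float:
    """Expected number of wrong predictions at a given accuracy and volume."""
    return (1.0 - accuracy) * n_predictions


def clean_trials_needed(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Failure-free trials needed to claim the failure rate is at most
    max_failure_rate with the given confidence.

    If the true rate were p, zero failures in n trials has probability
    (1 - p)^n; we need that probability to be at most 1 - confidence.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-max_failure_rate))


# Hypothetical volume: one million predictions per day at 99.999% accuracy
print(expected_failures(0.99999, 1_000_000))
# Trials needed to support a 1-in-100,000 failure-rate claim at 95% confidence
print(clean_trials_needed(1e-5))
```

Even under these idealized assumptions, supporting a very low failure-rate claim takes several hundred thousand clean trials, and the estimate says nothing about conditions the evaluation data never covered.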

The discussion then ties these archetypes to an impact-versus-feasibility trade-off. Improving existing processes is usually the most feasible but often has the lowest ceiling on impact. Impact can rise when teams build data loops that automatically collect and label user data, enabling continuous performance improvement and potentially expanding automation over time. Augmentation and automation become more feasible through product design and faster releases: once a “good enough” version ships, a from-scratch augmentation project becomes an improvement project whose gains can be measured against a live baseline. Product design examples include Facebook’s tag suggestions that let users confirm labels, Grammarly’s suggestions that keep the user in the loop, and Netflix explanations that invite feedback.
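One way to picture a suggestion-based data loop is as a function that turns each user response into a training label. The sketch below is a minimal illustration under assumed names (`Suggestion`, `LabeledExample`, `to_labeled_example` are all hypothetical); it does not describe how any of the products mentioned actually work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    item_id: str
    predicted_label: str


@dataclass
class LabeledExample:
    item_id: str
    label: str
    source: str  # "confirmed" or "corrected"


def to_labeled_example(
    suggestion: Suggestion, accepted: bool, correction: Optional[str] = None
) -> Optional[LabeledExample]:
    """Convert a user's response to a suggestion into a training example."""
    if accepted:
        # The user confirmed the model's guess, so it becomes a positive label
        return LabeledExample(suggestion.item_id, suggestion.predicted_label, "confirmed")
    if correction is not None:
        # The user supplied the right answer, which is the most valuable signal
        return LabeledExample(suggestion.item_id, correction, "corrected")
    # A bare rejection only says what the label is not, so no example is emitted
    return None


s = Suggestion(item_id="photo-17", predicted_label="alice")
print(to_labeled_example(s, accepted=True))
print(to_labeled_example(s, accepted=False, correction="bob"))
```

The design point is that the user stays in control of the final outcome while every interaction still feeds the training set.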

When full automation is the goal, teams typically add humans in the loop or enforce guardrails that restrict operation to safer boundaries. The session also highlights learning paradigms and applications beyond the classic supervised setup. DARPA’s early self-driving work used high-quality sensing (like lidar) to label data from cheaper sensors after the fact. Reinforcement learning reframes learning as specifying success rather than providing step-by-step instructions. Large-scale, hard-to-interpret domains like fraud detection illustrate cases where humans can’t process the data but models can. Finally, generative systems, such as video generation and voice replication, show how machine learning can create capabilities that didn’t exist before.
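The human-in-the-loop and guardrail ideas can be combined in a simple routing rule: act automatically only inside a supported operating domain and only when the model is confident, and hand everything else to a person. The sketch below assumes a model that reports a confidence score and a separate check for the operating domain; the function name, the threshold value, and the example inputs are hypothetical.

```python
from typing import Optional, Tuple


def route(
    prediction: str,
    confidence: float,
    in_supported_domain: bool,
    threshold: float = 0.9,
) -> Tuple[str, Optional[str]]:
    """Decide whether to act automatically or defer to a human.

    Returns (handler, suggestion), where handler is "auto" or "human".
    """
    if not in_supported_domain:
        # Guardrail: outside the safe operating boundary, never automate
        return ("human", None)
    if confidence >= threshold:
        return ("auto", prediction)
    # Low confidence: a human decides, with the prediction offered as a hint
    return ("human", prediction)


print(route("refund_approved", 0.97, in_supported_domain=True))
print(route("refund_approved", 0.62, in_supported_domain=True))
print(route("refund_approved", 0.99, in_supported_domain=False))
```

Cases routed to a human also produce reviewed decisions, which can be fed back as labels for the situations the model handles worst.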

Cornell Notes

Machine learning projects cluster into three archetypes: improving an existing process, augmenting a manual workflow, and automating a manual workflow. Each archetype demands different proof of success: improvement projects must show downstream metric gains and business value; augmentation projects must reach “good enough” quality and gather data despite limited built-in labeling; automation projects must meet strict real-world failure-rate targets, which requires comprehensive data coverage and an affordable way to label mistakes.

A central concept is the data flywheel: better models can improve the product, attract more users, and generate more data that further improves models. “Improve” projects are usually the most feasible, and their impact can grow through data loops; augmentation and automation projects become more feasible through product design and early “good enough” releases. For full automation, humans-in-the-loop and guardrails are common strategies to make safety targets achievable.

How do teams avoid the trap of “better model accuracy” that doesn’t improve what matters in production?

Improvement projects require instrumentation that connects model changes to downstream metrics tied to real user behavior. That means deploying in production in a principled way so teams can verify that the model improves the outcomes the business actually cares about (for example, whether recommendation changes increase meaningful engagement). Without this measurement chain, teams can’t tell whether offline performance gains change downstream behavior at all.

What’s the key difference between augmenting a manual process and improving an existing process?

Augmentation often lacks an automatic data collection mechanism. In an improvement setting, the system already exists and can generate feedback loops naturally. In augmentation, teams must decide how accurate the system must be to be useful and then design ways to collect enough data to reach that threshold—often relying on user interaction patterns to generate labels or corrections.

Why is validation-set accuracy insufficient for fully automated systems like self-driving?

Because real-world failure rates depend on whether the dataset covers the full range of conditions the system will face. Even extremely high validation accuracy (e.g., 99.999%) may not correspond to the desired real-world failure rate if the validation data misses rare but critical scenarios. The transcript frames this as a coverage and comprehensiveness problem: the dataset must represent everything needed to hit the target failure rate.

What makes labeling hard in automation projects, and how does that affect iteration speed?

If the system’s predictions can’t be automatically verified, then mistakes must be manually collected and labeled to improve the model. That creates a bottleneck: without cheap labeling from system failures, performance improvements become slow and expensive. The self-driving example highlights the need to capture data from mistakes so the system can learn from them.

How can product design increase the feasibility and impact of ML projects?

Product design can turn user interactions into usable feedback and labels while keeping humans in control when the system is uncertain. Examples given include Facebook tag suggestions that ask users to confirm labels, Grammarly suggestions that help without full automation, and Netflix explanations that invite feedback. These design choices help create data loops and make early releases more effective.

What alternative learning paradigms were cited as going beyond classic supervised learning?

The transcript points to DARPA’s early self-driving work, where expensive high-quality sensing (lidar) was used to label cheaper sensing after the fact by tracing the vehicle’s path with geometric precision. It also cites reinforcement learning, where success is specified without step-by-step instructions, which makes learning harder but enables different kinds of problem-solving. Large-scale fraud detection and generative video/voice systems were also mentioned as domains where the data scale or the nature of the task goes beyond what humans can do directly.

Review Questions

Which downstream metrics and business outcomes should be instrumented to prove that an “improve an existing process” ML project is actually working?
What dataset coverage and labeling strategy would you design to ensure a target real-world failure rate for a fully automated system?
How would you build a data flywheel for an augmentation workflow where labels aren’t naturally generated?

Key Points

1. Machine learning projects cluster into three archetypes (improving, augmenting, and automating), and each archetype changes what “success” must prove.
2. Improvement projects need production instrumentation that links model changes to downstream metrics and business value, not just offline accuracy.
3. Augmentation projects must define how good the system must be to help users and must engineer ways to collect enough data despite limited built-in labeling.
4. Automation projects require real-world failure-rate guarantees, which depend on dataset comprehensiveness and on capturing/labeling mistakes efficiently.
5. A data flywheel can turn model improvements into more users and more training data, but it only works if the product truly improves and data collection is automated.
6. Good product design and early “good enough” releases can raise feasibility by creating feedback loops and moving teams from augmentation toward measurable improvement.
7. Humans-in-the-loop and guardrails are common strategies to make full automation safer and more feasible.

Highlights

The core proof for “improving an existing process” is whether downstream behavior improves in production—and whether that improvement maps to business value.

For full automation, validation accuracy alone doesn’t ensure safety; the dataset must cover the real-world distribution well enough to meet a specific failure-rate target.

Product design can generate labels and feedback without full automation—examples include user-confirmed tag suggestions and Grammarly-style suggestions.

A data flywheel depends on a chain: better models → better product → more users → more (and better) data.

Reinforcement learning reframes learning around specifying success rather than providing step-by-step guidance, changing what signals are available for learning.

Topics

Machine Learning Archetypes
Data Flywheel
Downstream Metrics
Human-in-the-Loop
Reinforcement Learning