Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat ML development as an iterative lifecycle with feedback loops across planning, data, training/debugging, and deployment/testing.

Briefing

Machine learning projects fail less because models are “bad” and more because teams start with unclear goals, unrealistic feasibility assumptions, and weak planning for how a model will behave in the real world. A widely cited industry statistic—often framed as “85 percent of AI projects fail”—isn’t about a precise number so much as a warning: turning a trained model into a production system is hard, and many efforts get stuck in proof-of-concept demos, unclear success criteria, or poor scoping and management.

The lecture’s core prescription is to treat ML work like engineering with a full lifecycle, not a one-way pipeline from training to deployment. Planning and project setup come first: define requirements, goals, and constraints (including ethical considerations when relevant), then move into data collection and labeling. Crucially, the process is iterative. Teams should expect to loop back when data is too hard to obtain, labels are unreliable, or requirements conflict—such as accuracy versus latency trade-offs. Training and debugging is where teams build baselines (sometimes non-ML, like rule-based or OpenCV approaches), reproduce state-of-the-art results, and then iterate on model improvements; failures here can trigger more data collection or even a rethink of the task itself. Deployment and testing then include pilots, regression tests that catch backsliding on previously working behavior, and checks for bias. Even after rollout, performance gaps in the pilot or in production often require looping back to training, data, or even the original success metric and requirements.
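To make the baseline idea concrete, here is a minimal scripted-heuristic sketch (our illustration, not code from the lecture): a majority-class predictor that sets a floor any trained model should clear.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda x: most_common

def accuracy(predict, inputs, labels):
    return sum(predict(x) == y for x, y in zip(inputs, labels)) / len(labels)

# Hypothetical usage: the margin by which a trained model beats this
# floor is a rough measure of how much it actually learned.
baseline = majority_class_baseline(["cat", "dog", "cat", "cat"])
print(accuracy(baseline, ["img1", "img2"], ["cat", "dog"]))  # 0.5
```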

From there, the lecture shifts to how to choose which ML projects to pursue. Prioritization uses two axes: potential impact and feasibility. High-impact opportunities often come from “cheap prediction” (automating expensive expert judgments), reducing friction in the product experience, automating complicated manual processes, or replacing brittle rule-based logic with learned behavior. Feasibility is driven by three cost drivers: data availability (acquisition difficulty, labeling cost, stability, and security constraints), accuracy requirements (how costly wrong predictions are and how frequently the system must be correct), and intrinsic problem difficulty (whether the problem is well-defined as ML, whether similar work exists, compute/inference constraints, and whether a human can solve it from the same inputs).
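One lightweight way to operationalize that two-axis view is to score candidates and rank them. The lecture presents this as a chart, not code, so the scoring scheme below is our own sketch.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    impact: int           # 1-5: value of cheap prediction, friction removed, etc.
    data: int             # 1-5: ease of acquiring and labeling stable, usable data
    error_tolerance: int  # 1-5: 5 = wrong predictions are cheap, 1 = catastrophic
    tractability: int     # 1-5: 5 = well-trodden problem, 1 = open research

def feasibility(c: Candidate) -> float:
    # The lecture's three cost drivers, averaged into one rough score.
    return (c.data + c.error_tolerance + c.tractability) / 3

candidates = [
    Candidate("replace brittle rules engine", 4, 4, 4, 4),
    Candidate("fully autonomous driving", 5, 2, 1, 1),
]
# Rank toward the high-impact, high-feasibility quadrant.
for c in sorted(candidates, key=lambda c: c.impact * feasibility(c), reverse=True):
    print(f"{c.name}: impact={c.impact}, feasibility={feasibility(c):.1f}")
```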

The lecture also emphasizes that accuracy demands can explode costs. Raising required accuracy by “more nines” typically requires substantially more data and higher-quality labels, so project cost can scale super-linearly with stringent accuracy targets.
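To see why each extra “nine” is expensive, assume (as a rough empirical rule of thumb, not a claim from the lecture) that test error falls as a power law in dataset size, error ∝ n^(-α) with small α; inverting that curve shows how fast data needs grow.

```python
def data_multiplier(current_error, target_error, alpha=0.3):
    """How much more data is needed to reach target_error, assuming an
    empirical power-law learning curve: error ~ n**(-alpha).
    alpha=0.3 is an illustrative guess; real curves vary by task."""
    return (current_error / target_error) ** (1 / alpha)

# One more "nine": 99% -> 99.9% accuracy means 10x lower error.
print(f"{data_multiplier(0.01, 0.001):.0f}x more data")  # ~2154x under this assumption
```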

To make project development manageable, teams should pick a single optimization metric at a time, even though real systems need multiple metrics like latency and robustness. The metric should reflect production requirements, and it should be revisited as the team closes gaps. For example, in a running case study on pose estimation for robotic grasping, the system’s requirements include position error under about one centimeter, angular error within roughly five degrees, and inference under 100 milliseconds; early work should prioritize the biggest shortfall (e.g., angular error) and only later optimize runtime, as in the sketch below.
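A sketch of that requirement-driven metric selection (the thresholds come from the lecture's case study; the measurements and helper function are hypothetical):

```python
# Requirements from the lecture's pose-estimation case study.
requirements = {"position_error_cm": 1.0, "angular_error_deg": 5.0, "inference_ms": 100.0}

# Hypothetical measurements for an early candidate model.
measured = {"position_error_cm": 0.8, "angular_error_deg": 12.0, "inference_ms": 150.0}

def biggest_shortfall(measured, requirements, deferred=("inference_ms",)):
    """Pick the next metric to optimize: the one furthest past its
    requirement, skipping metrics deliberately deferred (runtime here)
    until the core task works. Ratios above 1.0 mean 'not yet met'."""
    gaps = {k: measured[k] / requirements[k]
            for k in requirements if k not in deferred}
    return max(gaps, key=gaps.get), gaps

metric, gaps = biggest_shortfall(measured, requirements)
print(metric)  # angular_error_deg: 2.4x over target, so optimize it first
```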

Finally, baselines are treated as essential guardrails. Good baselines provide a lower bound on achievable performance and help diagnose whether the model is underfitting or overfitting—using the same loss curves but different baseline comparisons to decide what to fix next. Baselines can be external (published results), internal (scripted heuristics, linear models), or human performance, with trade-offs between baseline quality and labeling cost. The lecture closes by tying everything together: iterate through the lifecycle, choose feasible high-impact projects, optimize with disciplined metrics, and use baselines to ensure effort targets the real bottleneck.

Cornell Notes

The lecture argues that ML success depends less on model training and more on disciplined project engineering: planning, data, training/debugging, and deployment/testing—repeated in loops as new evidence appears. Many projects fail because they’re poorly scoped, technically infeasible, or lack clear success criteria, so teams must assess feasibility using data availability, accuracy requirements, and intrinsic problem difficulty. During development, teams should optimize one metric at a time (even if production needs multiple metrics) and revisit that metric as performance improves. Baselines are the diagnostic foundation: comparing training/validation behavior against strong baselines reveals whether the next step should address underfitting or overfitting. Together, lifecycle iteration, metric focus, and baseline-driven debugging reduce the risk of getting stuck in demos that never become reliable production systems.

Why does the lecture treat ML projects as iterative loops rather than a linear pipeline?

Planning and project setup, data collection/labeling, training/debugging, and deployment/testing are described as stages that frequently feed back into earlier decisions. Teams loop back when data is too hard to collect or labels are inconsistent, when debugging reveals overfitting or unreliable labeling, or when requirements conflict (e.g., accuracy versus latency). Even after deployment, pilots can expose distribution shift, labeling mismatch, or edge cases that force new data collection, retraining, or even a redefinition of success metrics and requirements.

What makes a machine learning project “feasible,” and how do data, accuracy, and problem difficulty drive cost?

Feasibility is framed through three main cost drivers. First is data availability: how hard it is to acquire data, how expensive it is to label it, how stable the data is over time, and whether data security prevents collecting/inspecting user data. Second is accuracy requirements: how costly wrong predictions are (catastrophic in self-driving versus annoying in recommenders) and how frequently the system must be correct. Third is intrinsic difficulty: whether the problem is well-defined as ML, whether published work exists, compute/inference constraints from papers, and whether a human can solve the task from the same inputs.

Why does the lecture warn that tightening accuracy targets can make projects dramatically more expensive?

Project cost is described as scaling super-linearly with accuracy requirements. Moving from “good” to “many nines” accuracy typically demands much more data and higher-quality labels to reduce error rates to extremely low levels. The practical takeaway is that teams should treat accuracy targets as major budget drivers and validate them against what downstream use actually needs.

How should teams choose a single metric when real systems care about multiple metrics?

The lecture recommends a pragmatic single-number optimization mindset during model iteration. Teams start from production requirements, then train candidate models to see where they fail those thresholds. They prioritize the metric farthest from target (e.g., angular error) while temporarily thresholding others (e.g., position error) and ignoring runtime until the core task works. As performance improves and gaps close, the team revisits which metric to optimize next (e.g., shifting attention to latency once accuracy is sufficient).

What do baselines do beyond “measuring performance,” and how do they guide next steps?

Baselines provide a lower bound on expected performance and help determine whether the model is underfitting or overfitting. The lecture highlights that identical loss curves can imply different fixes depending on the baseline: if training performance is already near the baseline/human level, the gap between training and validation points to overfitting; if training performance is far below the baseline, the priority is underfitting—improving the model’s ability to learn before addressing generalization.
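As a minimal sketch of that decision logic (single-number comparisons and the gap tolerance are our simplification of reading full loss curves):

```python
def next_step(train_loss, val_loss, baseline_loss, gap_tol=0.1):
    """Decide what to fix next by comparing losses against a baseline.
    gap_tol is an assumed tolerance; real diagnosis reads the curves."""
    if train_loss > baseline_loss:
        # Can't match the baseline even on seen data: underfitting.
        # Increase capacity, train longer, reduce regularization.
        return "address underfitting"
    if val_loss - train_loss > gap_tol:
        # Training is fine but validation lags: overfitting.
        # Add data, regularize, or simplify the model.
        return "address overfitting"
    return "raise the bar: stronger baseline or next metric"

print(next_step(train_loss=0.9, val_loss=1.0, baseline_loss=0.5))  # address underfitting
print(next_step(train_loss=0.3, val_loss=0.8, baseline_loss=0.5))  # address overfitting
```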

How do the lecture’s project archetypes change what “success” and “risk” look like?

Three archetypes are used: software 2.0 (improving rule-governed systems with ML), human-in-the-loop (humans review outputs before execution), and autonomous systems (limited or no human supervision). Each archetype shifts key questions: software 2.0 focuses on whether improvements translate to business value and enable a data flywheel; human-in-the-loop focuses on whether the system is useful enough and how to collect sufficient data; autonomous systems focus on acceptable failure rates, guardrails, and how cheaply data can be labeled from system behavior.

Review Questions

  1. What specific conditions would justify looping back from deployment/testing to data collection, and what evidence would trigger that loop?
  2. How would you decide which metric to optimize first if multiple production requirements are not yet met?
  3. Give an example of how two different baselines could change the interpretation of the same training/validation loss curves.

Key Points

  1. Treat ML development as an iterative lifecycle with feedback loops across planning, data, training/debugging, and deployment/testing.

  2. Use feasibility assessment to avoid doomed projects by evaluating data availability, accuracy requirements, and intrinsic problem difficulty.

  3. Prioritize projects using an impact-versus-feasibility lens, and look for opportunities like cheap prediction and friction reduction in products.

  4. Optimize one metric at a time during model iteration, using thresholds for other metrics, and revisit the chosen metric as gaps close.

  5. Expect accuracy targets to drive cost super-linearly; “more nines” usually means much more data and higher-quality labels.

  6. Build strong baselines (scripted, external, and human where appropriate) to diagnose underfitting versus overfitting and to set realistic expectations.

  7. Design product and system guardrails (including human-in-the-loop or constrained scopes) to make high-risk ML applications more feasible.

Highlights

A single trained model is rarely the end goal; production reliability requires regression tests, bias checks, pilots, and ongoing loops back to data and requirements when reality diverges from evaluation.
Accuracy requirements can scale super-linearly in cost: pushing for extreme “nines” typically demands far more data and label quality.
Baselines aren’t optional—they determine whether the next engineering step should address underfitting or overfitting, even when loss curves look identical.
Choosing one optimization metric is a practical necessity during training, but the metric should change as the team closes the biggest gaps.
Project archetypes (software 2.0, human-in-the-loop, autonomous systems) shift the definition of success and the risk controls needed.
