Lecture 2: Setting Up Machine Learning Projects - Full Stack Deep Learning - March 2019

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat machine learning work as a lifecycle with loops: planning, data collection/labeling, training/debugging, and deployment each can force revisiting earlier decisions.

Briefing

Machine learning projects succeed or fail less on model choice than on how well teams plan, collect data, test beyond validation scores, and set measurable goals that match real-world constraints. A full-stack robotics case study—training a pose estimation system to predict object position and orientation for robotic grasping—anchors the lecture and shows how requirements like centimeter-level position accuracy, degree-level angular accuracy, and 100-millisecond inference time shape what gets optimized first.

The lecture frames machine learning work as a lifecycle with feedback loops rather than a straight line. Teams start with planning and project setup: deciding whether pose estimation is worth pursuing, defining goals, estimating resources, and ensuring infrastructure exists. Next comes data collection and labeling, where new findings can force a return to planning—data may be too hard to gather or labels may be too expensive. Training and debugging then becomes a loop as well: teams may implement simple baselines (e.g., OpenCV) before deep models, reproduce results from papers or public datasets, and spend significant time debugging rather than only training. Overfitting, unreliable labels, or a task that proves infeasible can send teams back to data collection or even to redefining the objective. After deployment in controlled settings, tests and monitoring determine whether the system regresses or fails in the real world, which can trigger more training, more data, or revised assumptions such as the wrong performance metric.
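
As a concrete illustration of the "simple baseline before deep models" step, a scripted OpenCV baseline for this kind of pose task might look like the sketch below, which estimates a 2D position and in-plane angle from the largest contour. The thresholding-plus-contour approach and every parameter here are assumptions made for illustration, not the pipeline used in the lecture.

```python
import cv2

def estimate_pose_2d(image_bgr):
    """Hypothetical scripted baseline: position (cx, cy) in pixels and an
    in-plane angle in degrees for the largest bright blob, or None."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding assumes the object stands out from the background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # OpenCV >= 4 returns (contours, hierarchy).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    (cx, cy), _, angle = cv2.minAreaRect(largest)  # rotated bounding box
    return cx, cy, angle
```

A baseline this crude will likely miss the accuracy requirements, but it gives the deep model a floor to beat and exercises the rest of the pipeline end to end.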

A key ingredient missing from any such checklist is knowledge of the state of the art in the domain. Planning requires knowing what is realistically possible, both to set achievable goals and to identify what to try next, so literature review becomes part of project setup. The lecture also distinguishes validation-set evaluation from broader testing. Validation score is one signal, but teams also need codebase sanity checks (can the model train end-to-end on a smaller dataset?), regression prevention (does a code change break training?), and targeted tests for rare but critical cases where mistakes matter.
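
As one example of what such a check can look like in practice, the pytest-style sketch below runs a few optimization steps on a tiny fixed batch and asserts that the loss goes down. It assumes a PyTorch model, and the build_model and tiny_batch fixtures are hypothetical names, not code from the lecture.

```python
import torch

def test_loss_decreases_on_tiny_batch(build_model, tiny_batch):
    """Sanity/regression check: a handful of SGD steps on a small fixed
    batch should reduce the loss, catching broken wiring or gradients."""
    model = build_model()                     # hypothetical model factory
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = torch.nn.CrossEntropyLoss()
    x, y = tiny_batch                         # hypothetical small fixture batch
    initial_loss = loss_fn(model(x), y).item()
    for _ in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    # If a code change breaks training, this assertion fails in CI.
    assert loss.item() < initial_loss
```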

When discussing how to handle “important” examples, the lecture emphasizes that collecting more data can help, but the deeper challenge is that black-box models can improve overall validation metrics while getting worse on specific edge cases. That risk motivates infrequent but principled test-set evaluation and careful dataset management to avoid overfitting to test data.

Project selection and feasibility are treated as a two-axis problem: impact versus feasibility. Impact can come from “cheap prediction” (AI reducing the marginal cost of making predictions, enabling new applications) and from “software 2.0” ideas where systems search for rules or code that meet goals, potentially replacing complex human-designed pipelines. Feasibility is driven mainly by data availability, the required accuracy, and overall problem difficulty. The lecture warns that accuracy requirements scale costs sharply—adding another “nine” of accuracy can multiply project cost dramatically—so product design can reduce how strictly the model must be trusted. Examples include suggestion-based systems like Grammarly and explanation-driven recommendations like Netflix.

Finally, the lecture lays out how to choose metrics and baselines. Real projects rarely optimize a single number, so teams combine metrics using thresholds or composite measures (precision/recall concepts and mean average precision are used as examples). Baselines provide a lower bound on expected performance; comparing training/validation gaps against human or scripted baselines helps diagnose whether the issue is overfitting, data scarcity, or model architecture. For the pose estimation case, teams enumerate requirements, train candidate models, and then pick the first optimization target—often the metric with the largest gap to requirements—while temporarily thresholding less critical objectives like position error until angular accuracy improves.
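
To make the "pick the first optimization target" step concrete, here is a small sketch that encodes the lecture's rough pose-estimation requirements and selects the metric with the largest relative gap. The selection rule and the measured numbers are illustrative assumptions, not tooling from the lecture.

```python
# Requirements echo the lecture's rough targets; lower is better for all three.
requirements = {
    "position_error_cm": 1.0,
    "angular_error_deg": 1.0,
    "inference_time_ms": 100.0,
}

def pick_optimization_target(measured):
    """Return the metric whose measured value overshoots its requirement
    by the largest factor; the other metrics get thresholded for now."""
    gaps = {name: measured[name] / limit for name, limit in requirements.items()}
    return max(gaps, key=gaps.get)

measured = {"position_error_cm": 0.8, "angular_error_deg": 7.5, "inference_time_ms": 120.0}
print(pick_optimization_target(measured))  # -> "angular_error_deg"
```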

Cornell Notes

Machine learning projects run on a lifecycle—planning, data collection/labeling, training/debugging, and deployment—with loops that send teams back when data is too hard, labels are unreliable, models overfit, or real-world performance diverges from validation. Success depends on more than validation accuracy: teams need broader testing such as regression checks, training sanity metrics, and targeted tests for rare but high-stakes cases. Project selection should weigh impact against feasibility, with feasibility driven largely by data availability and the required accuracy level (which can scale costs steeply). Metrics must reflect real constraints, often using thresholds or composite strategies rather than a single score. Tight baselines (human, scripted, or published) clarify what performance is realistically achievable and what to fix next.

Why does the lecture treat machine learning as a lifecycle with feedback loops rather than a linear pipeline?

Planning decisions (goals, resources, infrastructure) can change after data collection reveals that gathering or labeling is too difficult. Training and debugging can loop back when overfitting appears, labels are unreliable, or the task is too hard to meet requirements. Deployment can also trigger new cycles: a system that looks good on training/validation may fail in real-world conditions, forcing changes in training/data or even the original metric and assumptions.

How is “testing” broader than evaluating on a validation set?

Validation-set performance is one testing signal, but it doesn’t cover code regressions or targeted failure modes. The lecture highlights tests such as: (1) codebase sanity checks, e.g., does a single training step run end-to-end without breaking; (2) dataset-size sanity, e.g., does training on a smaller dataset reach an expected loss; (3) targeted tests for critical subsets, so that images or scenarios that are rare but must be handled correctly get checks that reflect their importance.
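
A targeted test of the third kind might look like the sketch below, where predict and critical_cases are hypothetical placeholders and the 0.95 bar is an arbitrary illustrative threshold.

```python
def test_critical_slice_accuracy(predict, critical_cases):
    """critical_cases: hand-curated (input, expected_label) pairs that are
    rare in the validation set but must be handled correctly in production."""
    correct = sum(predict(x) == y for x, y in critical_cases)
    accuracy = correct / len(critical_cases)
    assert accuracy >= 0.95, f"critical-slice accuracy {accuracy:.2f} below bar"
```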

What’s the risk of repeatedly using test-set performance to guide improvements?

If teams evaluate on test data frequently and use those results to make decisions (through model selection or training tweaks), they can overfit to the test set itself. The lecture recommends evaluating on the test set infrequently and periodically recollecting its data, keeping the test set as a more stable, less-tuned benchmark.

What framework guides choosing which machine learning projects to prioritize?

Impact versus feasibility. Impact mental models include “cheap prediction” (AI lowers marginal prediction cost, enabling new decision-making uses) and “software 2.0” (systems search for solutions given goals, potentially replacing complex human-written pipelines). Feasibility is driven mainly by data availability, required accuracy, and problem difficulty; data availability includes both collection difficulty and labeling cost.

How should teams choose a single metric when real projects care about multiple objectives?

The lecture argues that machine learning systems train most effectively against a single number, while real projects care about several objectives, so teams approximate by combining metrics using strategies like averaging or thresholding. It uses precision/recall and mean average precision (mAP) as examples: precision measures correctness among predicted positives, recall measures how many of the actual positives are found, and mAP summarizes precision across recall levels, averaged over classes. In the pose estimation case, teams enumerate requirements (position error, angular error, inference time) and may optimize angular error first while thresholding position error until angular accuracy meets its target.
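
For concreteness, the snippet below computes precision, recall, and per-class average precision on made-up labels and scores using scikit-learn; the numbers are invented, and mAP is this average precision averaged over classes (or queries).

```python
from sklearn.metrics import precision_score, recall_score, average_precision_score

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                   # ground-truth labels (invented)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions at one threshold
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # model confidence scores

print(precision_score(y_true, y_pred))           # 0.75: correct among predicted positives
print(recall_score(y_true, y_pred))              # 0.75: actual positives that were found
print(average_precision_score(y_true, y_score))  # summarizes precision across recall levels
```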

Why do baselines matter, and what do they diagnose?

Baselines provide a lower bound on expected performance. Comparing training/validation gaps against baselines helps diagnose whether the model is overfitting (training good, validation worse), whether the model is fundamentally underperforming (validation close to a poor baseline), or whether architecture/training needs change. Baselines can be internal (requirements-based), published results (with fairness checks), scripted methods (like OpenCV), and human performance (including ensembles or domain experts).
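
A toy version of that diagnosis might look like the sketch below; the gap tolerance and messages are illustrative assumptions rather than rules from the lecture.

```python
def diagnose(train_acc, val_acc, baseline_acc, gap_tol=0.05):
    """Compare train/validation accuracy against a baseline to suggest
    what to fix next (illustrative heuristic, not a lecture rule)."""
    if val_acc < baseline_acc:
        return "below a simple baseline: revisit data, labels, or architecture"
    if train_acc - val_acc > gap_tol:
        return "large train/val gap: likely overfitting, add data or regularize"
    return "near or above baseline with a small gap: try a bigger model or better features"

print(diagnose(train_acc=0.95, val_acc=0.72, baseline_acc=0.80))
# -> "below a simple baseline: revisit data, labels, or architecture"
```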

Review Questions

  1. In what situations should teams loop back from training/debugging to data collection or even to planning?
  2. Explain how thresholding one metric while optimizing another can prevent trade-offs from derailing a project early on.
  3. What kinds of baselines (human, scripted, published) are most useful for diagnosing different failure modes?

Key Points

  1. Treat machine learning work as a lifecycle with loops: planning, data collection/labeling, training/debugging, and deployment each can force revisiting earlier decisions.

  2. Validation-set accuracy is not enough; add regression tests, training sanity checks, and targeted tests for rare but critical cases.

  3. Project selection should balance impact and feasibility, using “cheap prediction” and “software 2.0” as impact lenses and data availability plus accuracy requirements as feasibility drivers.

  4. Accuracy requirements can drive costs nonlinearly; product design can reduce how strictly the model must be trusted (e.g., suggestions with explanations).

  5. Choose metrics that reflect real constraints; when multiple objectives matter, use thresholds or composite metrics rather than a single naive score.

  6. Baselines provide a lower bound on achievable performance; compare model learning curves against human/scripted/published baselines to decide what to fix next.

Highlights

Machine learning projects often fail because the wrong metric or assumption survives into deployment—real-world constraints (like latency) can force a return to training, data collection, or even goal redesign.
Targeted testing matters: a model can improve overall validation metrics while getting worse on rare, high-stakes examples, a regression that validation averages may hide.
Accuracy requirements can scale project cost sharply—adding another “nine” of accuracy can multiply costs dramatically, so product design can shift the required level of trust.
Baselines turn ambiguous learning curves into actionable decisions by showing whether the problem is overfitting, underfitting, or fundamentally limited by the task/data setup.
For pose estimation, the lecture demonstrates metric prioritization: optimize the largest gap to requirements (often angular error) while thresholding other errors until the system meets downstream needs.
