Lecture 2: Setting Up Machine Learning Projects - Full Stack Deep Learning - March 2019
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Machine learning projects succeed or fail less on model choice than on how well teams plan, collect data, test beyond validation scores, and set measurable goals that match real-world constraints. A full-stack robotics case study—training a pose estimation system to predict object position and orientation for robotic grasping—anchors the lecture and shows how requirements like centimeter-level position accuracy, degree-level angular accuracy, and 100-millisecond inference time shape what gets optimized first.
The lecture frames machine learning work as a lifecycle with feedback loops rather than a straight line. Teams start with planning and project setup: deciding whether pose estimation is worth pursuing, defining goals, estimating resources, and ensuring infrastructure exists. Next comes data collection and labeling, where new findings can force a return to planning—data may be too hard to gather or labels may be too expensive. Training and debugging then becomes a loop as well: teams may implement simple baselines (e.g., OpenCV) before deep models, reproduce results from papers or public datasets, and spend significant time debugging rather than only training. Overfitting, unreliable labels, or a task that proves infeasible can send teams back to data collection or even to redefining the objective. After deployment in controlled settings, tests and monitoring determine whether the system regresses or fails in the real world, which can trigger more training, more data, or revised assumptions such as the wrong performance metric.
One ingredient missing from any simple checklist is knowledge of the domain's state of the art. Planning requires knowing what is realistically possible, both to set achievable goals and to identify what to try next, so literature review becomes part of project setup. The lecture also distinguishes validation-set evaluation from broader testing. Validation score is one signal, but teams also need codebase sanity checks (can the model train end-to-end on a smaller dataset?), regression prevention (does a code change break training?), and targeted tests for rare but critical cases where mistakes matter.
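The "train end-to-end on a smaller dataset" sanity check can be sketched as a unit test. The miniature logistic-regression "model" and the loss threshold below are invented for illustration; the idea is simply that training code which cannot memorize a handful of examples almost certainly has a bug in the loss, the gradients, or the data pipeline.

```python
import math
import random

def train_tiny(n_steps=500, lr=0.5):
    """Fit a 1-D logistic regression to 4 linearly separable points
    and return the final average training loss. A healthy training
    loop should drive this loss close to zero (memorization)."""
    random.seed(0)
    data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # trivially separable
    w, b = random.random(), 0.0
    loss = 0.0
    for _ in range(n_steps):
        loss, gw, gb = 0.0, 0.0, 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))       # sigmoid
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            gw += (p - y) * x                               # dL/dw
            gb += (p - y)                                   # dL/db
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return loss / len(data)

# Sanity check: if this fails, suspect the training code, not the model class.
assert train_tiny() < 0.05
```

Run as a regular test in CI, this doubles as regression prevention: a code change that silently breaks training will fail the assertion long before a full training run does.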
When discussing how to handle “important” examples, the lecture emphasizes that collecting more data can help, but the deeper challenge is that black-box models can improve overall validation metrics while performing worse on specific edge cases. That risk motivates infrequent but principled test-set evaluation and careful dataset management to avoid overfitting to the test data.
Project selection and feasibility are treated as a two-axis problem: impact versus feasibility. Impact can come from “cheap prediction” (AI reducing the marginal cost of making predictions, enabling new applications) and from “software 2.0” ideas where systems search for rules or code that meet goals, potentially replacing complex human-designed pipelines. Feasibility is driven mainly by data availability, the required accuracy, and overall problem difficulty. The lecture warns that accuracy requirements scale costs sharply—adding another “nine” of accuracy can multiply project cost dramatically—so product design can reduce how strictly the model must be trusted. Examples include suggestion-based systems like Grammarly and explanation-driven recommendations like Netflix.
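The nonlinear cost claim can be made concrete by counting "nines" of accuracy. The cost model below is invented purely to show the shape of the curve (the lecture gives no formula); the only grounded idea is that each added nine, e.g. 99% to 99.9%, cuts the allowed error rate tenfold.

```python
import math

def nines(accuracy):
    """Count 'nines' of accuracy: 0.99 -> 2, 0.999 -> 3, and so on."""
    return -math.log10(1.0 - accuracy)

def relative_cost(accuracy, cost_per_nine=10.0):
    """Hypothetical cost model: each extra nine multiplies project cost
    by cost_per_nine. The constant is illustrative, not an estimate."""
    return cost_per_nine ** nines(accuracy)

# Under this toy model, moving from 99% to 99.9% roughly 10x-es the cost,
# which is why relaxing the requirement via product design is so valuable.
ratio = relative_cost(0.999) / relative_cost(0.99)
```

This is why suggestion-style interfaces matter: a Grammarly-like product that proposes edits for a human to accept needs far fewer nines than one that rewrites text autonomously.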
Finally, the lecture lays out how to choose metrics and baselines. Real projects rarely optimize a single number, so teams combine metrics using thresholds or composite measures (precision/recall concepts and mean average precision are used as examples). Baselines provide a lower bound on expected performance; comparing training/validation gaps against human or scripted baselines helps diagnose whether the issue is overfitting, data scarcity, or model architecture. For the pose estimation case, teams enumerate requirements, train candidate models, and then pick the first optimization target—often the metric with the largest gap to requirements—while temporarily thresholding less critical objectives like position error until angular accuracy improves.
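The target-selection step at the end can be sketched as follows. The metric names and numbers are hypothetical stand-ins for the pose-estimation requirements mentioned above (all metrics here are "lower is better"); the grounded idea is picking the metric with the largest gap to its requirement while thresholding the rest.

```python
# Invented requirements and measurements for the pose-estimation example.
requirements = {
    "position_error_cm": 1.0,   # centimeter-level position accuracy
    "angle_error_deg": 1.0,     # degree-level angular accuracy
    "inference_ms": 100.0,      # 100 ms inference-time budget
}
measured = {
    "position_error_cm": 0.9,   # already meets its requirement
    "angle_error_deg": 4.5,     # far from its requirement
    "inference_ms": 120.0,      # slightly over budget
}

def pick_target(req, got):
    """Return the metric with the largest relative shortfall.
    A ratio above 1.0 means the requirement is not yet met."""
    gaps = {name: got[name] / req[name] for name in req}
    return max(gaps, key=gaps.get)

# Angular accuracy has the biggest gap (4.5x requirement), so it becomes
# the optimization target; position error is merely thresholded for now.
target = pick_target(requirements, measured)
```

The same routine re-run after each training round naturally shifts the target as gaps close, matching the lecture's advice to revisit the choice rather than fix it once.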
Cornell Notes
Machine learning projects run on a lifecycle—planning, data collection/labeling, training/debugging, and deployment—with loops that send teams back when data is too hard, labels are unreliable, models overfit, or real-world performance diverges from validation. Success depends on more than validation accuracy: teams need broader testing such as regression checks, training sanity metrics, and targeted tests for rare but high-stakes cases. Project selection should weigh impact against feasibility, with feasibility driven largely by data availability and the required accuracy level (which can scale costs steeply). Metrics must reflect real constraints, often using thresholds or composite strategies rather than a single score. Tight baselines (human, scripted, or published) clarify what performance is realistically achievable and what to fix next.
Why does the lecture treat machine learning as a lifecycle with feedback loops rather than a linear pipeline?
How is “testing” broader than evaluating on a validation set?
What’s the risk of repeatedly using test-set performance to guide improvements?
What framework guides choosing which machine learning projects to prioritize?
How should teams choose a single metric when real projects care about multiple objectives?
Why do baselines matter, and what do they diagnose?
Review Questions
- In what situations should teams loop back from training/debugging to data collection or even to planning?
- Explain how thresholding one metric while optimizing another can prevent trade-offs from derailing a project early on.
- What kinds of baselines (human, scripted, published) are most useful for diagnosing different failure modes?
Key Points
1. Treat machine learning work as a lifecycle with loops: planning, data collection/labeling, training/debugging, and deployment each can force revisiting earlier decisions.
2. Validation-set accuracy is not enough; add regression tests, training sanity checks, and targeted tests for rare but critical cases.
3. Project selection should balance impact and feasibility, using “cheap prediction” and “software 2.0” as impact lenses and data availability plus accuracy requirements as feasibility drivers.
4. Accuracy requirements can drive costs nonlinearly; product design can reduce how strictly the model must be trusted (e.g., suggestions with explanations).
5. Choose metrics that reflect real constraints; when multiple objectives matter, use thresholds or composite metrics rather than a single naive score.
6. Baselines provide a lower bound on achievable performance; compare model learning curves against human/scripted/published baselines to decide what to fix next.
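The baseline comparison in the last key point can be sketched as a rule of thumb. The decision thresholds below are illustrative, not canonical; the grounded idea is reading train and validation error against a human or scripted baseline to decide what to fix next.

```python
def diagnose(train_err, val_err, baseline_err, gap_tol=0.05):
    """Rough diagnosis from error rates (lower is better).
    gap_tol is an invented tolerance for the generalization gap."""
    if train_err > baseline_err:
        # Can't match the baseline even on training data:
        # suspect model capacity, features, or optimization.
        return "underfitting"
    if val_err - train_err > gap_tol:
        # Large train/validation gap: more data or regularization.
        return "overfitting"
    # Both gaps small: revisit architecture or the metric itself.
    return "meets baseline"

assert diagnose(train_err=0.20, val_err=0.22, baseline_err=0.10) == "underfitting"
assert diagnose(train_err=0.05, val_err=0.18, baseline_err=0.10) == "overfitting"
```

Which baseline to use depends on the failure mode being probed: a human baseline bounds what labels support, a scripted baseline (e.g., classical OpenCV) bounds what simple methods achieve, and a published baseline bounds what the architecture family can do.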