
3. Prioritizing - ML Projects - Full Stack Deep Learning

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use a 2x2 grid to prioritize ML work by combining business impact with feasibility (cost and risk).

Briefing

Picking the right machine learning projects comes down to a simple but disciplined tradeoff: pursue work that delivers high business impact while staying feasible in cost and execution risk. A practical way to frame that decision is a 2x2 grid with impact on one axis and cost on the other; projects combining high impact with low cost land in the priority quadrant. From there, mental models help identify where ML can pay off quickly. Two recurring targets stand out: places where “cheap prediction” can be applied broadly, and parts of a pipeline that currently rely on complicated, brittle manual rules.
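As a rough illustration, the grid can be reduced to a scoring sketch. Everything below (the threshold, the scores, and the candidate projects) is an invented example, not something from the lecture:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    impact: float  # estimated business impact, scaled 0-1
    cost: float    # estimated cost/risk (data, accuracy, difficulty), scaled 0-1

def quadrant(c: Candidate, threshold: float = 0.5) -> str:
    """Place a candidate project in the impact-vs-cost 2x2 grid."""
    if c.impact >= threshold and c.cost < threshold:
        return "prioritize"      # high impact, low cost: do these first
    if c.impact >= threshold:
        return "de-risk first"   # high impact, high cost: reduce uncertainty before committing
    if c.cost < threshold:
        return "cheap, low payoff"
    return "avoid"

projects = [
    Candidate("pose estimation for grasping", impact=0.8, cost=0.4),
    Candidate("replace hand-tuned ranking rules", impact=0.7, cost=0.6),
    Candidate("open-ended dialogue agent", impact=0.6, cost=0.9),
]
for p in projects:
    print(f"{p.name}: {quadrant(p)}")
```

The point is not the scoring itself but the discipline: every candidate gets an explicit impact estimate and an explicit cost estimate before anyone commits.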

The “cheap prediction” idea draws from the economics of AI, which argues that AI’s core shift is lowering the cost of making predictions. Because prediction is central to decision-making, cheaper prediction tends to spread into domains where it was previously too expensive to automate. In business terms, that means looking for workflows where prediction can be embedded into many decisions—so even modest accuracy improvements can translate into meaningful operational or revenue gains.

A second lens comes from “software 2.0,” associated with Andrej Karpathy’s framing: instead of writing explicit rules, teams specify goals and use data plus optimization to search for programs that achieve them. When this approach works, it tends to generalize better than hand-coded logic and can be implemented as neural-network-like programs, which opens the door to computational advantages. The implication for prioritization is straightforward: rule-heavy systems—especially those that are slow, brittle, or hard to maintain—are strong candidates for replacing hand-tuned heuristics with learned models.
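A minimal sketch makes the contrast concrete. The spam-filter task, toy corpus, and labels here are hypothetical, and the scikit-learn pipeline merely stands in for “data plus optimization searching for a program”:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Software 1.0: explicit, hand-written rules that must anticipate every case.
def is_spam_rules(text: str) -> bool:
    t = text.lower()
    return "free money" in t or "click here" in t or t.count("!") > 3

# Software 2.0: specify the goal (separate spam from not-spam) and let
# data plus optimization find the program.
texts = [
    "free money, click here now!!!",
    "claim your free money prize today",
    "meeting moved to 3pm tomorrow",
    "please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam (toy labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(is_spam_rules("win a cash prize, click now"))       # False: no rule matches the rephrasing
print(model.predict(["win a cash prize, click now"])[0])  # the learned model can still score it
```

The hand-written rule misses the rephrased message entirely, while the learned model can pick up on overlapping vocabulary; that brittleness-versus-generalization gap is exactly what makes rule-heavy pipelines attractive targets.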

Feasibility then becomes the other half of the equation, driven by three main cost levers. First is data availability, including not just whether data exists but how expensive it is to label. Second is the accuracy requirement: pushing performance from 99% to 99.9% can demand disproportionately more effort, because the remaining errors often come from rare cases that require collecting and labeling more “hard” examples. The cost growth with accuracy is described as super-linear, with the rough intuition that reducing error by 90% may require around 10x more data—often the dominant cost driver. Third is problem difficulty, which is hard to estimate but can be assessed using signals from published work and the compute required to reproduce results.
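The 10x intuition follows from treating data needs as roughly inverse to the remaining error rate; a few lines of arithmetic show how fast that compounds (the inverse-error scaling is a heuristic reading of the lecture's intuition, not an exact law):

```python
def data_multiplier(current_accuracy: float, target_accuracy: float) -> float:
    """Heuristic: data needed scales roughly inversely with remaining error,
    so cutting error by 90% costs about 10x more data."""
    return (1.0 - current_accuracy) / (1.0 - target_accuracy)

print(data_multiplier(0.99, 0.999))   # 99% -> 99.9%: error falls 1% -> 0.1%, ~10x more data
print(data_multiplier(0.99, 0.9999))  # 99% -> 99.99%: ~100x more data
```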

Compute intensity matters because state-of-the-art results may rely on thousands of GPUs, making them unrealistic for smaller teams to replicate. Deployment constraints also change feasibility: large models with many parameters can become impractical in compute-restricted environments, and accuracy targets may be unattainable under those limits.

Finally, “how costly are wrong predictions?” can dominate the difficulty assessment. In safety-critical settings like self-driving, the cost of failure is so high that diminishing returns on accuracy may still be worth pursuing—or may be insufficient because reliability and robustness remain unsolved. In lower-stakes contexts, the same error rate might be acceptable, making the project more feasible.

The discussion also maps what tends to remain difficult in machine learning. Beyond supervised learning, many open problems involve complex outputs (3D reconstruction, video prediction, dialogue), reliability under out-of-distribution conditions, robust performance against adversarial attacks, and generalization beyond interpolation. Even within supervised learning, tasks like speech recognition in noisy real-world conditions, symbolic reasoning, and planning/causality are highlighted as persistent challenges. Returning to a running example of pose estimation for robotic grasping, the framework suggests it can be a strong target: the pipeline likely contains rule-based bottlenecks suited for software 2.0, accuracy needs may be manageable if failure costs are low, and while published results exist, adapting them to a specific robot and environment remains non-trivial.

Cornell Notes

Project selection for machine learning is framed as a tradeoff between impact and feasibility. High-impact work often comes from two places: embedding “cheap prediction” into decision-making and replacing brittle, hand-tuned rule systems with learned models (“software 2.0”). Feasibility is driven mainly by data availability (including labeling cost), the accuracy requirement (cost rises super-linearly as accuracy tightens), and problem difficulty (including compute demands and deployment constraints). The cost of wrong predictions can make an otherwise similar task far harder—safety-critical failures can dominate the evaluation. Using these levers helps teams decide what to build and how risky it will be to reach the needed performance.

How does the “impact vs. feasibility” 2x2 framework guide ML project selection in practice?

Teams start by estimating business impact and feasibility, then prioritize work that sits in the high-impact/low-cost quadrant. Feasibility is assessed using cost drivers (data, accuracy, and problem difficulty) and risk signals (reproducibility and compute needs). Impact is often maximized by targeting parts of the pipeline where predictions can be made cheaply and used frequently, or where current workflows depend on complex manual rules that are slow and brittle.

What does “cheap prediction” mean, and where should it influence project choice?

“Economics of AI” frames AI’s central change as reducing the cost of prediction. Since prediction drives decisions, cheaper prediction makes prediction viable in many more contexts than before. The practical takeaway is to look for business workflows where embedding prediction into decisions creates broad leverage—so even incremental accuracy improvements can translate into significant operational or financial impact.

How does “software 2.0” differ from traditional software, and why does that matter for ML prioritization?

Traditional “software 1.0” relies on explicit, human-written instructions compiled into machine code. “Software 2.0” shifts to specifying goals and using data plus optimization to search for programs that satisfy them. When the search succeeds, the result often generalizes better than hand-coded logic, and because the learned programs take the form of neural networks, they also gain computational advantages. That makes rule-heavy, heuristic pipelines strong candidates for replacement with learned models.

Why can improving accuracy from 99% to 99.9% become dramatically more expensive?

The intuition is that 99% accuracy means roughly 1 mistake per 100 cases. Moving to 99.9% requires cutting errors by about 90%, which often means addressing rare failure modes. Those rare cases require collecting and labeling many more examples that look like the mistakes, and since data availability and labeling cost are major cost drivers, the overall project cost can rise by around an order of magnitude (or more).

What signals help estimate problem difficulty when there’s limited prior work?

One signal is whether strong published results are recent; if everything is new (e.g., within the last year or two), reproduction risk and technical effort tend to be higher. Another is compute intensity: if state-of-the-art work uses thousands of GPUs, smaller teams may not be able to replicate it. Finally, deployment constraints matter—large models may be infeasible in compute-restricted environments, and accuracy targets may not be achievable under those limits.

How should teams incorporate the cost of wrong predictions into feasibility?

Wrong-prediction cost is domain-specific and can dominate the difficulty assessment. In recommendation systems, a wrong suggestion might annoy users only occasionally, so the cost can be relatively low. In safety-critical automation like self-driving, a wrong prediction can cause crashes or injuries, making the effective cost extremely high and raising the bar for reliability and robustness.
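A back-of-the-envelope expected-cost calculation shows why identical error rates diverge so sharply across domains; all figures below are invented for illustration:

```python
def expected_daily_cost(error_rate: float, cost_per_error: float,
                        decisions_per_day: int) -> float:
    """Expected cost of wrong predictions = error rate x unit cost x volume."""
    return error_rate * cost_per_error * decisions_per_day

# Recommender: 1% bad suggestions, each mildly annoying (~$0.05 equivalent).
print(expected_daily_cost(0.01, 0.05, 1_000_000))   # ~$500/day: tolerable
# Safety-critical system: 1% failures, each potentially catastrophic.
print(expected_daily_cost(0.01, 1_000_000, 1_000))  # ~$10,000,000/day: untenable
```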

Review Questions

  1. What are the three main cost drivers of ML project expense, and how does each one affect feasibility differently?
  2. Give two examples of how “cheap prediction” could create business impact, and explain why cheap prediction changes what’s automatable.
  3. Why might out-of-distribution robustness be a bigger challenge than improving average accuracy on a benchmark?

Key Points

  1. Use a 2x2 grid to prioritize ML work by combining business impact with feasibility (cost and risk).
  2. Target opportunities where prediction can be made cheap and used frequently in decision-making, since lower prediction costs expand where automation is viable.
  3. Replace brittle, rule-based pipeline components with learned models when goals can be specified and data/optimization can find effective programs (“software 2.0”).
  4. Treat data availability as the primary cost driver, and include labeling cost explicitly when estimating project expense.
  5. Expect accuracy requirements to drive super-linear cost growth; rare error cases often require large increases in data and labeling.
  6. Estimate problem difficulty using reproducibility signals (recency of published work) and compute requirements (e.g., GPU counts) rather than relying on benchmark results alone.
  7. Incorporate the domain cost of wrong predictions: safety-critical failures can make a task far harder even if average accuracy seems reachable.

Highlights

A practical prioritization method pairs high business impact with high feasibility (low cost and risk), using an impact-vs-cost 2x2 grid.
Accuracy improvements near the top end can be disproportionately expensive because remaining errors often come from rare cases that require much more data and labeling.
Compute intensity is a feasibility constraint: state-of-the-art results built on thousands of GPUs may be unrealistic for smaller teams to reproduce.
The cost of wrong predictions can dominate the entire project assessment, especially in safety-critical systems like self-driving.
Hard ML problems often involve complex outputs, reliability under out-of-distribution conditions, and generalization beyond interpolation.

Topics

  • ML Project Prioritization
  • Cheap Prediction
  • Software 2.0
  • Feasibility Cost Drivers
  • Accuracy vs Cost