5. Metrics - ML Projects - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Pick a single optimization metric early, but expect to revise it as constraints become clearer and performance improves.
Briefing
Choosing the right metric is the make-or-break decision that determines whether an ML project can be steered toward real-world usefulness. Because real systems involve trade-offs—accuracy, speed, error tolerance, fairness—engineers often need to compress multiple goals into a single number they can push up or down. That metric choice usually starts early, but it also changes as the model improves and as the team learns what constraints actually matter.
The transcript breaks down common classification metrics using a confusion matrix: accuracy measures the overall percentage of correct predictions; precision is the fraction of predicted positives that are truly positive (true positives divided by all predicted positives); recall is the fraction of actual positives that the model successfully finds (true positives divided by all actual positives). The key practical point is that different models can look better or worse depending on which metric is emphasized. When comparing several models with different precision/recall trade-offs, there’s no single “obvious” winner from raw numbers alone, so teams must decide how to combine or prioritize metrics.
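The definitions above can be sketched directly from confusion-matrix counts. This is a minimal illustration with made-up counts for two hypothetical models on the same 100-example test set: model A favors precision, model B favors recall, and neither dominates on all three metrics.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    return accuracy, precision, recall

# Hypothetical counts (not from the transcript): A is conservative, B is aggressive.
acc_a, prec_a, rec_a = classification_metrics(tp=30, fp=5, fn=20, tn=45)
acc_b, prec_b, rec_b = classification_metrics(tp=45, fp=25, fn=5, tn=25)
# A wins on accuracy and precision; B wins on recall - no single obvious winner.
```

Which model is "better" depends entirely on which metric the team emphasizes, which is exactly the trade-off the transcript highlights.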
One approach is to average metrics, which can make one model dominate even if its trade-offs are undesirable in deployment. Another common strategy is “optimize one, threshold the rest”: pick the primary metric to optimize, then enforce minimum or maximum acceptable levels for other constraints. For example, if accuracy is the priority but the system must also run within a time budget, engineers may optimize accuracy while requiring inference latency to stay under a set limit. Deciding which metrics to threshold versus optimize depends on domain judgment—what tolerances are acceptable—and on how sensitive each metric is to model changes. If one metric is already far from acceptable, it may be better to threshold the metrics that are already close to target and focus optimization on the worst one.
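The "optimize one, threshold the rest" rule reduces to a filter-then-argmax. A minimal sketch, using the accuracy-versus-latency example from the paragraph above; the model names and numbers are illustrative, not from the transcript:

```python
def select_model(models, optimize, max_latency_ms):
    """Among models meeting the latency budget, pick the one with the
    best value of the metric being optimized. Returns None if no model
    satisfies the constraint."""
    feasible = [m for m in models if m["latency_ms"] <= max_latency_ms]
    return max(feasible, key=lambda m: m[optimize]) if feasible else None

models = [
    {"name": "big",   "accuracy": 0.95, "latency_ms": 300},
    {"name": "small", "accuracy": 0.90, "latency_ms": 80},
]
best = select_model(models, optimize="accuracy", max_latency_ms=100)
# "big" is more accurate but violates the 100 ms budget, so "small" wins.
```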
The transcript also highlights how to choose threshold values using baselines and metric importance. If a baseline model already performs well on a metric, the threshold can be set to demand improvement over that baseline. If a metric is critical to the downstream task, the threshold can be stricter. A concrete example uses a precision/recall rule: select the model with the best precision subject to recall exceeding 0.6, which can flip the “best model” compared with other selection rules.
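The precision/recall selection rule from this example can be written out directly. The candidate metric values below are invented to show how the rule can flip the ranking: the model with the highest raw precision fails the recall floor and is eliminated.

```python
candidates = [
    {"name": "m1", "precision": 0.95, "recall": 0.50},
    {"name": "m2", "precision": 0.80, "recall": 0.70},
    {"name": "m3", "precision": 0.75, "recall": 0.90},
]

# Rule from the transcript: best precision subject to recall > 0.6.
eligible = [m for m in candidates if m["recall"] > 0.6]
best = max(eligible, key=lambda m: m["precision"])
# m1 has the highest precision overall but misses the recall floor, so m2 wins.
```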
For multi-class or trade-off-heavy settings, the discussion points to mean average precision (mAP), derived from the area under the precision-recall curve. mAP averages average precision across classes, capturing performance across varying decision thresholds rather than a single operating point.
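One common formulation of average precision (there are several variants) averages precision@k over the ranks at which true positives appear, then mAP averages that across classes. A minimal sketch of that formulation:

```python
def average_precision(scores, labels):
    """AP as the mean of precision@k at each rank k where a true
    positive appears (one standard area-under-PR-curve formulation)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_class):
    """mAP: average AP across (scores, labels) pairs, one per class."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)

# Perfect ranking (all positives first) gives AP = 1.0.
ap_perfect = average_precision([0.9, 0.8, 0.1], [1, 1, 0])
# mAP over two classes, one ranked perfectly and one imperfectly.
map_val = mean_average_precision([
    ([0.9, 0.8, 0.1], [1, 1, 0]),
    ([0.9, 0.8, 0.7], [0, 1, 1]),
])
```

Because AP integrates over the whole ranking, it captures behavior across decision thresholds rather than at a single operating point, which is why it suits trade-off-heavy settings.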
The most detailed example ties metric selection to a real deployment target: real-time grasping. Requirements include less than one centimeter of position error, less than five degrees of angle error, and inference under 100 milliseconds. After training baseline models, the team compares current performance to these requirements: if angular error is around 60 degrees, position error is between 0.75 and 1.25 centimeters, and inference is 300 milliseconds, the rational next step is to prioritize the most violated constraint. The transcript describes thresholding position error at 1 centimeter to discard clearly inadequate models, temporarily ignoring runtime until accuracy is closer to feasible, and revisiting the metric strategy as angular error approaches the target.
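The staged strategy in the grasping example amounts to: hard-filter on the position-error threshold, defer the latency constraint, and optimize the most-violated metric (angular error). A sketch with hypothetical baseline numbers consistent with the ranges quoted above:

```python
# Deployment requirements from the grasping example.
REQUIRED = {"pos_err_cm": 1.0, "angle_err_deg": 5.0, "latency_ms": 100}

# Hypothetical baseline models (numbers are illustrative).
baselines = [
    {"name": "a", "pos_err_cm": 0.80, "angle_err_deg": 60, "latency_ms": 300},
    {"name": "b", "pos_err_cm": 1.25, "angle_err_deg": 55, "latency_ms": 250},
]

# Stage 1: threshold position error at 1 cm to discard inadequate models;
# ignore latency for now, since accuracy is nowhere near feasible yet.
viable = [m for m in baselines if m["pos_err_cm"] <= REQUIRED["pos_err_cm"]]

# Stage 2: optimize the most-violated metric - angular error.
focus = min(viable, key=lambda m: m["angle_err_deg"])
```

As angular error approaches the 5-degree target, the team would revisit this rule and fold the 100 ms latency budget back in as a hard threshold.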
In the Q&A, the transcript addresses “invalid metrics” as measures that either don’t correspond to what matters or can’t be measured reliably (e.g., judging generated image realism without a consistent human-evaluation process). It also covers satisficing: when bias elimination matters more than raw accuracy, teams need datasets and tests that can measure performance gaps across relevant categories, then set thresholds on those gaps. For human-in-the-loop systems and recommendation engines, reducing bias is framed as improving training data coverage for underperforming categories after identifying bias through targeted evaluation tests.
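Satisficing on bias can be made concrete as a threshold on the largest per-category performance gap. A minimal sketch; the category names, accuracies, and the 0.05 tolerance are all hypothetical:

```python
def max_category_gap(per_category_accuracy):
    """Largest accuracy gap between any two categories - a simple
    target for satisficing: require the gap to stay under a tolerance."""
    vals = list(per_category_accuracy.values())
    return max(vals) - min(vals)

# Hypothetical per-category accuracies from a targeted evaluation set.
acc = {"group_a": 0.92, "group_b": 0.81, "group_c": 0.88}
gap = max_category_gap(acc)
passes = gap <= 0.05  # hypothetical fairness threshold; fails here

# Per the transcript, a failing gap points at which categories need
# better training-data coverage (here, group_b).
worst = min(acc, key=acc.get)
```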
Cornell Notes
Metric selection turns messy real-world goals into something engineers can optimize. The transcript distinguishes accuracy, precision, and recall, then shows why model ranking changes depending on whether teams average metrics, optimize one metric while thresholding others, or use trade-off metrics like mean average precision (mAP). Thresholds should reflect domain tolerances, baseline performance, and which constraints are most violated. A grasping example demonstrates a staged strategy: discard models that miss key error limits, ignore runtime until accuracy is closer to feasible, then incorporate speed as the model improves. The Q&A adds that “invalid metrics” either don’t measure what matters or can’t be measured reliably, and that bias-focused work often requires satisficing via category-based tests and thresholds.
Why can’t teams rely on accuracy alone when precision and recall differ across models?
What does “optimize one metric and threshold the rest” mean in practice?
How should threshold values be chosen rather than picked arbitrarily?
What is mean average precision (mAP) and why does it matter for trade-offs?
How does the grasping example justify changing metrics over time?
What makes a metric “invalid,” and how does satisficing relate to bias?
Review Questions
- Give an example of how two models could swap order depending on whether the team optimizes precision or recall.
- In the grasping scenario, what decision rule would you use to prioritize which metric to optimize next, and why?
- How would you design a metric and dataset to measure bias in a recommendation engine without relying on overall accuracy alone?
Key Points
1. Pick a single optimization metric early, but expect to revise it as constraints become clearer and performance improves.
2. Use precision and recall to separate different error costs: false positives (precision) versus missed positives (recall).
3. When multiple goals conflict, optimize one metric while thresholding others that must meet hard deployment constraints.
4. Choose thresholds using domain tolerances, baseline performance, and how critical each metric is to the downstream task.
5. For trade-off-heavy tasks, consider mean average precision (mAP) because it summarizes performance across the precision-recall curve.
6. Avoid “invalid metrics” that either don’t measure what matters or can’t be measured reliably.
7. Bias-focused work often requires satisficing: evaluate category-level performance gaps and enforce thresholds rather than maximizing overall accuracy.