
5. Metrics - ML Projects - Full Stack Deep Learning

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Pick a single optimization metric early, but expect to revise it as constraints become clearer and performance improves.

Briefing

Choosing the right metric is the make-or-break decision that determines whether an ML project can be steered toward real-world usefulness. Because real systems involve trade-offs—accuracy, speed, error tolerance, fairness—engineers often need to compress multiple goals into a single number they can push up or down. That metric choice usually starts early, but it also changes as the model improves and as the team learns what constraints actually matter.

The transcript breaks down common classification metrics using a confusion matrix: accuracy measures the overall percentage of correct predictions; precision is the fraction of predicted positives that are truly positive (true positives divided by all predicted positives); recall is the fraction of actual positives that the model successfully finds (true positives divided by all actual positives). The key practical point is that different models can look better or worse depending on which metric is emphasized. When comparing several models with different precision/recall trade-offs, there’s no single “obvious” winner from raw numbers alone, so teams must decide how to combine or prioritize metrics.
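
To make the definitions concrete, here is a minimal Python sketch computing all three metrics from hypothetical confusion-matrix counts (the numbers are illustrative, not from the transcript):

```python
# Accuracy, precision, and recall from binary confusion-matrix counts.
# The counts are hypothetical, chosen only to illustrate the formulas.
tp, fp, fn, tn = 80, 20, 40, 860  # true pos., false pos., false neg., true neg.

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives the model finds

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.940 precision=0.800 recall=0.667
```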

One approach is to average metrics, which can make one model dominate even if its trade-offs are undesirable in deployment. Another common strategy is “optimize one, threshold the rest”: pick the primary metric to optimize, then enforce minimum or maximum acceptable levels for other constraints. For example, if accuracy is the priority but the system must also run within a time budget, engineers may optimize accuracy while requiring inference latency to stay under a set limit. Deciding which metrics to threshold versus optimize depends on domain judgment—what tolerances are acceptable—and on how sensitive each metric is to model changes. If one metric is already far from acceptable, it may be better to threshold the metrics that are already close to target and focus optimization on the worst one.
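
A sketch of the “optimize one, threshold the rest” rule, with hypothetical model stats and a 100 ms latency budget standing in for the deployment constraint:

```python
# Maximize accuracy among models that satisfy a hard latency budget.
# The models and their numbers are hypothetical.
models = [
    {"name": "A", "accuracy": 0.91, "latency_ms": 140},
    {"name": "B", "accuracy": 0.89, "latency_ms": 70},
    {"name": "C", "accuracy": 0.93, "latency_ms": 220},
]

LATENCY_BUDGET_MS = 100  # thresholded constraint, not optimized

feasible = [m for m in models if m["latency_ms"] <= LATENCY_BUDGET_MS]
best = max(feasible, key=lambda m: m["accuracy"])
print(best["name"])  # "B": the most accurate model that fits the budget
```

Note that model C has the best raw accuracy but is excluded outright: the threshold turns the latency budget into a hard constraint rather than another term in an average.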

The transcript also highlights how to choose threshold values using baselines and metric importance. If a baseline model already performs well on a metric, the threshold can be set to demand improvement over that baseline. If a metric is critical to the downstream task, the threshold can be stricter. A concrete example uses a precision/recall rule: select the model with the best precision subject to recall exceeding 0.6, which can flip the “best model” compared with other selection rules.

For multi-class or trade-off-heavy settings, the discussion points to mean average precision (mAP), derived from the area under the precision-recall curve. mAP averages average precision across classes, capturing performance across varying decision thresholds rather than a single operating point.
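
A sketch of the computation using scikit-learn’s `average_precision_score` on hypothetical one-vs-rest labels and scores for three classes:

```python
# mAP as the mean of per-class average precision (AP), where each AP
# summarizes the area under that class's precision-recall curve.
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical one-vs-rest labels and scores: 6 samples, 3 classes.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                   [1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_score = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6],
                    [0.5, 0.4, 0.1], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6]])

ap_per_class = [average_precision_score(y_true[:, k], y_score[:, k])
                for k in range(y_true.shape[1])]
map_score = float(np.mean(ap_per_class))
print(f"per-class AP: {ap_per_class}, mAP: {map_score:.3f}")
```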

The most detailed example ties metric selection to a real deployment target: real-time grasping. Requirements include less than one centimeter of position error, less than five degrees of angle error, and inference under 100 milliseconds. After training baseline models, the team compares current performance to these requirements: if angular error is around 60 degrees, position error is between 0.75 and 1.25 centimeters, and inference is 300 milliseconds, the rational next step is to prioritize the most violated constraint. The transcript describes thresholding position error at 1 centimeter to discard clearly inadequate models, temporarily ignoring runtime until accuracy is closer to feasible, and revisiting the metric strategy as angular error approaches the target.
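
That staged strategy could look roughly like this in code; the candidate models and their stats are hypothetical, loosely mirroring the numbers above:

```python
# Staged metric strategy from the grasping example: first discard models
# that violate the 1 cm position-error threshold, then optimize angular
# error (the most violated metric); runtime is ignored for now.
candidates = [
    {"name": "baseline1", "pos_err_cm": 0.75, "angle_err_deg": 62, "latency_ms": 310},
    {"name": "baseline2", "pos_err_cm": 1.25, "angle_err_deg": 55, "latency_ms": 290},
    {"name": "baseline3", "pos_err_cm": 0.90, "angle_err_deg": 58, "latency_ms": 300},
]

POS_ERR_THRESHOLD_CM = 1.0  # hard requirement: <1 cm position error

feasible = [m for m in candidates if m["pos_err_cm"] < POS_ERR_THRESHOLD_CM]
best = min(feasible, key=lambda m: m["angle_err_deg"])
print(best["name"])  # "baseline3"; revisit latency once angle error nears 5°
```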

In the Q&A, the transcript addresses “invalid metrics” as measures that either don’t correspond to what matters or can’t be measured reliably (e.g., judging generated image realism without a consistent human-evaluation process). It also covers satisficing: when bias elimination matters more than raw accuracy, teams need datasets and tests that can measure performance gaps across relevant categories, then set thresholds on those gaps. For human-in-the-loop systems and recommendation engines, reducing bias is framed as improving training data coverage for underperforming categories after identifying bias through targeted evaluation tests.

Cornell Notes

Metric selection turns messy real-world goals into something engineers can optimize. The transcript distinguishes accuracy, precision, and recall, then shows why model ranking changes depending on whether teams average metrics, optimize one metric while thresholding others, or use trade-off metrics like mean average precision (mAP). Thresholds should reflect domain tolerances, baseline performance, and which constraints are most violated. A grasping example demonstrates a staged strategy: discard models that miss key error limits, ignore runtime until accuracy is closer to feasible, then incorporate speed as the model improves. The Q&A adds that “invalid metrics” either don’t measure what matters or can’t be measured reliably, and that bias-focused work often requires satisficing via category-based tests and thresholds.

Why can’t teams rely on accuracy alone when precision and recall differ across models?

Accuracy mixes all outcomes, but precision and recall isolate different failure modes. Precision answers: when the model predicts “positive,” how often is it correct (true positives / all predicted positives)? Recall answers: when the truth is “positive,” how often does the model catch it (true positives / all actual positives)? Two models can have similar accuracy while one produces many false positives (low precision) and the other misses many true positives (low recall), so the “best” choice depends on which error type is more costly in deployment.
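
A tiny worked example: two hypothetical models with identical accuracy on 100 examples (20 positive, 80 negative) but opposite failure modes:

```python
# Two hypothetical models with the same accuracy but different error types.
# Counts: (true pos., false pos., false neg., true neg.) out of 100 examples.
for name, tp, fp, fn, tn in [("X", 18, 10, 2, 70), ("Y", 10, 2, 10, 78)]:
    acc = (tp + tn) / 100
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
# X: accuracy=0.88 precision=0.64 recall=0.90  (many false positives)
# Y: accuracy=0.88 precision=0.83 recall=0.50  (misses many positives)
```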

What does “optimize one metric and threshold the rest” mean in practice?

Instead of collapsing everything into a single averaged score, teams pick a primary objective to optimize and enforce hard constraints on other metrics. Example: prioritize accuracy but require inference time to stay under a latency budget. The transcript also notes how to choose which metrics to threshold: use domain judgment about tolerances, prefer thresholding metrics that are already near acceptable, and optimize the metrics that are most sensitive to model choice or currently far from target.

How should threshold values be chosen rather than picked arbitrarily?

The transcript suggests using baseline performance and metric importance. If a baseline already meets a target, the threshold can be set to require improvement over that baseline. If a metric is critical to the downstream task, the threshold should be stricter. A specific example is selecting the model with the best precision subject to recall being at least 0.6—this forces the system to maintain coverage while still improving correctness of positive predictions.
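
A sketch of that rule, showing how the recall floor can exclude the model with the best raw precision (all numbers hypothetical):

```python
# "Best precision subject to recall >= 0.6": a constrained selection rule
# that can flip the winner versus ranking on precision alone.
models = [
    {"name": "A", "precision": 0.95, "recall": 0.45},  # wins on raw precision
    {"name": "B", "precision": 0.85, "recall": 0.70},
    {"name": "C", "precision": 0.80, "recall": 0.90},
]

eligible = [m for m in models if m["recall"] >= 0.6]
best = max(eligible, key=lambda m: m["precision"])
print(best["name"])  # "B": A is excluded because its recall misses the floor
```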

What is mean average precision (mAP) and why does it matter for trade-offs?

mAP is built from the precision-recall curve: as recall increases, precision typically decreases. Average precision summarizes the area under that curve for a class, and mean average precision averages that value across classes. This matters because it evaluates performance across decision thresholds rather than locking the system to a single operating point.

How does the grasping example justify changing metrics over time?

Requirements include <1 cm position error, <5° angle error, and <100 ms inference. Baselines show angular error around 60°, position error roughly 0.75–1.25 cm, and inference around 300 ms. With angular error far off, the strategy prioritizes bringing the most violated constraint closer to target—e.g., thresholding position error at 1 cm to eliminate clearly inadequate models, ignoring runtime temporarily because accuracy is still too poor, then revisiting latency once errors improve.

What makes a metric “invalid,” and how does satisficing relate to bias?

A metric is invalid if it doesn’t correspond to what the system must achieve or if it can’t be measured in a dependable way. The transcript contrasts measurable metrics with generative-image realism, which often requires subjective human judgment. For bias, satisficing means eliminating unacceptable performance gaps may matter more than maximizing overall accuracy. That requires datasets and tests that isolate relevant categories, then setting thresholds on performance differences between those categories and average performance.
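
One way to frame the satisficing check in code, with hypothetical categories and accuracies:

```python
# Satisficing check for bias: per-category accuracy must not fall more
# than a fixed gap below overall accuracy. All numbers are hypothetical.
overall_accuracy = 0.90
per_category_accuracy = {"cat_a": 0.91, "cat_b": 0.88, "cat_c": 0.78}

MAX_GAP = 0.05  # largest acceptable shortfall versus overall performance

violations = {cat: round(overall_accuracy - acc, 2)
              for cat, acc in per_category_accuracy.items()
              if overall_accuracy - acc > MAX_GAP}
print(violations)  # {'cat_c': 0.12}: flag cat_c for more training-data coverage
```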

Review Questions

  1. Give an example of how two models could swap order depending on whether the team optimizes precision or recall.
  2. In the grasping scenario, what decision rule would you use to prioritize which metric to optimize next, and why?
  3. How would you design a metric and dataset to measure bias in a recommendation engine without relying on overall accuracy alone?

Key Points

  1. Pick a single optimization metric early, but expect to revise it as constraints become clearer and performance improves.
  2. Use precision and recall to separate different error costs: false positives (precision) versus missed positives (recall).
  3. When multiple goals conflict, optimize one metric while thresholding others that must meet hard deployment constraints.
  4. Choose thresholds using domain tolerances, baseline performance, and how critical each metric is to the downstream task.
  5. For trade-off-heavy tasks, consider mean average precision (mAP) because it summarizes performance across the precision-recall curve.
  6. Avoid “invalid metrics” that either don’t measure what matters or can’t be measured reliably.
  7. Bias-focused work often requires satisficing: evaluate category-level performance gaps and enforce thresholds rather than maximizing overall accuracy.

Highlights

Precision and recall target different kinds of mistakes, so model rankings can change dramatically depending on which metric is optimized.
“Optimize one, threshold the rest” turns deployment constraints like latency into enforceable requirements while still improving the main objective.
Mean average precision (mAP) averages average precision across classes using the area under precision-recall curves.
The grasping example uses a staged metric strategy: fix the most violated requirement first (e.g., angular error), then incorporate runtime once accuracy is closer to feasible.
Bias elimination is treated as a measurement-and-threshold problem using category-based datasets and tests, not just as an accuracy improvement task.

Topics

  • Metric Selection
  • Precision and Recall
  • Thresholding Strategies
  • Mean Average Precision
  • Bias and Satisficing

Mentioned

  • mAP