
6. Baselines - ML Projects - Full Stack Deep Learning

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Baselines provide a lower bound on expected performance, and tighter bounds make them more actionable.

Briefing

Baselines act as a reality check for model performance by setting a lower bound on what a system can achieve. The tighter that lower bound, the more useful it becomes for deciding what to do next. Without a baseline, teams can misread progress: two models with identical training and validation curves can imply very different levels of success depending on where the baseline sits. If human performance sits near 20% error, a model whose curve approaches it suggests a small remaining gap; if human performance is closer to 5% error, the same model signals a much larger shortfall and different improvement priorities.

Baselines matter because they anchor interpretation. Error curves alone show relative movement, but they don’t tell whether the absolute level is good enough to justify further effort. That distinction drives downstream decisions—whether to invest in better architectures, more data, different labeling strategies, or entirely new approaches. In practice, teams can end up making the wrong call if they skip baselines and jump straight into training a sophisticated model (for example, a ResNet with aggressive data augmentation) under a tight deadline.

Where to find baselines depends on what resources are available. External sources include business or engineering requirements and published results, but comparisons must be fair: if the team’s dataset is harder than the one used in a paper, expecting the same performance is unrealistic. Internal baselines can be created quickly by scripting the problem using tools like OpenCV, building rule-based systems, or using simpler machine learning models such as bag-of-words or linear regression. The expectation is straightforward: a deep learning model should perform at least as well as these simpler references.
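
To make this concrete, here is a minimal sketch of an internal simple-model baseline, assuming a text classification task and scikit-learn; the pipeline, toy data, and metric are illustrative assumptions rather than details from the lecture.

```python
# Minimal sketch of a bag-of-words + linear-model baseline for a
# hypothetical text classification task. The lecture names these
# techniques generically; this exact pipeline is illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy data standing in for a real train/validation split.
train_texts = ["great product", "terrible service", "loved it", "would not buy again"]
train_labels = [1, 0, 1, 0]
val_texts = ["really great service", "terrible, would not recommend"]
val_labels = [1, 0]

# Bag-of-words features + logistic regression: quick to build and train.
baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)

# A deep learning model should at least match this number; if it
# doesn't, debug before scaling up the architecture.
print("baseline accuracy:", accuracy_score(val_labels, baseline.predict(val_texts)))
```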

Human performance is another baseline, but it comes with a trade-off between data quality and ease of collection. Labels from random, non-expert annotators on Amazon Mechanical Turk are easy to collect but often noisy. Quality improves by ensembling multiple annotators, using agreement or majority vote rather than a single person’s answer. Domain experts raise the ceiling further, though they cost more time and effort. A similar ensemble approach can also be used among groups of experts.
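
To illustrate the annotator-ensembling idea, here is a minimal majority-vote sketch in plain Python; the annotators, items, and labels are hypothetical.

```python
from collections import Counter

# Hypothetical labels from three Mechanical Turk annotators;
# each row is one item, each column one annotator's answer.
annotations = [
    ["cat", "cat", "dog"],
    ["dog", "dog", "dog"],
    ["cat", "dog", "cat"],
    ["dog", "cat", "cat"],
]

def majority_vote(labels):
    """Return the most common label among annotators for one item."""
    return Counter(labels).most_common(1)[0][0]

# The aggregated labels are typically less noisy than any single
# annotator's answers, at the cost of collecting more labels per item.
consensus = [majority_vote(item) for item in annotations]
print(consensus)  # ['cat', 'dog', 'cat', 'cat']
```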

Choosing where to land on that quality-versus-cost curve comes down to what the team will need next. The guidance is to use the highest-quality baseline that still allows iteration—especially because model improvement typically requires collecting more labeled data later. One practical tactic is to reserve the time of the most expensive labelers for the hardest examples, concentrating skilled effort where it yields the most information.
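
One way to operationalize "hardest examples" is to rank unlabeled items by a cheap model's predictive uncertainty and route the top of that ranking to expert labelers. The sketch below assumes a fitted scikit-learn-style classifier exposing predict_proba; the entropy criterion is a common choice for hardness, not something the lecture prescribes.

```python
import numpy as np

def hardest_examples(model, unlabeled_X, budget):
    """Return indices of the `budget` most uncertain unlabeled items,
    to be routed to the most expensive (expert) labelers.

    Assumes `model` is a fitted scikit-learn-style classifier with
    predict_proba. Predictive entropy is used as the uncertainty
    measure; it is one common choice among several.
    """
    probs = model.predict_proba(unlabeled_X)                 # (n_samples, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[::-1][:budget]                # most uncertain first

# Usage sketch: expert_indices = hardest_examples(baseline, pool_X, budget=100)
```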

Finally, baselines support faster debugging and more disciplined decision-making. When the decision about whether a project is worth doing or feasible has already been made, teams can skip revisiting it; skipping baselines, however, is riskier because it removes the reference point needed to interpret results. The discussion also notes a broader performance strategy: if feasibility is uncertain, it’s often better to first prove the task can be solved at all with a larger model, then meet speed constraints later through methods like pruning and knowledge distillation.
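
The lecture mentions pruning and knowledge distillation only by name. As an illustration of the distillation half, here is a minimal sketch of the standard distillation objective in PyTorch; the temperature and mixing weight are conventional hyperparameters, not values from the source.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: match the large
    teacher's softened predictions while also fitting the hard labels.
    T (temperature) and alpha (mixing weight) are conventional
    hyperparameters, not values from the lecture.
    """
    # KL divergence between temperature-softened teacher and student
    # distributions; the T^2 factor keeps the gradient scale comparable.
    soft_teacher = F.log_softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                  log_target=True) * (T * T)
    # Ordinary cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```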

Cornell Notes

Baselines provide a lower bound on expected model performance, making it possible to judge whether results are truly good in absolute terms. The usefulness of a baseline depends on how tight that bound is: the same training/validation curves can imply very different gaps if human or other baselines differ. Teams can build baselines from external sources (requirements or published results), internal scripted/rule-based systems, simpler ML models (bag-of-words, linear regression), or human performance. Human baselines trade off quality and cost: random annotators are easy but noisy, while ensembling annotators and using domain experts improves quality. The recommended approach is to use the highest-quality baseline feasible, and to spend scarce expert labeling time on the hardest examples.

Why can two identical-looking error curves still lead to different improvement plans?

Because the interpretation depends on the baseline’s absolute level. If human performance is near 20% error, a model close to that curve suggests a small remaining gap; if human performance is near 5% error, the same model indicates a much larger shortfall. That changes which next step makes sense: data, architecture, or labeling quality.

What makes a baseline “useful” rather than just “present”?

A baseline is most useful when it is a tight lower bound on expected performance. The tighter the bound, the more accurately it anchors decisions about whether the model is close enough to the target to justify further work, and it improves debugging by providing a reference point for what “good” looks like.

What are practical sources for baselines, and how should comparisons be handled?

Teams can use external requirements, published results, internal scripted/rule-based baselines (including OpenCV-based pipelines), simpler ML models like bag-of-words or linear regression, and human performance. For published results, comparisons must be fair: if the team’s dataset is harder, expecting the same performance is misleading.

How is human performance baseline quality improved, and what are the trade-offs?

Labels from random annotators via Amazon Mechanical Turk are easy to collect but noisy. Quality improves by ensembling multiple annotators and using agreement/majority vote. Domain experts yield higher-quality baselines but are harder to recruit. Ensembling among experts can further improve reliability, but it increases cost and coordination effort.

How should teams decide how much effort to spend on baseline quality?

Use the highest-quality baseline that still supports iteration. If expert time is limited and the team will need to collect more labels later, spending all expert effort up front may be counterproductive. A recommended compromise is to concentrate expensive expert labeling on the hardest examples, where additional skill most improves the baseline and subsequent model training.

When is it reasonable to skip parts of baseline work under a deadline?

If feasibility or “worth doing” decisions are already made, teams can skip that earlier decision step. But skipping baselines entirely is discouraged because baselines make debugging more efficient and help interpret whether model changes are actually moving toward an acceptable absolute performance level.

Review Questions

  1. How does changing the baseline (e.g., human performance level) alter the interpretation of the same training/validation curves?
  2. List at least three ways to construct baselines and explain one fairness or quality pitfall for each.
  3. What strategy can reduce the cost of high-quality human baselines while still improving their usefulness?

Key Points

  1. Baselines provide a lower bound on expected performance, and tighter bounds make them more actionable.
  2. Absolute performance judgments require baselines; error curves alone can mislead when human or other reference levels differ.
  3. External baselines from published results must be compared fairly, especially when datasets differ in difficulty.
  4. Internal baselines can be built quickly using scripted pipelines (e.g., OpenCV), rule-based systems, or simpler models like bag-of-words and linear regression.
  5. Human baselines improve in quality through ensembling and by using domain experts, but each step increases cost and collection complexity.
  6. The best baseline is the highest-quality one that still allows iteration, often by spending expert labeling time on the hardest examples.
  7. Skipping baselines under tight deadlines can slow debugging and lead to misguided model choices, even if sophisticated architectures are tempting.

Highlights

Two models with the same curves can imply very different conclusions if the baseline (like human performance) sits at a different absolute level.
Baselines aren’t just for benchmarking—they speed up debugging by clarifying what “good” means in absolute terms.
Random human labels from Amazon Mechanical Turk are easy but noisy; ensembling and expert input raise baseline quality.
A practical cost-control tactic is to allocate expert time to the hardest examples rather than labeling everything at expert level.
A common strategy for uncertain feasibility is to first prove the task works with a larger model, then meet speed constraints via pruning/distillation.

Topics

  • Model Baselines
  • Human Performance
  • Debugging
  • Labeling Strategy
  • Model Compression