Managing (4) - ML Teams - Full Stack Deep Learning

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ML projects often show early performance gains that don’t persist, so week-to-week progress is a weak predictor of completion.

Briefing

Managing machine-learning teams is hard largely because progress is unpredictable: early gains often don’t translate into sustained improvement, and projects can stall or even regress for weeks. A competition example shows accuracy jumping from roughly 30% to 70% in the first week, yet over the following months performance barely improves. Participation patterns reinforce the lesson—teams keep joining rather than quitting after early success—suggesting that “we made big progress this week” is not a reliable predictor of “we’ll finish next week.” In practice, ML work frequently behaves nonlinearly: teams hit a plateau, try multiple approaches that fail to move the needle, then get stuck behind technical debt that forces refactors and can leave performance worse than before the detour.

A second management challenge comes from culture clashes inside ML organizations, where research and engineering often collaborate. Researchers tend to prioritize novelty—new model ideas, experiments, and polished write-ups—while engineers prioritize clean, scalable systems and maintainability. When these value sets don’t align, organizations fall into a familiar failure mode: researchers see engineering as trivial, engineers see research as impractical, and neither side fully respects the other’s constraints. The result is friction over what “good work” looks like and what timelines should assume.

Third, leadership frequently lacks ML fluency, making it difficult to set realistic expectations, interpret results, or understand why ML differs from traditional software delivery. That gap matters because ML planning and reporting require different mental models than standard engineering project management.

To manage better, ML project planning should be probabilistic rather than deterministic. Traditional waterfall planning assumes tasks have known durations and dependencies; ML planning must explicitly account for uncertainty in which ideas will work. Instead of committing to a single path, teams should run parallel hypotheses that all map to the same objective, acknowledging that some approaches may have only a 50% chance of success. As weeks pass, the plan should change based on what becomes impossible (e.g., data quality undermines an approach) or what fails to deliver expected performance (e.g., a model underperforms until a new architecture is tried).
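
To make that concrete, here is a minimal, illustrative sketch of the planning math behind parallel hypotheses (the probabilities are hypothetical, not from the lecture): three independent approaches that each have only a 50% chance of working still leave an 87.5% chance that at least one reaches the objective.

```python
# Illustrative only: hypothetical success probabilities for three
# parallel approaches that all target the same objective.
p_success = [0.5, 0.5, 0.5]

# If the approaches fail independently, the chance that all of them
# fail is the product of the individual failure probabilities.
p_all_fail = 1.0
for p in p_success:
    p_all_fail *= 1.0 - p

# Chance that at least one approach reaches the objective.
p_any_success = 1.0 - p_all_fail
print(f"P(at least one works) = {p_any_success:.3f}")  # 0.875
```

The independence assumption is optimistic in practice, since approaches often fail for shared reasons such as poor data quality, so the result is best read as an upper bound.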

A corollary is to avoid "path-critical" research, where a single fragile assumption blocks the objective. Good teams often run multiple hypotheses in parallel as a friendly competition of ideas, but they need a culture focused on reaching the right answer rather than on rewarding whoever guessed correctly. Progress measurement should reflect inputs and execution quality (what was attempted and how well it was executed) rather than only whether an individual idea ultimately worked.

Operationally, getting end-to-end ML pipelines into production quickly helps create a data flywheel and ties model behavior to business value, making progress easier to communicate credibly. Finally, status updates to leadership should be business-specific and cautious about forward-looking promises. Reporting that "cat detection accuracy went from 60% to 80%" without translating the change into business impact, or projecting that "next week we'll solve it," can mislead stakeholders because ML progress can stall or reverse. The broader takeaway: ML teams need probabilistic planning, an aligned research-engineering culture, production feedback loops, and leadership education to manage expectations and communicate progress responsibly.

Cornell Notes

ML team management is difficult because ML progress is nonlinear and uncertain: early performance jumps may not continue, and projects can stall or regress due to failed experiments or technical debt. Effective planning treats hypotheses probabilistically, running multiple approaches in parallel that all aim at the same objective while updating timelines as evidence shows what’s impossible. Strong teams avoid path-critical research by maintaining backup hypotheses and fostering a culture that rewards finding the right solution, not just being first. Progress measurement should emphasize inputs and execution quality rather than only whether a specific idea worked. Fast end-to-end pipelines in production create feedback loops that connect model changes to business value and make leadership communication more reliable.

Why is it risky to assume that a big week of accuracy gains means the project will finish soon?

Because ML improvement often concentrates early and then flattens. In the competition example, accuracy rose sharply in the first week (about 30% to 70%), but over the next three months it barely moved. The number of participating teams also stayed high rather than collapsing after early gains, implying many teams still faced uncertainty about what would work next. The practical takeaway is that week-to-week gains don’t reliably predict future progress in ML.

What does “probabilistic planning” look like for ML projects?

Instead of a fixed dependency timeline like in traditional engineering, ML plans incorporate uncertainty about which ideas will succeed. A team might plan for a target task (e.g., Task D) but pursue multiple candidate routes (Tasks A, B, C) with explicit success uncertainty (e.g., Task A only has a 50% chance). After each week, the plan is revised based on evidence—data may be worse than expected, an architecture may underperform, or new experiments may become necessary.
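
As a rough, hypothetical sketch (the task names and probabilities echo the A/B/C framing above but are not from the lecture), a probabilistic plan can be modeled as a portfolio of candidate routes toward the same objective that gets pruned each week as evidence arrives:

```python
import math
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    est_success_prob: float  # subjective prior, revised weekly
    status: str = "active"   # "active" or "failed"

# Hypothetical candidate routes that all aim at the same objective (Task D).
plan = [
    Hypothesis("A: new architecture", 0.5),
    Hypothesis("B: more labeled data", 0.7),
    Hypothesis("C: better features", 0.6),
]

def replan(plan, weekly_evidence):
    """Mark routes the week's evidence ruled out, then return what's left."""
    for h in plan:
        if weekly_evidence.get(h.name) == "ruled_out":
            h.status = "failed"
    active = [h for h in plan if h.status == "active"]
    if active:
        # Chance at least one remaining route works, assuming independence.
        p_any = 1.0 - math.prod(1.0 - h.est_success_prob for h in active)
        print(f"{len(active)} routes active, P(any succeeds) ~ {p_any:.2f}")
    else:
        print("All routes failed -- time to generate new hypotheses.")
    return active

# Example week: data quality turns out to undermine route B.
active = replan(plan, {"B: more labeled data": "ruled_out"})
```

The point is not the specific numbers but the shape of the process: the plan is a set of live options that shrinks or grows with evidence, and the outlook is recomputed each week rather than carried forward unchanged.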

What is meant by avoiding “path-critical research projects”?

It means the team should not rely on a single fragile research assumption to reach the objective. If meeting the goal depends on something uncertain, the team should maintain multiple hypotheses that could achieve the same outcome. Many strong teams run several ideas in parallel, treating them as a friendly competition of approaches that all map to the same end goal.

How should progress be measured for individuals on an ML team?

Measure inputs and execution quality rather than only whether an idea ultimately worked. For example, evaluate what a person set out to try, whether they executed the experiment well, and how effectively they learned from results. Over long periods, production outcomes matter, but on shorter horizons tying performance solely to success can distort behavior and discourage honest experimentation.

Why do end-to-end pipelines in production matter for management?

They increase the odds of success by enabling a data flywheel: predictions can be compared against real-world outcomes and business value, and the model updated accordingly. They also make progress easier to communicate to leadership because results reflect operational impact, not just lab metrics.

What’s wrong with typical ML status updates like “accuracy went up, so next week we’ll solve it”?

They often lack business specificity and make dangerous forward-looking projections. A change from 60% to 80% accuracy may sound impressive but doesn’t tell leadership what that means for the business. Predicting that the problem will be solved next week based on one week of results can backfire because ML projects can stall, and stakeholders may hold the team to those projections.

Review Questions

  1. How would you redesign a deterministic ML project plan into a probabilistic one, and what evidence would trigger replanning?
  2. What incentives and metrics would you use to prevent a research-engineering culture from drifting into “research vs engineering” blame?
  3. Give an example of a leadership-facing ML status update that is both business-specific and avoids unreliable forward-looking claims.

Key Points

  1. ML projects often show early performance gains that don’t persist, so week-to-week progress is a weak predictor of completion.
  2. Nonlinear behavior is common in ML: teams can plateau for weeks or lose ground due to technical debt and refactoring.
  3. Research and engineering frequently operate with different value systems; aligning incentives and respect is essential for collaboration.
  4. ML planning should be probabilistic, explicitly accounting for uncertainty and revising timelines as experiments succeed or fail.
  5. Avoid path-critical research by maintaining multiple hypotheses that can reach the same objective, often through parallel exploration.
  6. Measure individual progress by inputs and execution quality, not only by whether a specific idea worked.
  7. Communicate ML progress to leadership in business terms and avoid confident forward-looking promises based on short-term metric changes.

Highlights

  • A competition accuracy jump from ~30% to ~70% in one week didn’t translate into sustained improvement over months, illustrating why early gains can mislead planning.
  • ML progress can stall entirely for weeks, and technical debt can force refactors that temporarily reduce performance.
  • Probabilistic planning replaces fixed timelines with uncertainty-aware hypotheses and replanning as evidence accumulates.
  • Fast end-to-end pipelines in production create a data flywheel and make progress easier to justify to leadership.
  • Status updates framed as raw accuracy changes plus “we’ll solve it next week” are risky because ML outcomes can stall or reverse.

Topics

  • ML Team Management
  • Probabilistic Planning
  • Research vs Engineering
  • Production ML
  • Leadership Communication

Mentioned

  • Lukas Biewald