Orgs (3) - ML Teams - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Companies evolve through multiple ML organizational structures, and there’s no single consensus “right” design yet.
Briefing
Machine-learning organizations don’t have a single “correct” structure yet, but companies tend to evolve through a recognizable ladder: from ad hoc experimentation, to ML embedded in product teams, to a centralized ML function, and finally to an “ML-first” model where AI expertise and infrastructure permeate the whole company. The practical payoff of moving up the ladder is better access to data and talent, faster iteration, and smoother deployment—but each step also introduces new bottlenecks around resources, ownership, and operational handoffs.
At the base of the “ML organization mountain” sit companies where machine learning is mostly informal—sometimes a few enthusiasts, but no dedicated function. The upside is that teams can often find “low-hanging fruit” by spotting business problems that simple predictive models can improve quickly. The downside is structural: limited infrastructure, limited compute, and weak leadership buy-in. Those gaps matter because ML projects often run on different timelines than typical software work, and it can be hard to attract ML talent without a real support system.
A common next move is to place ML inside R&D. In this setup, researchers—often PhD-heavy—work with business data and produce models and sometimes papers, while enjoying more freedom from near-term product deadlines. This can appeal to researchers who prefer building models over worrying about what happens after deployment. But this arrangement often stalls: R&D groups struggle to obtain data from business units, and without visible business wins, investment stays small.
Many organizations then embed ML engineers directly into product or business teams. This arrangement creates a clear line from ML work to customer-facing improvements and enables rapid feedback loops: prototype, test, iterate. It also tends to unlock more funding because results can be tied to product metrics. Yet dispersing ML across the organization makes it harder to build “ML as a function”—including hiring and developing top ML talent who want to collaborate with other ML specialists. It also strains resources like data and compute, and engineering leaders may push back when ML delivery doesn’t match engineering’s expectations.
To address those issues, some companies build an independent ML function. Centralization increases talent density, supports tooling and deployment practices, and—because the group often reports to senior leadership—can help break down data-access barriers. The tradeoff is handoff friction: centralized teams must deliver models to business users who may not have the expertise to know when models apply, how to monitor them, or how to operate them responsibly.
The end state is an ML-first organization, where leadership is fully committed, a centralized ML division tackles the hardest problems, and every business unit has ML capability built in. Google and Facebook are cited as examples, along with ML-focused startups. This structure aims to combine the best of both worlds—data access plus deployment and talent development—but it is difficult to implement, expensive to staff, and demands a hard cultural shift.
The transcript also highlights key design choices that determine how teams work: whether ML teams prioritize software engineering versus research, how much control they have over data ownership and pipelines, and whether they deploy and maintain models or hand them off. Embedded teams often prioritize production code and work closely with data engineers; R&D teams often have less data control and focus on research; ML-first organizations can support shared understanding between research and engineering and may take responsibility for company-wide data infrastructure and model operation. The discussion closes by weighing career fit: people motivated by improving products tend to thrive embedded in business teams, while those driven by state-of-the-art modeling and large-scale tooling often prefer centralized ML roles. Bias and fairness are also flagged as an increasingly important responsibility as ML systems become more mature.
Cornell Notes
Companies typically climb an “ML organization mountain” rather than adopting one universal structure. Early stages rely on ad hoc ML with limited support, then shift to R&D-based research, then to ML embedded in product teams for faster business impact. A centralized ML function increases talent density and investment in tooling, but it often struggles with model handoffs to business users who may not know how to operate models. The final “ML-first” state combines centralized expertise with ML capability across business units, improving data access and deployment—yet it’s culturally and operationally hard to achieve. Key structural decisions—software vs research focus, data ownership, and model deployment/maintenance—largely determine which model works best for a given organization.
- Why do companies often struggle at the “ad hoc ML” stage even when they find low-hanging fruit?
- What’s the main failure mode of putting ML primarily in R&D?
- What are the tradeoffs of embedding ML engineers inside product or business teams?
- How does a centralized ML function improve outcomes, and what new problem does it create?
- What structural choices determine whether an ML team can succeed across these organizational stages?
- How does “ML-first” differ from earlier centralized or embedded models?
Review Questions
- Which organizational stage best matches a company that needs rapid product iteration, and what resource or talent risks come with that stage?
- How do data ownership and model ownership choices affect whether ML teams can reliably deploy and maintain models?
- Why might centralized ML teams face difficulties even when they have strong talent density and senior leadership access to data?
Key Points
1. Companies evolve through multiple ML organizational structures, and there’s no single consensus “right” design yet.
2. Ad hoc ML can find quick wins, but limited compute/infrastructure and weak buy-in often block scaling.
3. R&D-based ML can attract experienced researchers and support longer-term work, but it often fails when business units won’t share data or don’t see its value.
4. Embedding ML in product teams speeds feedback and ties work to business outcomes, but it can weaken ML as a function and create delivery-cycle conflicts.
5. Centralized ML functions improve talent density and tooling investment, yet they introduce handoff problems when business users can’t operate models.
6. ML-first organizations aim to combine centralized expertise with ML capability across business units, but they require major cultural and staffing shifts.
7. Team design choices—software vs research focus, data ownership, and model deployment/maintenance responsibility—largely determine how well each structure works.