Orgs (3) - ML Teams - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Companies evolve through multiple ML organizational structures, and there’s no single consensus “right” design yet.
Briefing
Machine-learning organizations don’t have a single “correct” structure yet, but companies tend to evolve through a recognizable ladder: from ad hoc experimentation, to ML embedded in product teams, to a centralized ML function, and finally to an “ML-first” model where AI expertise and infrastructure permeate the whole company. The practical payoff of moving up the ladder is better access to data and talent, faster iteration, and smoother deployment—but each step also introduces new bottlenecks around resources, ownership, and operational handoffs.
At the base of the “ML organization mountain” sit companies where machine learning is mostly informal—sometimes a few enthusiasts, but no dedicated function. The upside is that teams can often find “low-hanging fruit” by spotting business problems that simple predictive models can improve quickly. The downside is structural: limited infrastructure, limited compute, and weak leadership buy-in. Those gaps matter because ML projects often run on different timelines than typical software work, and it can be hard to attract ML talent without a real support system.
A common next move is to place ML inside R&D. In this setup, researchers—often PhD-heavy—work with business data and produce models and sometimes papers, while enjoying more freedom from near-term product deadlines. This can appeal to researchers who prefer building models over worrying about what happens after deployment. But this arrangement often stalls: R&D groups struggle to obtain data from business units, and without visible business wins, investment stays small.
Many organizations then embed ML engineers directly into product or business teams. This arrangement creates a clear line from ML work to customer-facing improvements and enables rapid feedback loops: prototype, test, iterate. It also tends to unlock more funding because results can be tied to product metrics. Yet dispersing ML across the organization makes it harder to build “ML as a function”—including hiring and developing top ML talent who want to collaborate with other ML specialists. It also strains resources like data and compute, and engineering leaders may push back when ML delivery doesn’t match engineering’s expectations.
To address those issues, some companies build an independent ML function. Centralization increases talent density, supports tooling and deployment practices, and—because the group often reports to senior leadership—can help break down data-access barriers. The tradeoff is handoff friction: centralized teams must deliver models to business users who may not have the expertise to know when models apply, how to monitor them, or how to operate them responsibly.
The end state is an ML-first organization, where leadership is fully committed, a centralized ML division tackles the hardest problems, and every business unit has ML capability built in. Google and Facebook are cited as examples, along with ML-focused startups. This structure aims to combine the best of both worlds—data access plus deployment and talent development—but it is difficult to implement, expensive to staff, and demands a hard cultural shift.
The transcript also highlights key design choices that determine how teams work: whether ML teams prioritize software engineering versus research, how much control they have over data ownership and pipelines, and whether they deploy and maintain models or hand them off. Embedded teams often prioritize production code and work closely with data engineers; R&D teams often have less data control and focus on research; ML-first organizations can support shared understanding between research and engineering and may take responsibility for company-wide data infrastructure and model operation. The discussion closes by weighing career fit: people motivated by improving products tend to thrive embedded in business teams, while those driven by state-of-the-art modeling and large-scale tooling often prefer centralized ML roles. Bias and fairness are also flagged as an increasingly important responsibility as ML systems become more mature.
Cornell Notes
Companies typically climb an “ML organization mountain” rather than adopting one universal structure. Early stages rely on ad hoc ML with limited support, then shift to R&D-based research, then to ML embedded in product teams for faster business impact. A centralized ML function increases talent density and investment in tooling, but it often struggles with model handoffs to business users who may not know how to operate models. The final “ML-first” state combines centralized expertise with ML capability across business units, improving data access and deployment—yet it’s culturally and operationally hard to achieve. Key structural decisions—software vs research focus, data ownership, and model deployment/maintenance—largely determine which model works best for a given organization.
- Why do companies often struggle at the “ad hoc ML” stage even when they find low-hanging fruit?
- What’s the main failure mode of putting ML primarily in R&D?
- What are the tradeoffs of embedding ML engineers inside product or business teams?
- How does a centralized ML function improve outcomes, and what new problem does it create?
- What structural choices determine whether an ML team can succeed across these organizational stages?
- How does “ML-first” differ from earlier centralized or embedded models?
Review Questions
- Which organizational stage best matches a company that needs rapid product iteration, and what resource or talent risks come with that stage?
- How do data ownership and model ownership choices affect whether ML teams can reliably deploy and maintain models?
- Why might centralized ML teams face difficulties even when they have strong talent density and senior leadership access to data?
Key Points
1. Companies evolve through multiple ML organizational structures, and there’s no single consensus “right” design yet.
2. Ad hoc ML can find quick wins, but limited compute/infrastructure and weak buy-in often block scaling.
3. R&D-based ML can attract experienced researchers and support longer-term work, but it often fails when business units won’t share data or don’t see its value.
4. Embedding ML in product teams speeds feedback and ties work to business outcomes, but it can weaken ML as a function and create delivery-cycle conflicts.
5. Centralized ML functions improve talent density and tooling investment, yet they introduce handoff problems when business users can’t operate models.
6. ML-first organizations aim to combine centralized expertise with ML capability across business units, but they require major cultural and staffing shifts.
7. Team design choices—software vs research focus, data ownership, and model deployment/maintenance responsibility—largely determine how well each structure works.