Lukas Biewald on Founding Weights & Biases and Figure Eight (Full Stack Deep Learning - March 2019)
Based on The Full Stack's video on YouTube.
Briefing
Deep learning's real bottleneck isn't model architecture; it's the messy, high-stakes work of turning training into reliable production systems. Lukas Biewald, founder of Weights & Biases and previously founder of Figure Eight (formerly CrowdFlower), argues that most companies fail not because they can't build neural nets, but because they treat machine learning like conventional software engineering: you can't "step through" an ML system, small changes in training data can produce unpredictable shifts in behavior, and teams often lack the provenance needed to trust what a model learned.
Biewald frames the shift from research to deployment as a provenance problem. For safety-critical uses—self-driving, medical imaging, or anything where a wrong output has consequences—code review isn’t enough. Teams need to know how a model was trained, what data it used, and what that data looked like. He points to the difficulty of predicting performance curves: early gains on a dataset can flatten, leaving teams stuck at “good but unusable” accuracy. His example comes from a Kaggle-style cattle competition he ran, where accuracy jumped quickly at first and then stalled, illustrating how hard it is to extrapolate toward a “95%+” finish line.
Despite the cautionary tone, he insists there’s plenty people can do right now. He highlights everyday successes of deep learning: real-time image and speech recognition that didn’t work reliably in the past, brand use cases like detecting products in social media images, and high-impact deployments ranging from skin cancer classification to satellite imagery and counterfeit detection. He also cites industrial automation examples—robots checking shelf stocking, and precision agriculture systems that target individual weeds instead of blanket pesticide spraying—arguing that “adding smarts” can improve both efficiency and environmental outcomes.
A recurring theme is that ML performance tracks data availability more than cleverness. He connects this to historical patterns: breakthroughs often arrive soon after large, relevant datasets become available, and scaling training data can keep improving results even when algorithms are already strong. He also emphasizes data hygiene: mislabeled or surprising outliers can dominate error because they carry high residuals, so cleaning and correctly labeling the hardest cases often beats chasing fancier methods.
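The "clean the hardest cases first" advice can be operationalized by ranking examples by their per-example loss. A minimal sketch, assuming a binary classifier that outputs probabilities; the data, function names, and loss choice here are illustrative, not from the talk:

```python
# Hypothetical sketch: rank dataset examples by per-example loss (residual)
# so the worst-fit cases -- often mislabels or odd outliers -- surface first.
import math

def nll(p, y):
    # Negative log-likelihood of the true binary label y (0 or 1)
    # under the model's predicted probability p of class 1.
    p_true = p if y == 1 else 1.0 - p
    return -math.log(max(p_true, 1e-12))  # clamp to avoid log(0)

def rank_by_residual(probs, labels):
    losses = [nll(p, y) for p, y in zip(probs, labels)]
    # Indices sorted from highest loss (most suspicious) to lowest.
    return sorted(range(len(losses)), key=losses.__getitem__, reverse=True)

probs  = [0.92, 0.81, 0.97, 0.08]   # model's P(class 1) per example
labels = [1,    1,    1,    1]      # the last label disagrees sharply
suspects = rank_by_residual(probs, labels)
```

Reviewing the top of `suspects` by hand is exactly the kind of workflow where a mislabeled example (here, index 3) carries a residual large enough to dominate the total error.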
Biewald also warns that models generalize poorly when deployment differs from training. His own robot experiments showed that a model that looked strong on ImageNet could fail in the real world, with known issues like camera framing differences. He uses examples of adversarial vulnerability—tiny perturbations that can fool vision systems—and notes that the same weakness becomes dangerous in domains like autonomous driving, where small errors can be catastrophic.
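The "tiny perturbations" point can be made concrete with a fast-gradient-sign (FGSM-style) attack on a toy model. This is a minimal sketch, not the talk's example; the weights, inputs, and epsilon are made-up values:

```python
# Illustrative FGSM-style perturbation on a tiny logistic classifier:
# nudge each input coordinate by eps in the direction that increases the loss.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if sigmoid(z) >= 0.5 else 0

def fgsm(w, x, y, eps):
    # For logistic loss, dL/dx = (sigmoid(w.x) - y) * w, so the sign of
    # each gradient component tells us which way to push that coordinate.
    z = sum(wi * xi for wi, xi in zip(w, x))
    g = sigmoid(z) - y
    return [xi + eps * (1 if g * wi > 0 else -1) for xi, wi in zip(x, w)]

w = [2.0, -1.0]     # toy model weights
x = [0.3, 0.4]      # correctly classified input with true label 1
x_adv = fgsm(w, x, y=1, eps=0.15)
```

A perturbation of only 0.15 per coordinate flips the prediction from 1 to 0, which is the shape of the risk Biewald flags for high-stakes vision systems.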
Finally, he offers a practical playbook for shipping: pick training data carefully, get an end-to-end system working early, improve iteratively, and then confront failure cases systematically. He recommends inspecting the model’s highest-residual examples (often mislabels or unexpected edge cases) and using human-in-the-loop workflows where confidence drives whether humans review outputs. The goal is not just accuracy on paper, but a feedback loop that makes the system safer and better over time.
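The human-in-the-loop step above amounts to confidence-gated routing. A minimal sketch, where the 0.9 threshold and the function/label names are assumptions for illustration:

```python
# Hypothetical confidence-gated routing: auto-accept confident predictions,
# queue the rest for human review. Threshold is an illustrative assumption.
def route(label, confidence, threshold=0.9):
    if confidence >= threshold:
        return ("auto_accept", label)
    return ("human_review", label)

decisions = [route(lbl, conf) for lbl, conf in
             [("cat", 0.97), ("dog", 0.62), ("cat", 0.91)]]
```

The reviewed cases then feed back into the training set, closing the loop that makes the system safer over time.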
Cornell Notes
Lukas Biewald argues that deep learning succeeds in the real world only when teams treat ML as a production discipline, not just a modeling exercise. Most failures come from missing provenance (how the model was trained, with what data) and from unpredictable behavior when training data changes—unlike typical software where code diffs are easier to reason about. He stresses that performance often scales with relevant training data, and that data cleaning and correct labeling of hard outliers can matter more than algorithm tweaks. He also highlights generalization gaps between benchmarks (like ImageNet) and deployment, plus adversarial and safety risks. The practical takeaway: ship end-to-end early, then iteratively diagnose failure cases using residuals and human-in-the-loop confidence workflows.
Why does Biewald say ML production is fundamentally different from software engineering?
What does the “accuracy curve flattening” example teach about forecasting ML progress?
How does Biewald connect real-world ML success to data availability and data quality?
Why can a model that performs well on ImageNet still fail on a robot?
What safety and security risks does he highlight for deployed deep learning systems?
What concrete workflow does he recommend for shipping and improving deep learning systems?
Review Questions
- What does “provenance” mean in the context of ML deployment, and why can’t code inspection alone provide it?
- How do residuals and outliers help diagnose why an ML system is failing, and what kinds of issues do they often reveal?
- Why does scaling training data sometimes outperform algorithmic sophistication, according to Biewald’s examples?
Key Points
1. Machine learning deployment requires provenance—knowing how a model was trained, with what data, and what that data contained—because code review alone can’t ensure reliability.
2. ML behavior can change unpredictably when training data changes, so ML engineering needs stronger versioning, testing, and workflow discipline than many teams apply.
3. Performance forecasting is unreliable: early accuracy gains can flatten, leaving teams stuck below thresholds needed for real use.
4. Generalization breaks when deployment inputs differ from benchmark conditions; models trained on curated datasets (like ImageNet) may fail under real camera framing and environment shifts.
5. Data quality often beats model tweaks: mislabeled or surprising outliers can dominate error, so cleaning and correctly labeling hard cases can deliver major gains.
6. Adversarial vulnerability is a practical safety concern; small perturbations can flip predictions in ways that matter for high-stakes systems.
7. A workable shipping strategy is end-to-end early, iterative improvement, and systematic failure handling using residual inspection and human-in-the-loop confidence workflows.