From 1T Tokens to ZB Scale: How to Move Past the Internet and Scale LLMs
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Scaling beyond ~1T tokens hinges on improving *useful* training data quality, not just increasing token counts.
Briefing
The central claim is that today’s “trillion-token” training regimes hit a ceiling—not because data is impossible to obtain, but because models need *useful* data, not raw volume. Moving from ~1T tokens toward 10T and beyond requires a scaling framework focused on multilingual coverage, targeted expansion of underrepresented text, and stronger automated curation—so the training set better reflects how people actually communicate worldwide.
The proposed path to 10T tokens starts with multilingual training data that shifts away from an English-dominant corpus. Instead of 80–90% English, the target is closer to 30–40% English, with substantial representation from languages such as Chinese, Hindi, French, and Spanish. That shift is paired with “focused curated expansion” of historically underrepresented sources, plus tooling upgrades: automated cleaning, deduplication pipelines that work across diverse languages and formats, and more flexible tokenizers that can adapt to morphological differences. The payoff is practical: models should feel more natural in languages beyond English, because current systems often underperform in languages like Indonesian when their training data is skewed toward English.
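As a back-of-envelope illustration, the 30–40% English target can be treated as a resampling problem. The sketch below uses hypothetical token counts and a simplifying assumption that English is down-sampled while other languages are kept whole; real pipelines would also up-sample low-resource languages, which the transcript does not detail.

```python
# Sketch: rebalance a corpus so English ends at a target token share.
# Token counts are hypothetical, for illustration only.

def rebalance(token_counts, target_en_share=0.35):
    """Return per-language sampling multipliers (fraction of tokens kept).

    Keeps non-English languages at full volume and down-samples English
    so that: en_kept / (en_kept + non_en) == target_en_share.
    """
    non_en = sum(v for k, v in token_counts.items() if k != "en")
    # Solve for the English tokens to keep:
    en_kept = target_en_share * non_en / (1.0 - target_en_share)
    mult = {k: 1.0 for k in token_counts if k != "en"}
    mult["en"] = min(1.0, en_kept / token_counts["en"])
    return mult

# Hypothetical English-heavy corpus (tokens per language):
counts = {"en": 850e9, "zh": 60e9, "hi": 20e9, "fr": 40e9, "es": 30e9}
m = rebalance(counts)
```

Under these made-up counts, English must be cut to roughly a tenth of its raw volume to hit a 35% share, which is why the transcript pairs the mix shift with aggressive expansion of non-English sources rather than shrinking the corpus.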
To go from 10T to 100T tokens, the data strategy broadens from curated text snapshots into continuous ingestion. The blueprint includes real-time web streams, social media feeds, large-scale transcription of podcasts and phone calls (with permission), and multimodal inputs—especially vision tokens. The jump also implies a major infrastructure step-change: immense compute, storage, and network capacity, plus transformer architectures that can unify text, images, and video within a single scalable framework. If models can be trained on continuous feeds at that scale, the door opens to continuously updated models that stay current on world events.
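For continuous feeds, the deduplication tooling has to run as a stream rather than a batch job. A minimal sketch of an exact-duplicate filter over a live feed follows, hashing normalized text; this is an illustrative stand-in, since production systems typically use approximate methods (e.g., MinHash) that the transcript does not specify.

```python
# Sketch: streaming exact-duplicate filter for continuous ingestion.
import hashlib

def normalize(text):
    # Casefold and collapse whitespace so trivial variants hash identically.
    return " ".join(text.casefold().split())

class StreamDeduper:
    """Admit each normalized document at most once."""
    def __init__(self):
        self._seen = set()

    def admit(self, doc):
        h = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if h in self._seen:
            return False  # duplicate: drop from the training feed
        self._seen.add(h)
        return True

dedupe = StreamDeduper()
feed = ["Breaking news: X.", "breaking   NEWS: x.", "Other story."]
kept = [d for d in feed if dedupe.admit(d)]  # second item is dropped
```

The unbounded `set` is the obvious scaling limit; at 100T-token scale this state would live in a sharded store or a probabilistic structure such as a Bloom filter.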
Beyond 100T tokens, the transcript sketches a further ladder: quadrillion-scale training would pull in sensor logs tied to robotics and embodied experience—camera streams, tactile sensors, motor commands, and Internet of Things data—compressed into semantically meaningful tokens. That requires systems that learn from interactions and unlabeled sensor streams, plus much larger GPU/TPU clusters. The farthest stage, zettabyte-scale (10^20 tokens), shifts toward city-scale and industry-wide data: 3D scans, full audio streams, simulation logs, and standardized real-time sensor data across sectors like healthcare, manufacturing, agriculture, and autonomous vehicles.
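The phrase “compressed into semantically meaningful tokens” can be illustrated with a toy uniform quantizer that maps a continuous sensor reading onto a small discrete vocabulary. This is an assumption for illustration only; the transcript names no method, and real systems would use learned codecs rather than fixed bins.

```python
# Sketch: quantize a continuous sensor reading into discrete token ids.

def tokenize_reading(value, lo=-1.0, hi=1.0, vocab=256):
    """Map a float in [lo, hi] to one of `vocab` token ids (uniform bins)."""
    clamped = max(lo, min(hi, value))      # out-of-range values saturate
    frac = (clamped - lo) / (hi - lo)      # position in [0, 1]
    return min(vocab - 1, int(frac * vocab))

# A toy stream of normalized sensor values becomes a token sequence:
stream = [-1.2, -0.5, 0.0, 0.5, 0.99]
tokens = [tokenize_reading(v) for v in stream]  # -> [0, 64, 128, 192, 254]
```

The design point the transcript gestures at is exactly this interface: once camera, tactile, and motor data are expressed as token sequences, the same transformer machinery used for text can consume them.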
Yet the argument doesn’t treat zettabyte scale as a prerequisite for intelligence. The transcript suggests that meaningful progress toward general intelligence may come sooner through “reasoning scale,” where test-time compute can boost capability without touching training-data volume. It also notes a practical reality: current models are already “good enough” for many enterprise tasks (e.g., product requirements documents), even if they fall short for general-purpose work.
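One common form of “reasoning scale” is best-of-N sampling: spend more test-time compute by drawing several candidate answers and keeping the one a scorer prefers. The sketch below uses a hypothetical `generate` stand-in (a seeded random quality score) in place of a real LLM-plus-verifier loop, which the transcript does not describe in detail.

```python
# Sketch: best-of-N sampling as a stand-in for test-time "reasoning scale".
import random

def generate(prompt, rng):
    # Hypothetical stand-in for an LLM sample: returns a quality score.
    return rng.random()

def best_of_n(prompt, n, seed=0):
    """Draw n candidates and keep the best-scoring one."""
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates)

q1 = best_of_n("draft a PRD", 1)   # baseline: a single sample
q8 = best_of_n("draft a PRD", 8)   # 8x the test-time compute
```

Under this toy scorer, `q8 >= q1` by construction: extra samples can only raise the best score found. That monotonic relationship is the intuition behind trading inference compute for capability without adding training data.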
Finally, the discussion frames the race as bound by economics and operations. Scaling data and compute is expensive, so model makers may prefer to “rent” data-center capacity (a theme tied to Azure) rather than build everything in-house. Meanwhile, distillation—turning large models into smaller, cheaper ones—lets value propagate globally, but it also means competitive advantage depends on maintaining an edge through continual training and data scaling. The transcript ends by linking later-stage scaling to hardware ecosystems, with a strong suggestion that Nvidia’s GPU and robotics push is well-positioned for the sensor-heavy future.
In short: the next leap isn’t just “more tokens,” but a structured move toward multilingual, multimodal, continuously ingested, sensor-enriched data—tempered by the possibility that reasoning compute can accelerate capability without waiting for zettabyte-scale training.
Cornell Notes
The transcript argues that scaling LLMs past today’s ~1T-token training regimes depends on *useful* data, not just more data. The path to 10T tokens emphasizes multilingual coverage (shifting from English-heavy corpora toward a more global language mix), curated expansion of underrepresented sources, and improved automation for cleaning, deduplication, and tokenization across languages. Reaching 100T tokens requires continuous ingestion—real-time web, social streams, permissioned transcription, and multimodal vision tokens—plus transformer architectures that unify text and images at scale. For quadrillion and zettabyte regimes, sensor logs, robotics, and city/industry-scale standardized data become central, though reasoning scale may deliver major gains earlier without waiting for extreme token volumes.
Why does “useful data” matter more than raw token counts when scaling LLMs?
What concrete changes are proposed to move from ~1T tokens to ~10T tokens?
How does the strategy shift from 10T tokens to 100T tokens?
What data sources become central at quadrillion and zettabyte token scales?
Why might extreme token scale (zettabytes) not be necessary for general intelligence?
Review Questions
- Which specific data-management upgrades (cleaning, deduplication, tokenization) are presented as prerequisites for multilingual scaling to 10T tokens?
- What multimodal and continuous-ingestion sources are listed as key ingredients for reaching 100T tokens?
- How does “reasoning scale” function as an alternative path to capability growth compared with waiting for zettabyte-scale training data?
Key Points
1. Scaling beyond ~1T tokens hinges on improving *useful* training data quality, not just increasing token counts.
2. Moving toward ~10T tokens requires a multilingual shift—roughly 30–40% English rather than 80–90%—plus curated expansion of underrepresented texts.
3. Automated cleaning, cross-language deduplication pipelines, and more flexible tokenizers are treated as essential for multilingual data at scale.
4. Reaching ~100T tokens depends on continuous ingestion (web, social, permissioned transcription) and multimodal vision tokens, alongside transformer architectures that unify modalities.
5. Quadrillion-scale training is framed around sensor logs and robotics, compressing raw visual/audio streams into semantically meaningful tokens.
6. Zettabyte-scale training implies city- and industry-wide standardized sensor data and 3D/real-time streams, but reasoning compute may deliver major gains earlier than waiting for that ladder.
7. Competitive advantage is portrayed as economically constrained: data-center capacity and ongoing scaling efforts create an arms-race dynamic, while distillation spreads value into smaller models.