
From 1T Tokens to ZB Scale—How to move past the internet and scale LLMs

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Scaling beyond ~1T tokens hinges on improving *useful* training data quality, not just increasing token counts.

Briefing

The central claim is that today’s “trillion-token” training regimes hit a ceiling—not because data is impossible to obtain, but because models need *useful* data, not raw volume. Moving from ~1T tokens toward 10T and beyond requires a scaling framework focused on multilingual coverage, targeted expansion of underrepresented text, and stronger automated curation—so the training set better reflects how people actually communicate worldwide.

The proposed path to 10T tokens starts with multilingual training data that shifts away from an English-dominant corpus. Instead of 80–90% English, the target is closer to 30–40% English, with substantial representation from languages such as Chinese, Hindi, French, and Spanish. That shift is paired with “focused curated expansion” of historically underrepresented sources, plus tooling upgrades: automated cleaning, deduplication pipelines that work across diverse languages and formats, and more flexible tokenizers that can adapt to morphological differences. The payoff is practical: models should feel more natural in languages beyond English, because current systems often underperform in languages like Indonesian when their training data is skewed toward English.
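As a concrete illustration of what that language rebalancing implies (a minimal sketch, not the transcript's method; the corpus shares and target mix below are illustrative assumptions), one can compute per-language sampling weights so an English-heavy corpus is drawn at a more global mix:

```python
# Minimal sketch (illustrative numbers): rebalance per-language sampling
# weights so a ~90%-English corpus is sampled at a ~35%-English target mix.

corpus_tokens = {          # tokens available per language (assumed)
    "en": 900e9, "zh": 40e9, "hi": 10e9, "fr": 25e9, "es": 25e9,
}
target_mix = {             # desired share of the training mix (assumed)
    "en": 0.35, "zh": 0.25, "hi": 0.15, "fr": 0.125, "es": 0.125,
}

def sampling_weights(corpus, target):
    """Weight each language so samples are drawn at the target mix,
    regardless of how skewed the raw corpus is."""
    total = sum(corpus.values())
    weights = {}
    for lang, tokens in corpus.items():
        natural_share = tokens / total      # share if sampled uniformly
        weights[lang] = target[lang] / natural_share
    return weights

if __name__ == "__main__":
    for lang, w in sampling_weights(corpus_tokens, target_mix).items():
        print(f"{lang}: upsample/downsample factor {w:.2f}")
```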

To go from 10T to 100T tokens, the data strategy broadens from curated text snapshots into continuous ingestion. The blueprint includes real-time web streams, social media feeds, large-scale transcription of podcasts and phone calls (with permission), and multimodal inputs—especially vision tokens. The jump also implies a major infrastructure step-change: immense compute, storage, and network capacity, plus transformer architectures that can unify text, images, and video within a single scalable framework. If models can be trained on continuous feeds at that scale, the door opens to continuously updated models that stay current on world events.
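The “vision tokens” mentioned here are typically produced by slicing images into fixed-size patches before they reach the transformer; the following is a minimal ViT-style patchification sketch (an assumption about the mechanism, not something the transcript specifies):

```python
# Minimal sketch (assumed mechanism): turning an image into "vision tokens"
# by slicing it into fixed-size patches and flattening each patch into a
# vector, the way ViT-style models feed images to a transformer alongside text.
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """image: (H, W, C) array -> (num_patches, patch*patch*C) token matrix."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    tokens = (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch * patch * c)
    )
    return tokens  # each row is one vision token's raw features

if __name__ == "__main__":
    img = np.random.rand(224, 224, 3)
    print(patchify(img).shape)  # (196, 768): 14x14 patches of 16x16x3 values
```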

Beyond 100T tokens, the transcript sketches a further ladder: quadrillion-scale training would pull in sensor logs tied to robotics and embodied experience—camera streams, tactile sensors, motor commands, and Internet of Things data—compressed into semantically meaningful tokens. That requires systems that learn from interactions and unlabeled sensor streams, plus much larger GPU/TPU clusters. The farthest stage, zettabyte-scale (10^20 tokens), shifts toward city-scale and industry-wide data: 3D scans, full audio streams, simulation logs, and standardized real-time sensor data across sectors like healthcare, manufacturing, agriculture, and autonomous vehicles.
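To make “compressed into semantically meaningful tokens” concrete, here is a toy sketch of quantizing a continuous sensor stream into a discrete vocabulary; real systems would likely use learned codebooks, so treat this as an illustrative stand-in rather than the transcript's approach:

```python
# Toy sketch (assumption): compressing a continuous sensor stream into
# discrete tokens by uniform binning. Production pipelines would use learned
# codebooks (e.g. vector quantization), but the idea is the same: map
# high-rate raw readings onto a small, reusable token vocabulary.
import numpy as np

def tokenize_sensor(stream: np.ndarray, low: float, high: float, vocab: int = 256) -> np.ndarray:
    """Map each reading in [low, high] to an integer token id in [0, vocab)."""
    clipped = np.clip(stream, low, high)
    ids = ((clipped - low) / (high - low) * (vocab - 1)).round().astype(int)
    return ids

if __name__ == "__main__":
    readings = np.sin(np.linspace(0, 10, 1000)) * 5.0   # fake tactile signal
    tokens = tokenize_sensor(readings, low=-5.0, high=5.0)
    print(tokens[:10], "distinct tokens used:", len(np.unique(tokens)))
```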

Yet the argument doesn’t treat zettabyte scale as a prerequisite for intelligence. The transcript suggests that meaningful progress toward general intelligence may come sooner through “reasoning scale,” where test-time compute can boost capability without touching training-data volume. It also notes a practical reality: current models are already “good enough” for many enterprise tasks (e.g., product requirements documents), even if they fall short for general-purpose work.
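One common form of test-time compute scaling is best-of-N sampling; the sketch below (with placeholder `generate` and `score` functions, not anything from the transcript) shows how extra inference compute can buy better answers from a fixed model without any additional training data:

```python
# Minimal sketch (assumption): "reasoning scale" as best-of-N sampling.
# Spending more test-time compute (larger n) buys more candidate answers
# to score and pick from, without changing the trained model at all.
# `generate` and `score` are placeholders for a real model and verifier.
import random

def generate(prompt: str) -> str:
    """Placeholder: one sampled answer from a fixed model."""
    return f"answer-{random.randint(0, 9)} to: {prompt}"

def score(prompt: str, answer: str) -> float:
    """Placeholder: a verifier or reward model scoring an answer."""
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

if __name__ == "__main__":
    print(best_of_n("Draft a product requirements document outline", n=16))
```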

Finally, the discussion frames the race in economic and operational terms. Scaling data and compute is expensive, so model makers may prefer to “rent” data-center capacity (a theme tied to Azure) rather than build everything in-house. Meanwhile, distillation—turning large models into smaller, cheaper ones—lets value propagate globally, but it also means competitive advantage depends on maintaining an edge through continual training and data scaling. The transcript ends by linking later-stage scaling to hardware ecosystems, with a strong suggestion that Nvidia’s GPU and robotics push is well-positioned for the sensor-heavy future.
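For reference, the standard soft-label distillation objective looks like the sketch below; the transcript does not spell out a specific method, so this is the textbook formulation rather than any particular lab's recipe:

```python
# Minimal sketch (standard soft-label distillation, not detailed in the
# transcript): the student is trained to match the teacher's temperature-
# softened output distribution, which is how a large model's behavior is
# compressed into a smaller, cheaper one.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(np.asarray(teacher_logits, dtype=float), temperature)
    q = softmax(np.asarray(student_logits, dtype=float), temperature)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

if __name__ == "__main__":
    print(distillation_loss([4.0, 1.0, 0.5], [3.5, 1.2, 0.4]))
```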

In short: the next leap isn’t just “more tokens,” but a structured move toward multilingual, multimodal, continuously ingested, sensor-enriched data—tempered by the possibility that reasoning compute can accelerate capability without waiting for zettabyte-scale training.

Cornell Notes

The transcript argues that scaling LLMs past today’s ~1T-token training regimes depends on *useful* data, not just more data. The path to 10T tokens emphasizes multilingual coverage (shifting from English-heavy corpora toward a more global language mix), curated expansion of underrepresented sources, and improved automation for cleaning, deduplication, and tokenization across languages. Reaching 100T tokens requires continuous ingestion—real-time web, social streams, permissioned transcription, and multimodal vision tokens—plus transformer architectures that unify text and images at scale. For quadrillion and zettabyte regimes, sensor logs, robotics, and city/industry-scale standardized data become central, though reasoning scale may deliver major gains earlier without waiting for extreme token volumes.

Why does “useful data” matter more than raw token counts when scaling LLMs?

The transcript treats today’s training as a curated snapshot (often around ~1T tokens) rather than a complete, clean representation of the internet. It argues that the internet keeps growing and that private data streams also expand, but the bottleneck is quality and relevance—especially for multilingual performance. That’s why the proposed ladder focuses on curation tools, deduplication across languages and formats, and tokenizers that adapt to morphological differences, aiming to make models better at expressing and understanding more languages rather than simply ingesting more text.
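A minimal, language-agnostic near-duplicate check of the kind such deduplication pipelines build on might look like this sketch (character shingles plus Jaccard similarity; production systems typically scale the same idea with MinHash/LSH):

```python
# Minimal sketch (assumption): language-agnostic near-duplicate detection
# using character n-gram shingles and Jaccard similarity. The core test --
# "do two documents share most of their shingles?" -- works across scripts
# without language-specific tokenization.

def shingles(text: str, n: int = 5) -> set:
    """Character n-grams; no language-specific word segmentation required."""
    text = " ".join(text.split()).lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold

if __name__ == "__main__":
    print(is_near_duplicate("Scaling LLMs past a trillion tokens.",
                            "Scaling LLMs past a trillion tokens!"))  # True
```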

What concrete changes are proposed to move from ~1T tokens to ~10T tokens?

The plan is to make the training set more multilingual and more representative of global language use. Instead of 80–90% English, it targets roughly 30–40% English with substantial Chinese, Hindi, French, and Spanish content. It also calls for focused curated expansion of historically underrepresented texts, automated cleaning, deduplication pipelines that work across diverse languages and formats, and flexible tokenizers that handle morphological variation. The expected outcome is more natural language behavior in non-English languages (e.g., Indonesian).

How does the strategy shift from 10T tokens to 100T tokens?

It expands from static curated text to continuous ingestion. The transcript lists real-time web streams, social media streams, large-scale transcription of podcasts and phone calls (with permission), and multimodal data such as vision tokens. It also stresses infrastructure: compute, storage, and network capacity far beyond current levels. Architecturally, it points to transformer designs that can unify text, images, and even video within one scalable framework, enabling continuously updated models that track world events.
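As a rough illustration of continuous ingestion (hypothetical names throughout, not the transcript's design), a training loop might consume a cleaned stream and emit fixed-size token batches rather than reading a static snapshot:

```python
# Minimal sketch (hypothetical): a continuous-ingestion loop that cleans
# streamed documents and yields fixed-size token batches, so a model can be
# refreshed on recent data instead of trained on a frozen corpus.
from typing import Iterable, Iterator, List

def clean(doc: str) -> str:
    """Placeholder for the quality filtering / dedup from earlier stages."""
    return " ".join(doc.split())

def token_batches(stream: Iterable[str], tokenizer, batch_tokens: int = 4096) -> Iterator[List[int]]:
    buffer: List[int] = []
    for doc in stream:                       # e.g. a web, social, or transcript feed
        buffer.extend(tokenizer(clean(doc)))
        while len(buffer) >= batch_tokens:
            yield buffer[:batch_tokens]
            buffer = buffer[batch_tokens:]

if __name__ == "__main__":
    fake_feed = (f"document number {i} from the live feed" for i in range(100))
    fake_tokenizer = lambda text: [ord(c) for c in text]   # stand-in tokenizer
    print(sum(1 for _ in token_batches(fake_feed, fake_tokenizer, batch_tokens=256)))
```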

What data sources become central at quadrillion and zettabyte token scales?

At quadrillion scale, the transcript brings in sensor logs tied to robotics and embodied experience: camera feeds, tactile sensors, motor commands, and Internet of Things data. It emphasizes compressing raw visual/audio frames into semantically meaningful tokens and learning from interactions and unlabeled sensor streams. At zettabyte scale, it moves toward city- and industry-scale standardized data: 3D scans, full audio streams, simulation logs, and real-time sensor data across healthcare, manufacturing, agriculture, and autonomous vehicles—requiring modular architectures that can handle multi-trillion-parameter systems.

Why might extreme token scale (zettabytes) not be necessary for general intelligence?

The transcript argues that capability can scale through reasoning without directly scaling training-data volume. It highlights “reasoning scale” as a second scaling law: increasing compute at test time can produce smarter behavior even if the training set doesn’t reach zettabyte levels. It also notes that current models already satisfy many enterprise requirements (like product requirements documents) using “good enough” model behavior, even if general-purpose performance remains incomplete.

Review Questions

  1. Which specific data-management upgrades (cleaning, deduplication, tokenization) are presented as prerequisites for multilingual scaling to 10T tokens?
  2. What multimodal and continuous-ingestion sources are listed as key ingredients for reaching 100T tokens?
  3. How does “reasoning scale” function as an alternative path to capability growth compared with waiting for zettabyte-scale training data?

Key Points

  1. Scaling beyond ~1T tokens hinges on improving *useful* training data quality, not just increasing token counts.

  2. Moving toward ~10T tokens requires a multilingual shift—roughly 30–40% English rather than 80–90%—plus curated expansion of underrepresented texts.

  3. Automated cleaning, cross-language deduplication pipelines, and more flexible tokenizers are treated as essential for multilingual data at scale.

  4. Reaching ~100T tokens depends on continuous ingestion (web, social, permissioned transcription) and multimodal vision tokens, alongside transformer architectures that unify modalities.

  5. Quadrillion-scale training is framed around sensor logs and robotics, compressing raw visual/audio streams into semantically meaningful tokens.

  6. Zettabyte-scale training implies city- and industry-wide standardized sensor data and 3D/real-time streams, but reasoning compute may deliver major gains earlier than waiting for that ladder.

  7. Competitive advantage is portrayed as shaped by economics: data-center capacity and ongoing scaling efforts create an arms-race dynamic, while distillation spreads value into smaller models.

Highlights

The proposed 10T-token leap is less about volume and more about multilingual representativeness—shifting from English-heavy corpora to a global language mix.
Continuous ingestion plus multimodal vision tokens are presented as the core ingredients for moving from 10T to 100T tokens.
Quadrillion-scale training ties LLM capability to embodied sensor logs, requiring token compression and interaction learning.
Reasoning scale is offered as a separate scaling law that can boost intelligence without waiting for zettabyte-scale training data.
The scaling ladder is framed as hardware- and economics-driven, with data-center “rental” and distillation shaping how competitors maintain advantage.

Topics

  • Token Scaling
  • Multilingual Data
  • Multimodal Transformers
  • Reasoning Compute
  • Robotics Sensors
