Words to Bytes: Exploring Language Tokenizations | Sam Gbafa

TL;DR

Tokenization granularity (words, subwords, characters, bytes) can significantly affect language model learning outcomes even when the model architecture and training budget are held constant.

Briefing

Tokenization choices can materially change how well a language model learns, but the “best” granularity depends on data size and model capacity. Sam Gbafa’s project tested word-, subword-, character-, and byte-level tokenizations using the same decoder-only transformer setup and found that no granularity wins automatically, especially when the training corpus is relatively small.

Gbafa built a baseline around standard autoregressive sequence modeling: the model predicts the next token from prior context, learning statistical relationships from a corpus such as Wikipedia. The project then zoomed in on tokenization as a key lever for sequence models. Prior work suggested that finer-grained tokenizations can outperform coarser ones, and that learning segmentations can improve generalization. To test those claims directly, Gbafa trained tokenizers on the same dataset (Wall Street Journal and Wikipedia articles) and compared how different token granularities affected training perplexity and validation behavior.

In concrete terms, word tokenization splits text on whitespace, subword tokenization breaks words into smaller pieces (for example, “swimming” into “swim” and “-ming”), character tokenization splits text into individual characters, and byte tokenization represents each character by its underlying byte encoding (in UTF-8, a Unicode character takes one to four bytes, and most English characters take one). The experiments used a 12-layer decoder-only transformer with about 80 million parameters, constant compute, and the same context length across tokenization schemes. Vocabulary sizes differed by design: word tokenization learned roughly 10,000 vocabulary words, subword tokenization used a vocabulary of 40,000 units, and character tokenization learned a much smaller set of unique characters.
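A minimal Python sketch of the four granularities, for illustration only. The sample string, the function names, and the hand-written two-piece subword vocabulary are all assumptions; a real subword tokenizer learns its vocabulary from data, and the source does not say which tokenizer implementations Gbafa used.

```python
# Illustrative only: toy splitters, not the tokenizers used in the project.
TEXT = "swimming in Z\u00fcrich"  # "ü" written as an escape so it is a single code point


def word_tokens(text):
    # Word-level: split on whitespace.
    return text.split()


def subword_tokens(word, vocab):
    # Greedy longest-match against a fixed, hypothetical vocabulary;
    # falls back to a single character when no piece matches.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces


def char_tokens(text):
    # Character-level: one token per Unicode code point.
    return list(text)


def byte_tokens(text):
    # Byte-level: one token per UTF-8 byte (integers 0..255).
    return list(text.encode("utf-8"))


words = word_tokens(TEXT)
subwords = subword_tokens("swimming", {"swim", "ming"})
chars = char_tokens(TEXT)
byte_ids = byte_tokens(TEXT)
print(len(words), subwords, len(chars), len(byte_ids))
```

On this string the sketch yields 3 word tokens, 18 character tokens, and 19 byte tokens, because the one non-ASCII character takes two bytes in UTF-8. The same text becomes a longer sequence as the units shrink.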

The results complicated the expectation that subwords would beat characters: subword perplexity was not consistently lower, and character tokenization performed better in this particular setup. Gbafa attributed the mismatch partly to dataset scale. With a relatively small corpus, a 40,000-unit subword vocabulary can be too large, leaving many subword units undertrained. Validation perplexity was also high in at least one run, suggesting overfitting and weak generalization rather than a clean separation of tokenization quality.
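For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that definition and how a gap between training and validation perplexity shows up numerically. The log-probabilities are made-up values chosen for easy arithmetic, not results from the project.

```python
import math


def perplexity(token_log_probs):
    # Perplexity = exp(mean negative log-likelihood per token), natural log.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical values for illustration only.
train_log_probs = [math.log(0.5)] * 4   # model assigns p = 0.5 to each training token
valid_log_probs = [math.log(0.1)] * 4   # model assigns p = 0.1 to each held-out token

train_ppl = perplexity(train_log_probs)
valid_ppl = perplexity(valid_log_probs)
print(round(train_ppl, 3), round(valid_ppl, 3))
```

Here training perplexity is about 2 and validation perplexity about 10; a gap in that direction is the overfitting signature described above. Perplexity is measured per token, so values computed under different tokenizations are averages over different units and need care when compared directly.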

The project’s takeaways were practical. Smaller segmentations can encode more nuanced information, but they may require a larger model to build useful representations across layers—transformers often construct word-level meaning gradually, and character-level modeling may need more capacity and training signal. Context length also matters: representing the same amount of information with smaller tokens effectively demands longer contexts. Gbafa further emphasized that the number of subword units is a critical hyperparameter, so tokenization sweeps should include multiple subword vocab sizes, and byte-level approaches likely need larger and more diverse multilingual data to pay off.
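To make the idea of sweeping subword vocabulary sizes concrete, here is a toy sketch that uses byte-pair encoding (BPE) merges as the knob. BPE is an assumption on my part: the source does not name the subword algorithm Gbafa used. The word list, the merge counts, and all function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    # Start from characters and repeatedly merge the most frequent adjacent pair.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Tie-break on the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word, merges):
    # Apply the learned merges in training order.
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["swim", "swims", "swimming", "swimmer", "running", "runner"]
base_chars = {c for w in words for c in w}

results = []
for num_merges in (0, 2, 6):  # hypothetical sweep values
    merges = train_bpe(words, num_merges)
    pieces = encode("swimming", merges)
    # Each merge adds at most one new symbol to the vocabulary.
    results.append((len(base_chars) + len(merges), pieces))

for vocab_size, pieces in results:
    print(vocab_size, pieces)
```

With zero merges the segmentation is purely character-level, and each additional merge can only shorten or preserve the token sequence. A real sweep would train one model per vocabulary size and compare validation perplexity across them.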

Beyond the experiments, Gbafa described learning outcomes from the OpenAI Scholars program: how to extract value from research papers, how data issues can derail training, and how overfitting, regularization, and hyperparameter optimization drive iterative improvements. The work also prompted reflection on the societal implications of generative models, including their effects on democracy and public trust.

Cornell Notes

Sam Gbafa’s project tested how tokenization granularity affects an autoregressive language model’s learning. Using a 12-layer, ~80M-parameter decoder-only transformer with constant compute and context length, he compared word, subword, character, and byte tokenizations trained on Wall Street Journal and Wikipedia. Contrary to the expectation that subwords would outperform characters, character tokenization performed better in this setup, likely because the training corpus was small relative to the 40,000-unit subword vocabulary. The findings suggest that finer segmentations require enough data, appropriate subword vocabulary sizing, and often more model capacity and/or longer context to generalize well. The work also highlights that tokenization hyperparameters and dataset scale jointly determine performance, not granularity alone.

Why does tokenization granularity matter for an autoregressive language model?

Tokenization determines what the model predicts at each step. With word tokenization, the next prediction is a whole word; with subwords, the next prediction can be a word fragment (e.g., “swim” then “-ming”); with characters, the model predicts individual characters; with bytes, it predicts byte-level encodings of Unicode characters. Finer granularity can expose morphology and sub-structure, but it also increases sequence length and the burden on the model to build higher-level representations from smaller units.
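A small sketch of what "predicting the next token" looks like as training examples under two granularities. The helper name `next_token_pairs`, the sample text, and the context length of 4 are assumptions made for illustration.

```python
def next_token_pairs(tokens, context_len):
    # Build (context, target) examples: the model sees up to `context_len`
    # previous tokens and must predict the token that follows.
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_len):i]
        pairs.append((context, tokens[i]))
    return pairs


text = "he swims"
word_pairs = next_token_pairs(text.split(), context_len=4)
char_pairs = next_token_pairs(list(text), context_len=4)

print(word_pairs)       # one prediction step: a whole word
print(len(char_pairs))  # seven prediction steps: one per character
print(char_pairs[-1])   # the last step sees only "swim" and predicts "s"
```

At the word level the two-word text yields a single prediction; at the character level it yields seven, and with a four-token window the final prediction no longer sees the previous word at all.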

What experimental setup did Gbafa keep constant, and why is that important?

He used the same decoder-only transformer architecture (12 layers, about 80 million parameters), kept compute constant, and used the same context length across tokenization schemes. He also trained tokenizers on the same dataset sources (Wall Street Journal and Wikipedia). Holding these factors steady makes it easier to attribute performance differences to tokenization choices rather than to changes in model size or training budget.

What did the project find about subword vs. character tokenization?

The expectation was that subwords would outperform characters, but the observed training perplexity trends did not consistently support that. In at least one comparison run, character tokenization performed better than subword tokenization. Gbafa linked this partly to data scale: the subword vocabulary was large (40,000) while the corpus was relatively small, so many subword units may not have been learned well. He also noted signs of overfitting and high validation perplexity in that run.

How do vocabulary size and dataset size interact in these results?

Vocabulary size is a key hyperparameter for subword tokenization. A large subword vocabulary can be beneficial when there is enough data to estimate reliable statistics for many subword units. With limited training data, a 40,000 subword vocabulary can be too sparse, leaving undertrained pieces and weakening generalization—conditions under which character tokenization can look better because it relies on a smaller set of unique characters.
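The sparsity argument can be illustrated by counting how many vocabulary units appear only rarely in a small corpus. In this hedged sketch, whole words stand in for a large vocabulary of units and characters for a small one; the toy corpus, the `min_count` threshold, and the function name are all assumptions, not the project's data.

```python
from collections import Counter


def rare_fraction(tokens, min_count):
    # Fraction of distinct units seen fewer than `min_count` times.
    counts = Counter(tokens)
    rare = sum(1 for c in counts.values() if c < min_count)
    return rare / len(counts)


corpus = "the cat sat on the mat the dog sat on the log"

# Larger units: 7 distinct words, 4 of which occur only once.
word_frac = rare_fraction(corpus.split(), min_count=2)

# Smaller units: 12 distinct letters (spaces dropped), 4 of which occur only once.
char_frac = rare_fraction(list(corpus.replace(" ", "")), min_count=2)

print(round(word_frac, 3), round(char_frac, 3))
```

On this toy corpus roughly 57% of word types are seen only once versus about 33% of character types. That is the direction of the effect described above: the larger the vocabulary relative to the data, the more units have too few examples to learn from.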

Why might smaller tokens require longer context or more capacity?

Smaller tokens increase the number of steps needed to represent the same amount of text. Gbafa argued that to represent the same information budget, context length may need to be longer for character- or subword-level tokenizations. He also suggested that transformers build representations progressively across layers; character-level modeling may need more model capacity to assemble word-like meaning from character sequences.
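A back-of-the-envelope sketch of the context-length point: how many tokens a window must hold to cover a fixed amount of raw text under two granularities. The sample sentence, the 1,000-character budget, and the function name are illustrative assumptions.

```python
import math


def tokens_needed(n_chars, avg_chars_per_token):
    # Number of tokens required to span `n_chars` characters of text.
    return math.ceil(n_chars / avg_chars_per_token)


sample = "the cat sat on the mat"

# Average characters per word token, counting whitespace toward the words.
avg_word = len(sample) / len(sample.split())
# A character token always covers exactly one character.
avg_char = 1.0

word_budget = tokens_needed(1000, avg_word)
char_budget = tokens_needed(1000, avg_char)
print(word_budget, char_budget)
```

With these assumptions, covering 1,000 characters takes 273 word tokens but 1,000 character tokens, so a fixed context length sees far less text at finer granularity.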

How does this work connect to multimodal modeling and segmentation learning?

Gbafa framed tokenization experiments as a baseline for a broader goal: learning segmentations and then studying how that improves performance. He pointed to prior research in multilingual translation in which learned segmentations improved outcomes over fixed ones. That motivation extends to multimodal settings because segmentation structure can matter across modalities, not just text.

Review Questions

How would you expect training perplexity and validation perplexity to change if you increase subword vocabulary size without increasing dataset size?
What trade-offs arise when switching from word tokenization to character or byte tokenization in terms of sequence length, context requirements, and model capacity?
Which experimental controls in Gbafa’s setup help isolate the effect of tokenization, and which variables might still confound the comparison?

Key Points

1. Tokenization granularity (words, subwords, characters, bytes) can significantly affect language model learning outcomes even when the model architecture and training budget are held constant.
2. No granularity guarantees better performance; in Gbafa’s experiments, character tokenization outperformed subword tokenization under a small-corpus regime, contrary to the expectation that subwords would win.
3. A large subword vocabulary (e.g., 40,000) can underperform when training data is limited, because many subword units remain poorly learned.
4. Smaller tokens often require longer context length to represent the same amount of information and may need more model capacity to build higher-level representations.
5. Subword vocabulary size is a critical hyperparameter; meaningful comparisons should sweep multiple subword tokenizers rather than rely on a single segmentation scheme.
6. Byte-level tokenization is closely tied to Unicode encoding and may be more promising with larger, more diverse multilingual data than with English-only corpora.
7. Training stability and generalization are strongly influenced by overfitting and regularization, so validation behavior is as important as training perplexity.

Highlights

Character tokenization beat subword tokenization in this controlled setup, challenging the expectation that subwords would outperform characters.

The likely culprit was the mismatch between a 40,000 subword vocabulary and a relatively small training corpus, leading to weak generalization.

Smaller tokens shift the burden to the model: longer contexts and/or more capacity may be needed to reconstruct word-level meaning across transformer layers.

Tokenization isn’t just a preprocessing choice; it behaves like a set of hyperparameters that interact with data scale and context length.

Topics

Language Tokenization
Autoregressive Modeling
Subword Vocabularies
Perplexity
Sequence Modeling

Mentioned

Sam Gbafa
Arvind
LSTMs
GPT-3
GPT-2
GPT
WSJ

Words to Bytes: Exploring Language Tokenizations | Sam Gbafa | OpenAI Scholars Demo Day 2021