LLaMA 2 for Multilingual Fine-Tuning?
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Tokenizer fit is the first gate for multilingual fine-tuning; test tokenization before training or buying compute.
Briefing
Multilingual fine-tuning with LLaMA 2 hinges less on the model weights and more on whether its tokenizer breaks your target language into efficient subword units. When the tokenizer turns a simple phrase into many tokens—or worse, into near character-by-character pieces—fine-tuning becomes data-hungry and harder, because the model must predict many more next-token steps for the same meaning.
LLaMA 2’s training data is heavily English, with additional coverage for languages such as German, French, Spanish, Russian, and some East Asian languages. Even where those languages appear in training, the practical bottleneck shows up during tokenization. The tokenizer is shared across LLaMA 2 variants and uses a fixed vocabulary size of 32,000 tokens. That matters because generation and training both rely on predicting the next token via a softmax over those 32,000 classes. If a language’s writing system forces the tokenizer to fragment text into many tokens, the model effectively has to make far more predictions to express the same content.
In the notebook walkthrough, English and French behave well: “My name is Sam” stays compact, using roughly one token per word (about four tokens total), and casing differences (like capital letters) don’t explode token counts. Spanish is similarly efficient. But Thai illustrates the failure mode. Thai characters can combine consonants and vowels in multiple positions, and the tokenizer ends up splitting what should be a single character/visual unit into multiple Unicode components. The result is a jump from about four tokens in English/French to roughly fourteen tokens for the Thai equivalent—meaning fine-tuning would likely require more data to learn stable mappings.
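The fragmentation has a Unicode root: Thai vowels and tone marks are stored as separate combining codepoints, so a single visual syllable is already several codepoints before the tokenizer runs. A quick standard-library check (the word ชื่อ, "name", is my own example, not one from the video):

```python
import unicodedata

# "ชื่อ" (Thai for "name") renders as a compact cluster, but it is
# built from four separate Unicode codepoints; a tokenizer with weak
# Thai coverage can fall back to emitting a token per codepoint (or
# even per UTF-8 byte), inflating token counts.
word = "\u0E0A\u0E37\u0E48\u0E2D"  # ชื่อ
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
print(len(word), "codepoints")  # 4
```

In UTF-8 each of these codepoints is three bytes, so a byte-level fallback is even worse; either way, four words' worth of meaning can balloon to the roughly fourteen tokens seen in the example.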
Greek is even harsher in this framing: the tokenizer tends toward one token per character, which is described as the worst case for fine-tuning because it removes the subword structure that language models benefit from. Chinese is mixed: some characters tokenize cleanly, but other characters require multiple Unicode pieces, pushing token counts higher (around twelve tokens for the example phrase). The takeaway is pragmatic: before spending time or money on multilingual fine-tuning, test tokenization for the target language and look for excessive fragmentation.
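The "look for excessive fragmentation" advice can be made measurable with a fertility-style metric: average tokens per whitespace word (for unsegmented scripts like Thai or Chinese, tokens per character is the analogous check). A minimal sketch; the helper and the stand-in tokenizers are my own illustration, not from the video:

```python
def fertility(tokenize, text):
    """Average tokens per whitespace-separated word.
    Values near 1.0 are efficient; much higher values signal the
    character-level fragmentation described above."""
    tokens = tokenize(text)
    words = text.split() or [text]
    return len(tokens) / len(words)

# Two stand-in tokenizers showing the extremes:
word_level = str.split                            # best case: one token per word
char_level = lambda s: list(s.replace(" ", ""))   # Greek-style worst case

sentence = "My name is Sam"
print(fertility(word_level, sentence))  # 1.0
print(fertility(char_level, sentence))  # 2.75
```

With a real tokenizer you would pass its tokenize callable in the same slot (for example, a Hugging Face tokenizer's `tokenize` method) and compare the resulting ratios across candidate languages.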
The transcript then compares other model families through the same lens. BLOOM uses a much larger tokenizer (about 250,000 tokens) and relies on byte-pair encoding, which can increase the difficulty of next-token prediction due to the larger softmax space. GLM2, a bilingual English–Chinese model, has a tokenizer of around 64,000 tokens and performs better than LLaMA 2 for Chinese, while still struggling with Thai and not being trained for Greek. mT5-style multilingual models use very large tokenizers (around 250,000 tokens) but handle Thai and Greek more effectively than LLaMA 2 in the examples shown, reducing Thai token counts from the 14–15 range down to about four.
Finally, some open-source LLaMA-derived models use bigger tokenizers (e.g., 50K) and BPE, yet still show poor tokenization for Thai, Greek, and likely Chinese—so bigger tokenizers alone don’t guarantee multilingual readiness. The core recommendation is straightforward: validate tokenizer fit first, then choose a model whose tokenizer supports the target language’s scripts with subword grouping rather than character-level fragmentation.
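A sketch of that validation step: given token counts measured for the same test phrase under a candidate tokenizer, flag languages that fragment badly relative to an English baseline. The helper name and the 2x threshold are my own choices, not from the transcript; the counts echo the video's LLaMA 2 examples:

```python
def tokenizer_fit(counts, baseline="english", max_ratio=2.0):
    """counts maps language -> token count for the same test phrase.
    Returns language -> True if the count stays within max_ratio of
    the baseline language's count."""
    base = counts[baseline]
    return {lang: n / base <= max_ratio for lang, n in counts.items()}

# Approximate counts from the LLaMA 2 walkthrough:
llama2_counts = {"english": 4, "french": 4, "thai": 14, "chinese": 12}
print(tokenizer_fit(llama2_counts))
# english/french pass; thai and chinese exceed 2x the baseline
```

In practice the counts would come from running each candidate model's tokenizer (for example via the Hugging Face `AutoTokenizer` API) over the same phrase in every target language, then keeping only models whose tokenizer passes for the scripts you care about.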
Cornell Notes
LLaMA 2’s multilingual fine-tuning quality is strongly constrained by its tokenizer, not just by the model. The tokenizer uses a 32,000-token vocabulary and drives training/generation through next-token prediction over those classes. For English and many Western European languages, common phrases tokenize efficiently (often near one token per word). For scripts with complex Unicode composition—like Thai—or for languages where the tokenizer falls back to character-level pieces—like Greek—token counts rise sharply (e.g., ~14 tokens for Thai vs ~4 for English), making fine-tuning harder and more data-hungry. The transcript recommends testing tokenization for the target language before committing to fine-tuning, and comparing alternative models whose tokenizers better match the script.
- Why does tokenizer token-count matter so much for fine-tuning LLaMA 2 on a new language?
- What tokenization behavior looks “good” in the examples for LLaMA 2?
- What goes wrong with Thai tokenization under LLaMA 2, and what does it imply?
- How does Greek tokenization under LLaMA 2 affect expectations for fine-tuning?
- How do alternative multilingual models compare when judged by tokenizer fit?
- What practical workflow does the transcript recommend before fine-tuning?
Review Questions
- If LLaMA 2’s tokenizer turns a target-language sentence into 3–4× more tokens than English, how does that change the fine-tuning problem in terms of next-token prediction steps?
- Which scripts in the examples are most problematic for LLaMA 2 tokenization, and what specific tokenization pattern is cited for each (e.g., Unicode fragmentation vs character-level tokens)?
- When comparing BLOOM, GLM2, and mT5, what tokenizer-related signals are used to predict which languages will fine-tune more effectively?
Key Points
1. Tokenizer fit is the first gate for multilingual fine-tuning; test tokenization before training or buying compute.
2. LLaMA 2 uses a 32,000-token vocabulary, so languages that fragment into many tokens force more next-token predictions.
3. English and many Western European languages tokenize efficiently under LLaMA 2, often near one token per word.
4. Thai and Greek are highlighted as problematic because Thai can require multiple Unicode tokens per character and Greek can degrade toward character-level tokenization.
5. Chinese is mixed under LLaMA 2: some characters tokenize well, but others split into multiple Unicode pieces, increasing token counts.
6. Model comparisons (BLOOM, GLM2, mT5) should be made by tokenizer behavior on the target script, not by model size alone.
7. A practical approach is to test the target language in a tokenizer notebook and then select a multilingual model whose tokenizer groups the script into subwords.