
LLaMA2 for Multilingual Fine Tuning?

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Tokenizer fit is the first gate for multilingual fine-tuning; test tokenization before training or buying compute.

Briefing

Multilingual fine-tuning with LLaMA 2 hinges less on the model weights and more on whether its tokenizer breaks your target language into efficient subword units. When the tokenizer turns a simple phrase into many tokens—or worse, into near character-by-character pieces—fine-tuning becomes harder and more data-hungry, because the model must make many more next-token predictions to express the same meaning.

LLaMA 2’s training data is heavily English, with additional coverage for languages such as German, French, Spanish, Russian, and some East Asian languages. Even where those languages appear in training, the practical bottleneck shows up during tokenization. The tokenizer is shared across LLaMA 2 variants and uses a fixed vocabulary size of 32,000 tokens. That matters because generation and training both rely on predicting the next token via a softmax over those 32,000 classes. If a language’s writing system forces the tokenizer to fragment text into many tokens, the model effectively has to make far more predictions to express the same content.
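
As a minimal sketch of that setup (assuming access to the gated meta-llama/Llama-2-7b-hf repo on Hugging Face; every LLaMA 2 size ships the same 32,000-piece SentencePiece tokenizer), you can confirm the vocabulary size and see how a phrase is split:

```python
from transformers import AutoTokenizer

# Any LLaMA 2 checkpoint shares the same tokenizer; this repo is gated and
# requires accepting Meta's license on Hugging Face first.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(tok.vocab_size)                    # 32000 classes in the next-token softmax
pieces = tok.tokenize("My name is Sam")  # subword pieces, no special tokens added
print(len(pieces), pieces)               # compact: roughly one piece per word
```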

In the notebook walkthrough, English and French behave well: “My name is Sam” stays compact, using roughly one token per word (about four tokens total), and casing differences (like capital letters) don’t explode token counts. Spanish is similarly efficient. But Thai illustrates the failure mode. Thai characters can combine consonants and vowels in multiple positions, and the tokenizer ends up splitting what should be a single character/visual unit into multiple Unicode components. The result is a jump from about four tokens in English/French to roughly fourteen tokens for the Thai equivalent—meaning fine-tuning would likely require more data to learn stable mappings.

Greek is even harsher in this framing: the tokenizer tends toward one token per character, which is described as the worst case for fine-tuning because it removes the subword structure that language models benefit from. Chinese is mixed: some characters tokenize cleanly, but other characters require multiple Unicode pieces, pushing token counts higher (around twelve tokens for the example phrase). The takeaway is pragmatic: before spending time or money on multilingual fine-tuning, test tokenization for the target language and look for excessive fragmentation.
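
A sketch of that pre-flight check against the same tokenizer (the non-English phrases below are rough translations of “My name is Sam” added here for illustration, not taken from the video, and exact counts depend on the tokenizer version):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

phrases = {
    "English": "My name is Sam",
    "French":  "Je m'appelle Sam",
    "Thai":    "ผมชื่อแซม",
    "Greek":   "Με λένε Σαμ",
    "Chinese": "我叫山姆",
}

for lang, text in phrases.items():
    n = len(tok.tokenize(text))
    # Tokens per character is a quick fragmentation signal: values near or
    # above 1.0 mean character-level (or byte-level) splitting.
    print(f"{lang:8s} {n:3d} tokens   {n / len(text):.2f} tokens/char")
```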

The transcript then compares other model families through the same lens. BLOOM uses a much larger tokenizer (250,000 tokens) and relies on byte-pair encoding, which can make next-token prediction harder because the softmax spans a much larger space. GLM2, a bilingual English–Chinese model, has a tokenizer of around 64,000 tokens and handles Chinese better than LLaMA 2, while still struggling with Thai and not being trained for Greek. MT5-style multilingual models use very large tokenizers (around 250,000 tokens) but handle Thai and Greek more effectively than LLaMA 2 in the examples shown, including reducing Thai token counts from the “14–15 tokens” range down to about four.
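
The same check extends to other tokenizers. A sketch using public checkpoints (each family shares its tokenizer across sizes; the mT5 tokenizer additionally needs the sentencepiece package installed, and GLM2 is left out of this sketch):

```python
from transformers import AutoTokenizer

thai = "ผมชื่อแซม"  # rough Thai rendering of "My name is Sam"

tokenizers = {
    "LLaMA 2": "meta-llama/Llama-2-7b-hf",  # gated; 32k SentencePiece vocab
    "BLOOM":   "bigscience/bloom-560m",     # ~250k byte-level BPE vocab
    "mT5":     "google/mt5-small",          # ~250k multilingual SentencePiece vocab
}

for name, repo in tokenizers.items():
    tok = AutoTokenizer.from_pretrained(repo)
    pieces = tok.tokenize(thai)
    print(f"{name:8s} vocab={tok.vocab_size}  tokens={len(pieces)}  {pieces}")
```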

Finally, some open-source LLaMA-derived models use bigger tokenizers (e.g., 50K) and BPE, yet still show poor tokenization for Thai, Greek, and likely Chinese—so bigger tokenizers alone don’t guarantee multilingual readiness. The core recommendation is straightforward: validate tokenizer fit first, then choose a model whose tokenizer supports the target language’s scripts with subword grouping rather than character-level fragmentation.

Cornell Notes

LLaMA 2’s multilingual fine-tuning quality is strongly constrained by its tokenizer, not just by the model. The tokenizer uses a 32,000-token vocabulary and drives training/generation through next-token prediction over those classes. For English and many Western European languages, common phrases tokenize efficiently (often near one token per word). For scripts with complex Unicode composition—like Thai—or for languages where the tokenizer falls back to character-level pieces—like Greek—token counts rise sharply (e.g., ~14 tokens for Thai vs ~4 for English), making fine-tuning harder and more data-hungry. The transcript recommends testing tokenization for the target language before committing to fine-tuning, and comparing alternative models whose tokenizers better match the script.

Why does tokenizer token-count matter so much for fine-tuning LLaMA 2 on a new language?

Because LLaMA 2 predicts the next token using a softmax over its tokenizer vocabulary (32,000 classes). If the tokenizer splits one meaningful phrase into many tokens, the model must make many more next-token predictions to represent the same content. That increases the effective sequence length and typically makes learning more data-intensive and less stable during fine-tuning.
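
A rough illustration of that cost (toy shapes assumed here, not LLaMA 2 internals): the language-modeling loss is one cross-entropy over the 32,000-way vocabulary per token position, so a 14-token Thai phrase pays for 14 prediction steps where the English version pays for 4.

```python
import torch
import torch.nn.functional as F

vocab_size = 32_000

def lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: [seq_len, vocab_size], targets: [seq_len]
    # one softmax / cross-entropy term per token position
    return F.cross_entropy(logits, targets)

for seq_len in (4, 14):  # English-like vs Thai-like token counts for the same sentence
    logits = torch.randn(seq_len, vocab_size)           # stand-in model outputs
    targets = torch.randint(0, vocab_size, (seq_len,))  # stand-in next-token labels
    print(seq_len, "prediction steps, loss =", round(lm_loss(logits, targets).item(), 3))
```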

What tokenization behavior looks “good” in the examples for LLaMA 2?

English and French examples stay compact: “My name is Sam” uses about four tokens, roughly matching the number of words. The tokenizer also handles casing differences (capital vs lowercase) without exploding token counts, which suggests it has subword units that align well with Latin-script text.
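
A quick, illustrative way to check this with the same gated tokenizer as above:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Compare the pieces produced for capitalized vs lowercased variants of the phrase.
for text in ("My name is Sam", "my name is sam"):
    print(text, "->", tok.tokenize(text))
```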

What goes wrong with Thai tokenization under LLaMA 2, and what does it imply?

Thai’s writing system allows vowels to appear before/above/below/after consonants, and the example shows that what should be a single visual/linguistic unit ends up represented by multiple Unicode components. Under LLaMA 2, the Thai phrase “pom cheu Sam” tokenizes into about 14 tokens instead of ~4 for English/French. The implication is that fine-tuning would likely require more data because the model must learn from a more fragmented token stream.
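
To see the Unicode composition issue directly (an added illustration, not from the video): the Thai word ชื่อ (“name”, the “cheu” in the example phrase) is one short syllable on screen but four separate code points, two of them combining marks, and each piece can end up as its own token.

```python
import unicodedata

word = "ชื่อ"  # Thai for "name", as in "pom cheu Sam"
for ch in word:
    # category "Mn" marks nonspacing combining characters (vowel/tone marks)
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
print(len(word), "code points for one syllable")
```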

How does Greek tokenization under LLaMA 2 affect expectations for fine-tuning?

Greek is described as worse: the tokenizer tends toward one token per character. That’s unfavorable because it removes the subword grouping that language models rely on for efficient learning. The transcript frames this as a “worst case” scenario for fine-tuning compared with subword or word-like tokenization.

How do alternative multilingual models compare when judged by tokenizer fit?

BLOOM uses a much larger tokenizer (250,000 tokens) with byte-pair encoding; it’s described as good for English/French, not good for Thai/Greek, and better for Chinese than LLaMA 2. GLM2 (English–Chinese) uses ~64,000 tokens and is stronger for Chinese, still not great for Thai, and not trained for Greek. MT5 multilingual models use very large tokenizers (~250,000) but can reduce Thai token counts (down to about four in the example) and handle Greek better than LLaMA 2, while still being less specialized than Chinese-focused models.

What practical workflow does the transcript recommend before fine-tuning?

Test the tokenizer on the target language first. If tokenization chops text into character-by-character pieces or requires multiple Unicode tokens per character, LLaMA 2 fine-tuning is expected to underperform. Then search Hugging Face for multilingual models whose tokenizers are designed to group the target script into subwords, avoiding wasted effort on a poor tokenizer-model match.
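
One way to package that check (a sketch; the repo ID, sample sentence, and the 1.0 tokens-per-character threshold are illustrative assumptions, not values from the video):

```python
from transformers import AutoTokenizer

def fragmentation_ratio(repo_id: str, samples: list[str]) -> float:
    """Average tokens per character over a few target-language sentences."""
    tok = AutoTokenizer.from_pretrained(repo_id)
    total_tokens = sum(len(tok.tokenize(s)) for s in samples)
    total_chars = sum(len(s) for s in samples)
    return total_tokens / total_chars

# Replace with real sentences from the target-language corpus.
samples = ["ผมชื่อแซม"]
ratio = fragmentation_ratio("meta-llama/Llama-2-7b-hf", samples)

if ratio >= 1.0:
    print(f"{ratio:.2f} tokens/char: character-level or worse; compare other tokenizers on Hugging Face")
else:
    print(f"{ratio:.2f} tokens/char: subword grouping looks workable")
```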

Review Questions

  1. If LLaMA 2’s tokenizer turns a target-language sentence into 3–4× more tokens than English, how does that change the fine-tuning problem in terms of next-token prediction steps?
  2. Which scripts in the examples are most problematic for LLaMA 2 tokenization, and what specific tokenization pattern is cited for each (e.g., Unicode fragmentation vs character-level tokens)?
  3. When comparing BLOOM, GLM2, and MT5, what tokenizer-related signals are used to predict which languages will fine-tune more effectively?

Key Points

  1. Tokenizer fit is the first gate for multilingual fine-tuning; test tokenization before training or buying compute.

  2. LLaMA 2 uses a 32,000-token vocabulary, so languages that fragment into many tokens force more next-token predictions.

  3. English and many Western European languages tokenize efficiently under LLaMA 2, often near one token per word.

  4. Thai and Greek are highlighted as problematic because Thai can require multiple Unicode tokens per character and Greek can degrade toward character-level tokenization.

  5. Chinese is mixed under LLaMA 2: some characters tokenize well, but others split into multiple Unicode pieces, increasing token counts.

  6. Model comparisons (BLOOM, GLM2, MT5) should be made by tokenizer behavior on the target script, not by model size alone.

  7. A practical approach is to test the target language in a tokenizer notebook and then select a multilingual model whose tokenizer groups the script into subwords.

Highlights

LLaMA 2’s multilingual fine-tuning bottleneck is often the tokenizer: more tokens per phrase means more next-token steps and more data needed.
Thai tokenization under LLaMA 2 jumps from ~4 tokens (English/French example) to ~14 tokens due to Unicode composition and vowel/consonant placement.
Greek under LLaMA 2 trends toward one token per character—described as the worst setup for fine-tuning efficiency.
MT5’s multilingual tokenizer is shown reducing Thai token counts dramatically (from ~14–15 down to ~4 in the example), illustrating why tokenizer choice can outweigh model choice.

Topics

  • Tokenizer Efficiency
  • Multilingual Fine-Tuning
  • Unicode Tokenization
  • Model Comparison
  • LLaMA 2

Mentioned

  • BPE
  • MT5
  • T5
  • GLM2
  • LLaMA