
Tiny Aya - Cohere's Mini Multilingual Models

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Models can fail on low-resource languages due to limited training data and inefficient tokenization that fragments non-Latin scripts into many tokens.

Briefing

Choosing a language model for non-English languages is often a guessing game, especially for low-resource languages with limited internet data and tokenization schemes that break words into inefficient fragments. The core insight here is that Cohere’s new “tiny” multilingual models are designed to close that gap by combining broad pretraining across 70+ languages with region-focused post-training, then packaging the result into small ~3.3B-parameter models that can be run and fine-tuned more easily than large multilingual systems.

The transcript lays out why recommendations are hard. Some languages may never have appeared in training, largely because low-resource languages lack enough web data—sometimes because communities don’t use Wikipedia extensively, leaving few pages to harvest. Even when training data exists, tokenization can quietly sabotage performance: older tokenizers (the example given is the Llama 2 tokenizer) can require many more tokens for scripts like Thai or Greek, forcing the model to learn meaning from near character-by-character pieces. That inefficiency can make multilingual learning harder, even if the model is otherwise capable.
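To make the fragmentation concrete, here is a minimal sketch that counts how many tokens different tokenizers need for the same sentence, using Hugging Face’s transformers library. The model IDs are illustrative choices, not from the video (and Llama 2 is gated on the Hub), so exact counts will vary:

```python
# Minimal sketch: count tokens for the same sentences under different
# tokenizers. Model IDs are illustrative; Llama 2 is gated on the Hub.
from transformers import AutoTokenizer

samples = {
    "English": "The weather is very nice today.",
    "Thai": "วันนี้อากาศดีมาก",
    "Greek": "Ο καιρός είναι πολύ καλός σήμερα.",
}

for repo in ["meta-llama/Llama-2-7b-hf", "google/gemma-3-4b-it"]:
    tok = AutoTokenizer.from_pretrained(repo)
    for lang, text in samples.items():
        ids = tok(text, add_special_tokens=False)["input_ids"]
        print(f"{repo:32s} {lang:8s} {len(ids):3d} tokens")
```

If the Thai sentence comes back as several times more tokens than the English one, the model is effectively forced to learn that language near character by character.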

Recent progress has improved multilingual coverage for major languages, helped by better tokenizers and multilingual datasets. Projects such as Translate Gemma (built on the Gemma 3 stack) and improvements in models like Llama 3 and Qwen variants are credited with expanding language support, often by adding multilingual data and using tokenizers with larger vocabularies (the transcript cites a 250k+ token vocabulary for Translate Gemma). But smaller models still lag because they’re typically trained on fewer tokens and their post-training recipes may not prioritize multilingual behavior.

Cohere’s response is a suite of tiny multilingual models released as both a research artifact and a set of practical starting points. The base model is pre-trained on 70+ languages, explicitly including data from many low-resource languages. On top of that, four post-trained variants are built from the same base: a “tiny global” instruction-tuned model meant to cover most pretraining languages, plus three region-specialized models created by mixing languages grouped by geography and linguistic relatedness.

Those region models are “tiny earth” (West Asia, Africa, and some European languages: Arabic, Turkish, and Hebrew, plus 10 African languages and 31 European languages), “tiny fire” (South Asia, emphasizing scripts that differ strongly: Hindi, Bengali, Tamil, and Nepali, with English often appearing via code-switching), and “tiny water” (Asia-Pacific languages such as Tagalog, Bahasa Indonesia, Vietnamese, Thai, and Chinese, plus low-resource languages like Khmer and Burmese, while also including some West Asia and European mixes). The post-training approach is described as region-specific SFT models that are then merged into the released variants.
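The video doesn’t spell out the merge recipe, but a common baseline for combining SFT checkpoints that share one base model is simple parameter averaging (“model souping”). The sketch below is an assumption for illustration, with made-up checkpoint names; it is not Cohere’s documented method:

```python
# Hypothetical sketch: merge several SFT checkpoints that share a base model
# by averaging their weights. Checkpoint names are made up for illustration;
# Cohere's actual merge recipe is not described in the video.
import torch
from transformers import AutoModelForCausalLM

checkpoints = ["my-org/sft-arabic", "my-org/sft-turkish", "my-org/sft-hebrew"]
models = [AutoModelForCausalLM.from_pretrained(c) for c in checkpoints]

merged_state = {}
for key in models[0].state_dict():
    # Uniform average; weighted or more elaborate merging schemes also exist
    merged_state[key] = torch.stack(
        [m.state_dict()[key].float() for m in models]
    ).mean(dim=0)

merged = models[0]
merged.load_state_dict(merged_state)
merged.save_pretrained("tiny-region-merged")
```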

A key technical detail is that Cohere trained its own tokenizer rather than using an off-the-shelf one. Efficiency varies by language: the transcript notes Cohere’s tokenizer can beat Gemma 3’s for some languages, while Gemma 3 may be slightly better for others, so results should be checked per target language. The suite also includes quantized versions intended for immediate use in lightweight runtimes (e.g., llama.cpp-style tooling). With ~3B models, the practical pitch is clear: these can run on phones and support mobile apps for countries and languages underserved by larger multilingual models.

Overall, the release positions Cohere’s tiny models as a practical option when mainstream multilingual models underperform, especially for specific regions and low-resource languages, while also fitting the growing trend toward better multilingual tokenization and multilingual-focused training recipes in smaller architectures.

Cornell Notes

Cohere’s new tiny multilingual models target a common failure mode in language AI: low-resource languages can be missing from training data and may be handled inefficiently by tokenizers that split non-Latin scripts into too many fragments. The suite starts with a ~3.3B-parameter base model pre-trained on 70+ languages, then adds post-trained variants designed for broader coverage (“tiny global”) and for region/script groupings (“tiny earth,” “tiny fire,” “tiny water”). The region models are built by merging region-specific SFT models, and Cohere trains its own tokenizer to improve token efficiency across languages. Quantized versions are also provided for easier deployment, including phone-scale use cases. The practical takeaway: pick the global model for breadth, or try a region model for better results on specific languages.

Why do low-resource languages often perform poorly in general-purpose multilingual models?

Two main issues are highlighted. First, training data may be scarce because those languages have limited internet presence—sometimes because communities don’t use Wikipedia much, leaving few pages to learn from. Second, tokenization can be inefficient: older tokenizers can turn scripts like Thai or Greek into many more tokens (sometimes near character-by-character), making it harder for the model to learn meaning from the same number of words.

How does Cohere’s model suite try to fix multilingual performance in small models?

It combines broad pretraining with multilingual-aware post-training. The base model is pre-trained on 70+ languages, including low-resource languages. Then four post-trained models are built from that base: “tiny global” for wide instruction-following coverage, plus three region-specialized variants created by mixing languages grouped by geography and relatedness.

What distinguishes the “tiny global” model from the region models?

“Tiny global” is instruction tuned and balanced to work across most languages seen during pretraining, making it the default choice when the goal is maximum coverage. The region models—“tiny earth,” “tiny fire,” and “tiny water”—are specialized by language grouping and are intended to improve performance for particular sets of languages and scripts.

What language groupings are used for the region-specialized models?

“Tiny earth” focuses on West Asia, Africa, and some European languages (including Arabic, Turkish, Hebrew, 10 African languages, and 31 European languages). “Tiny fire” targets South Asia with emphasis on scripts that differ strongly (Hindi, Bengali, Tamil, Nepali), while English may appear through code-switching. “Tiny water” covers Asia-Pacific languages (Tagalog, Bahasa Indonesia, Vietnamese, Thai, Chinese) and includes low-resource languages like Khmer and Burmese, plus some West Asia and European mixes.
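As a practical illustration of these groupings (coverage lists abridged, variant nicknames from the video), routing a request to a variant can be as simple as a lookup from language code to model, with the global model as the fallback:

```python
# Illustrative routing from ISO 639-1 language codes to a variant nickname.
# Coverage lists are abridged; this is a sketch, not an official mapping.
REGION_MODEL = {
    "ar": "tiny-earth", "tr": "tiny-earth", "he": "tiny-earth",  # West Asia
    "hi": "tiny-fire", "bn": "tiny-fire", "ta": "tiny-fire",     # South Asia
    "ne": "tiny-fire",
    "tl": "tiny-water", "vi": "tiny-water", "th": "tiny-water",  # Asia-Pacific
    "zh": "tiny-water", "id": "tiny-water", "my": "tiny-water",
}

def pick_model(lang_code: str) -> str:
    """Return a region specialist if one covers the language, else the global model."""
    return REGION_MODEL.get(lang_code, "tiny-global")

print(pick_model("th"))  # tiny-water
print(pick_model("sw"))  # tiny-global (fallback for anything unlisted)
```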

Why does tokenizer choice matter, and what does Cohere do differently?

Tokenizer efficiency affects how many tokens represent the same meaning, which changes how easily a model can learn. The transcript notes that Cohere trained its own tokenizer rather than using an off-the-shelf one. Reported results suggest Cohere’s tokenizer can be more efficient than Gemma 3’s for some languages, while Gemma 3 can be slightly better for others, so performance should be evaluated per target language.
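One quick way to run that per-language check is to compare tokens per character on sample text in the target language (useful for scripts like Thai that don’t separate words with spaces); lower generally means a more efficient tokenizer. The repo IDs below are placeholders, so substitute the actual Hub names from the release:

```python
# Per-language tokenizer efficiency check: tokens per character on sample
# text. Repo IDs are placeholders; use the actual model names from the Hub.
from transformers import AutoTokenizer

def tokens_per_char(tok, texts):
    n_tokens = sum(len(tok(t, add_special_tokens=False)["input_ids"]) for t in texts)
    n_chars = sum(len(t) for t in texts)
    return n_tokens / n_chars

thai_samples = ["วันนี้อากาศดีมาก", "ฉันชอบเรียนภาษาไทย"]
for repo in ["cohere-example/tiny-global", "google/gemma-3-4b-it"]:
    tok = AutoTokenizer.from_pretrained(repo)
    print(f"{repo}: {tokens_per_char(tok, thai_samples):.2f} tokens/char")
```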

How are these models meant to be used in practice?

They’re small enough (~3.3B parameters) to support lightweight deployment. The suite includes quantized versions for immediate use in common tooling (the transcript mentions running them in a llama.cpp-style setup). The practical recommendation is to start with the global model or swap in a region model, then fine-tune for the specific language(s) needed, especially for mobile apps in countries where larger multilingual models underperform.
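As a minimal local-inference sketch, assuming a quantized GGUF file has been downloaded (the filename below is hypothetical), llama-cpp-python can load the model and serve chat completions:

```python
# Minimal local-inference sketch with llama-cpp-python. The GGUF filename is
# hypothetical; download the actual quantized file from the model's page.
from llama_cpp import Llama

llm = Llama(model_path="tiny-global-q4_k_m.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Magandang umaga! Kumusta ka?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

From there, fine-tuning for a specific language would typically happen on the unquantized checkpoint before re-quantizing for deployment.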

Review Questions

  1. If a target language has limited Wikipedia presence, which two factors described here are most likely to hurt model performance?
  2. When would you choose “tiny global” over “tiny earth,” “tiny fire,” or “tiny water,” and why?
  3. How can tokenizer efficiency change the number of tokens needed to represent the same text, and what impact might that have on learning?

Key Points

  1. Models can fail on low-resource languages due to limited training data and inefficient tokenization that fragments non-Latin scripts into many tokens.
  2. Cohere’s tiny suite starts with a base model pre-trained on 70+ languages, including low-resource languages.
  3. “Tiny global” is instruction tuned for broad multilingual coverage, while the other variants specialize by region/script grouping.
  4. “Tiny earth,” “tiny fire,” and “tiny water” are built by merging region-specific SFT models into released multilingual variants.
  5. Cohere trained its own tokenizer, and token efficiency varies by language compared with Gemma 3’s tokenizer.
  6. Quantized versions are provided for easier deployment, supporting phone-scale use cases and mobile app development.
  7. For best results, evaluate per-language performance and consider fine-tuning rather than relying on a single general model.

Highlights

Tokenization inefficiency can turn some languages into many more tokens than English—making multilingual learning harder even when a model is “multilingual.”
Cohere’s tiny models combine 70+ language pretraining with region-focused post-training created by merging region-specific SFT models.
The suite includes a broad “tiny global” model plus three region specialists: “tiny earth,” “tiny fire,” and “tiny water,” each mapped to specific language sets.
A custom tokenizer is trained for these models, with reported efficiency advantages for some languages versus Gemma 3 and tradeoffs for others.
With ~3B parameters and quantized releases, these models are positioned for real deployment, including on phones.
