Tiny Aya - Cohere's Mini Multilingual Models
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Choosing a language model for non-English languages is often a guessing game, especially for low-resource languages with limited internet data and tokenization schemes that break words into inefficient fragments. The core insight here is that Cohere's new "tiny" multilingual models are designed to close that gap by combining broad pretraining across 70+ languages with region-focused post-training, then packaging the result into small ~3.3B-parameter models that can be run and fine-tuned more easily than large multilingual systems.
The transcript lays out why recommendations are hard. Some languages may never have appeared in training, largely because low-resource languages lack enough web data—sometimes because communities don’t use Wikipedia extensively, leaving few pages to harvest. Even when training data exists, tokenization can quietly sabotage performance: older tokenizers (the example given is the Llama 2 tokenizer) can require many more tokens for scripts like Thai or Greek, forcing the model to learn meaning from near character-by-character pieces. That inefficiency can make multilingual learning harder, even if the model is otherwise capable.
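To make the tokenization point concrete, the snippet below counts how many tokens different tokenizers need for the same Thai sentence and prints each tokenizer's vocabulary size. This is a minimal sketch assuming the Hugging Face `transformers` library; the checkpoint IDs are illustrative (some are gated on the Hub), so substitute any tokenizers you can actually access.

```python
# Compare token counts for the same Thai text under different tokenizers.
# Assumes `transformers` is installed; the repo IDs below are illustrative
# and may require Hub access approval -- swap in any tokenizers you can load.
from transformers import AutoTokenizer

TEXT = "สวัสดีครับ วันนี้อากาศดีมาก"  # roughly: "Hello, the weather is very nice today"

for repo in ["meta-llama/Llama-2-7b-hf", "google/gemma-3-4b-it"]:
    tok = AutoTokenizer.from_pretrained(repo)
    ids = tok.encode(TEXT, add_special_tokens=False)
    print(f"{repo}: vocab_size={tok.vocab_size}, tokens={len(ids)}")
```

A tokenizer with a small or Latin-heavy vocabulary will typically emit far more tokens for the Thai text, which is exactly the near character-by-character fragmentation described above.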
Recent progress has improved multilingual coverage for major languages, helped by better tokenizers and multilingual datasets. Projects such as Translate Gemma (built on the Gemma 3 stack) and improvements in models like Llama 3 and Qwen variants are credited with expanding language support, often by adding multilingual data and using larger tokenizers (the transcript cites 250k+ vocabularies for Translate Gemma). But smaller models still lag because they're typically trained on fewer tokens and their post-training recipes may not prioritize multilingual behavior.
Cohere's response is a suite of tiny multilingual models released as both a research artifact and practical starting points. The base model is pre-trained on 70+ languages, explicitly including data from many low-resource languages. On top of that, four post-trained variants are built from the same base: a "tiny global" instruction-tuned model meant to cover most pretraining languages, plus three region-specialized models created by mixing languages grouped by geography and linguistic relatedness.
Those region models are "tiny earth" (West Asia, Africa, and some European languages: Arabic, Turkish, Hebrew, plus 10 African languages and 31 European languages), "tiny fire" (South Asia, emphasizing scripts that differ strongly: Hindi, Bengali, Tamil, Nepali, with English often appearing via code-switching), and "tiny water" (Asia-Pacific languages such as Tagalog, Bahasa, Vietnamese, Thai, Chinese, plus low-resource languages like Cham and Burmese, while also including some West Asia and European mixes). The post-training approach is described as training region-specific SFT models and then merging them into the released variants, as sketched below.
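The video does not spell out Cohere's merge recipe, so the following is only a generic sketch of the simplest baseline: a uniform linear average of matching weights across region-specific SFT checkpoints that share one architecture. The file names and the averaging scheme are assumptions for illustration; real merges often use more sophisticated tooling (e.g., mergekit) and tuned per-model weights.

```python
# Generic linear merge of SFT checkpoints with identical architectures.
# This is an illustrative baseline, not Cohere's actual recipe; the file
# names below are placeholders.
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Return a weighted average of matching tensors across checkpoints."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

sft_checkpoints = [
    torch.load("sft_region_a.pt", map_location="cpu"),  # placeholder paths
    torch.load("sft_region_b.pt", map_location="cpu"),
]
merged = merge_state_dicts(sft_checkpoints)
torch.save(merged, "merged_region_model.pt")
```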
A key technical detail is that Cohere trained its own tokenizer rather than using an off-the-shelf one. Efficiency varies by language: the transcript notes Cohere's tokenizer can beat Gemma 3's for some languages, while Gemma 3 may be slightly better for others, so results should be checked per target language. The suite also includes quantized versions intended for immediate use in lightweight runtimes (e.g., via llama.cpp-style tooling). With ~3B models, the practical pitch is clear: these can run on phones and support mobile apps for countries and languages underserved by larger multilingual models.
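For the deployment side, here is a minimal sketch of loading a quantized GGUF export with the `llama-cpp-python` bindings. The model file name is a placeholder rather than a confirmed release artifact, and the prompt simply requests an Arabic-to-English translation as a plausible regional use case.

```python
# Run a quantized model locally via llama-cpp-python (pip install llama-cpp-python).
# The GGUF file name is a placeholder -- point it at whatever quantized
# export you actually have on disk.
from llama_cpp import Llama

llm = Llama(model_path="tiny-global-q4_k_m.gguf", n_ctx=2048)
resp = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "ترجم هذه الجملة إلى الإنجليزية: الطقس جميل اليوم"}],
)
print(resp["choices"][0]["message"]["content"])
```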
Overall, the release positions Cohere's tiny models as a practical option when mainstream multilingual models underperform, especially for specific regions and low-resource languages, while also fitting the growing trend toward better multilingual tokenization and multilingual-focused training recipes in smaller architectures.
Cornell Notes
Cohere's new tiny multilingual models target a common failure mode in language AI: low-resource languages can be missing from training data and may be handled inefficiently by tokenizers that split non-Latin scripts into too many fragments. The suite starts with a ~3.3B-parameter base model pre-trained on 70+ languages, then adds post-trained variants designed for broader coverage ("tiny global") and for region/script groupings ("tiny earth," "tiny fire," "tiny water"). The region models are built by merging region-specific SFT models, and Cohere trains its own tokenizer to improve token efficiency across languages. Quantized versions are also provided for easier deployment, including phone-scale use cases. The practical takeaway: pick the global model for breadth, or try a region model for better results on specific languages.
Why do low-resource languages often perform poorly in general-purpose multilingual models?
How does Cohere's model suite try to fix multilingual performance in small models?
What distinguishes the “tiny global” model from the region models?
What language groupings are used for the region-specialized models?
Why does tokenizer choice matter, and what does Cohere do differently?
How are these models meant to be used in practice?
Review Questions
- If a target language has limited Wikipedia presence, which two factors described here are most likely to hurt model performance?
- When would you choose “tiny global” over “tiny earth,” “tiny fire,” or “tiny water,” and why?
- How can tokenizer efficiency change the number of tokens needed to represent the same text, and what impact might that have on learning?
Key Points
1. Low-resource languages can fail due to limited training data and inefficient tokenization that fragments non-Latin scripts into many tokens.
2. Cohere's tiny suite starts with a base model pre-trained on 70+ languages, including low-resource languages.
3. "Tiny global" is instruction-tuned for broad multilingual coverage, while the other variants specialize by region/script grouping.
4. "Tiny earth," "tiny fire," and "tiny water" are built by merging region-specific SFT models into released multilingual variants.
5. Cohere trained its own tokenizer, and token efficiency varies by language compared with Gemma 3's tokenizer.
6. Quantized versions are provided for easier deployment, supporting phone-scale use cases and mobile app development.
7. For best results, evaluate per-language performance and consider fine-tuning rather than relying on a single general model.