
Kyutai STT & TTS - A Perfect Local Voice Solution?

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Kyutai’s TTS and speech-to-text releases target fast local speech workflows, currently supporting English and French.

Briefing

Kyutai’s latest local speech stack (Kyutai TTS for text-to-speech and Kyutai STT for ASR) pairs fast, small models with voice conditioning that can sound remarkably natural, including on-the-fly voice cloning from a short sample. The practical catch: Kyutai is not releasing the voice embedding/cloning model itself, which limits users to a curated set of downloadable voice embeddings and prevents straightforward fine-tuning for new languages or custom voices.

The ASR side (speech-to-text) is currently limited to English and French, but it transcribes quickly and appears to run efficiently on capable hardware. The bigger story is the architecture Kyutai has been building toward: low-latency speech in, speech out, with ASR feeding into a larger language model and then returning through TTS. Earlier experiments reportedly showed odd behavior when prompting the integrated model, but the latency-focused design remained the key attraction—making local, interactive voice workflows more feasible.

On the TTS side, Kyutai TTS uses a 1.6B-parameter model and supports English and French, with multiple selectable voices. The quality is framed as competitive with other well-known TTS systems (including comparisons to Chatterbox, Dia, and ElevenLabs), and the release includes a strong demonstration of voice cloning: TTS is conditioned on a 10-second voice sample and reproduces intonation and voice characteristics even when the sample is “weird” or stylized. In the example, the system quickly generates speech in the target voice, suggesting the model can capture prosody and identity cues from limited audio.

Where the release frustrates power users is in what’s missing. Kyutai provides voice embeddings and a repository of voices based on datasets such as Expresso and VCTK, but explicitly withholds the voice embedding model to ensure cloning happens only with consent. That means developers can’t directly fine-tune the cloning pipeline for other languages or train their own voice-to-embedding mapping without additional tooling or data. The transcript notes Kyutai is “exploring ideas” for expanding beyond English and French, but no fine-tuning path is available yet.

The practical workaround comes from the released materials on Hugging Face and Kyutai’s GitHub code examples. The voices are distributed as safetensors embeddings (with a safetensors download option for the embeddings and a web option for listening to them). By loading an embedding, conditioning the Kyutai TTS model on it, and generating audio from custom text, the user can produce speech in the provided voices. The transcript also shows experimentation with the voice embeddings: inspecting their tensor shape, saving them, and blending two voices by averaging their embeddings to create an in-between “latent space” voice. The result is speech that sits between the two source voices rather than simply switching between them.
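As a concrete illustration of the inspection step, here is a minimal sketch that loads one of the released voice embeddings with the safetensors library, prints the tensor shapes, and writes it back out. The file paths are placeholders, not the actual names used in Kyutai’s voice repository.

```python
from safetensors.torch import load_file, save_file

# Load one of the released voice embeddings (placeholder path).
voice = load_file("voices/example_voice.safetensors")

# Inspect what the file contains: each entry is a named tensor.
for name, tensor in voice.items():
    print(name, tuple(tensor.shape), tensor.dtype)

# The embedding can be written back out, unchanged or after editing.
save_file(voice, "my_saved_voice.safetensors")
```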

Overall, the release is positioned as a strong local foundation for ASR and TTS, with additional promise if/when an MLX version arrives for Mac hardware. For now, the system delivers fast, high-quality speech synthesis and usable voice control—just not the full cloning model needed for unrestricted customization.

Cornell Notes

Kyutai’s TTS and STT releases bring fast local speech capabilities, currently limited to English and French. Kyutai TTS uses a 1.6B model and can generate natural-sounding speech conditioned on voice identity, including a demo of voice cloning from a 10-second sample. The quality is competitive in side-by-side comparisons, but Kyutai does not release the voice embedding/cloning model directly. Instead, users get a repository of pre-made voice embeddings (safetensors files) and can run Kyutai’s code to synthesize speech, pick voices, and even blend voices by averaging embeddings. This enables experimentation, but it blocks straightforward fine-tuning for new languages or fully custom voice cloning.

What makes Kyutai’s speech stack feel “local” and low-latency, and where does the transcript say the key workflow fits?

The transcript emphasizes a pipeline designed for quick turn-taking: speech-to-text (ASR) feeds into a language model and then returns through text-to-speech (TTS). The original experiments focused on low latency entering and leaving the model, and the new releases provide separate ASR and TTS components that can run locally on decent hardware. The ASR demo is described as very quick transcription, while Kyutai TTS is described as fast generation with a relatively small 1.6B model.
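A rough sketch of that turn-taking loop is below; the three callables stand for whichever local ASR, LLM, and TTS components are wired together, not Kyutai’s actual API.

```python
def voice_turn(audio_in, transcribe, respond, synthesize):
    """One low-latency turn: speech in, speech out."""
    text_in = transcribe(audio_in)   # local ASR (e.g. Kyutai STT): speech -> text
    text_out = respond(text_in)      # any local LLM: text -> text
    return synthesize(text_out)      # local TTS (e.g. Kyutai TTS): text -> speech
```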

How does Kyutai TTS handle voice identity, and what evidence is given for voice cloning quality?

Kyutai TTS conditions its output on voice identity, with the transcript highlighting a voice cloning demo conditioned on a 10-second voice sample. The generated speech is described as reproducing intonation and voice characteristics even when the sample is stylized. The transcript also notes that multiple voices can be selected for TTS, and that the output sounds “pretty good” in the provided examples.

Why can’t users fully clone arbitrary voices or fine-tune the system for other languages, according to the transcript?

Kyutai does not release the voice embedding model itself. The transcript says this is done to ensure voices are only cloned consensually, so instead of releasing the embedding model, Kyutai provides a repository of voices based on dataset samples such as Expresso and VCTK. As a result, users can’t easily fine-tune the cloning pipeline for new languages or train a voice-to-embedding mapping using the released components alone.

What exactly is released for voice control, and how does the transcript use it in code?

The release includes pre-made voice embeddings distributed as safetensors files, plus a web interface for listening to the voices. In the code walkthrough, the user loads a chosen voice’s safetensors embedding, feeds it into the model’s conditioning step, preprocesses the text, and then generates audio. The transcript shows that once the voice embedding is set, subsequent generations only require text preprocessing and generation.
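The shape of that walkthrough can be sketched as a small wrapper. Note that `make_condition`, `prepare_text`, and `generate` below are stand-in method names, not Kyutai’s actual API; the real entry points live in the GitHub examples. Only the safetensors loading reflects a concrete library call.

```python
from safetensors.torch import load_file


class VoiceSession:
    """Set the voice conditioning once, then generate many utterances."""

    def __init__(self, tts_model, voice_path: str):
        self.tts = tts_model
        # Load the released voice embedding (a dict of named tensors).
        self.voice = load_file(voice_path)
        # Hypothetical call: build the conditioning from the embedding once.
        self.condition = self.tts.make_condition(self.voice)

    def speak(self, text: str):
        # Per-utterance work is only text preprocessing plus generation.
        tokens = self.tts.prepare_text(text)              # hypothetical preprocessing
        return self.tts.generate(tokens, self.condition)  # hypothetical generation
```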

How does the transcript demonstrate creating new “in-between” voices from existing embeddings?

It loads two different voice embeddings and blends them by averaging in the latent space. The transcript describes listening to each source voice first, then averaging them to save a blended embedding file, and finally generating audio using that blended embedding. The resulting speech is described as between the two voices rather than matching either one exactly.
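A minimal sketch of that blending step, assuming both voices are stored as safetensors files with matching keys and shapes (the file names are placeholders):

```python
from safetensors.torch import load_file, save_file

voice_a = load_file("voices/voice_a.safetensors")
voice_b = load_file("voices/voice_b.safetensors")

# Average the two embeddings key by key to get an "in-between" voice.
blended = {name: (voice_a[name] + voice_b[name]) / 2 for name in voice_a}

save_file(blended, "voices/voice_blend.safetensors")
# Conditioning the TTS model on this blended file yields speech that sits
# between the two source voices rather than matching either one.
```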

What dataset and model-labeling details are mentioned for training the TTS system?

The transcript points to the Kyutai TTS model card on Hugging Face, noting that training used 2.5 million hours of data. It also says the data was labeled using Whisper Medium, which ties the training pipeline to Whisper-based transcription/labeling.

Review Questions

  1. What components make up Kyutai’s local speech workflow, and how does the transcript connect ASR output to TTS input through a low-latency pipeline?
  2. How does the transcript’s code approach voice conditioning using safetensors embeddings, and what steps change when generating multiple utterances with the same voice?
  3. What are the practical consequences of Kyutai not releasing the voice embedding model, and how does embedding blending partially compensate for that limitation?

Key Points

  1. Kyutai’s TTS and speech-to-text releases target fast local speech workflows, currently supporting English and French.

  2. Kyutai TTS uses a 1.6B model and can produce natural-sounding speech conditioned on voice identity, including a 10-second voice-sample cloning demo.

  3. Kyutai provides voice embeddings and a voice repository, but withholds the voice embedding/cloning model to restrict non-consensual cloning.

  4. Users can run the Hugging Face/GitHub examples to load safetensors voice embeddings, condition the TTS model, and generate audio from custom text.

  5. Voice embeddings can be inspected and manipulated in PyTorch, including saving and blending embeddings by averaging to create intermediate voices.

  6. The transcript highlights the training scale for Kyutai TTS: 2.5 million hours of data labeled with Whisper Medium.

  7. An MLX version is flagged as a likely next step for running locally on a Mac laptop.

Highlights

Kyutai TTS is conditioned on a 10-second voice sample and can reproduce intonation and voice characteristics quickly, even with stylized input.
The release is strong on usability, pairing safetensors voice embeddings with working code, but Kyutai does not release the voice embedding model needed for full cloning and fine-tuning.
Blending two voices by averaging their embeddings produces speech that lands between the two source voices in latent space.
The Kyutai TTS model card cites 2.5 million hours of training data labeled with Whisper Medium, signaling a large-scale training pipeline.
