
MetaVoice 1B - TTS & Voice Cloning

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

MetaVoice 1B is an Apache-licensed, 1.2B-parameter open TTS model released with a GitHub repo and notebook workflow for experimentation.

Briefing

MetaVoice has released MetaVoice 1B, a 1.2B-parameter, Apache-licensed text-to-speech model aimed at open experimentation—along with a GitHub repo and a Colab-style workflow for generating speech and testing voice cloning. The headline promise is zero-shot voice cloning for American and British accents using roughly 30 seconds of reference audio; the model was trained on 100,000 hours of speech data, an unusually large corpus for a startup release. MetaVoice also positions the system around “emotional speech” (rhythm and tone) in English and claims it avoids “hallucinations” such as inventing plausible-sounding words—an issue that earlier TTS models, including Bark in the creator’s prior tests, sometimes exhibited.

Architecturally, MetaVoice 1B blends multiple Transformer components—both causal and non-causal—followed by a multi-band diffusion stage. A separate neural module described as a “deep filter net” then removes unwanted artifacts, including stray non-speech sounds. MetaVoice offers a hosted demo, but the claims are best judged through hands-on generation: the system can produce recognizable celebrity-like voices, yet reliability varies. Some generations succeed cleanly; others crash or error out, suggesting that usage load, runtime constraints, or edge-case inputs may affect stability.
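
To make the stage ordering concrete, here is a minimal Python sketch of the described pipeline. Every class and method name below is hypothetical—the summary names the stages but not the repo's actual API—so this shows only the flow of data between them.

```python
# Hypothetical sketch of the described pipeline stages. None of these
# classes exist under these names in the MetaVoice repo; they illustrate
# only the order of operations the briefing describes.

import numpy as np

class CausalTransformer:
    def predict_tokens(self, text: str, speaker: np.ndarray) -> np.ndarray:
        """Autoregressively predict coarse audio tokens from text + speaker."""

class NonCausalTransformer:
    def refine(self, coarse_tokens: np.ndarray) -> np.ndarray:
        """Refine coarse tokens with full bidirectional context."""

class MultiBandDiffusion:
    def decode(self, tokens: np.ndarray) -> np.ndarray:
        """Decode tokens to a waveform, denoising per frequency band."""

class DeepFilterNet:
    def clean(self, waveform: np.ndarray) -> np.ndarray:
        """Filter residual artifacts out of the generated waveform."""

def synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    coarse = CausalTransformer().predict_tokens(text, speaker_embedding)
    fine = NonCausalTransformer().refine(coarse)
    raw_audio = MultiBandDiffusion().decode(fine)
    return DeepFilterNet().clean(raw_audio)
```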

The practical test centers on a notebook that installs required dependencies (including Flash Attention) and lets users upload or point to reference voice files (WebM or MP3, with the workflow assuming 30 seconds per voice). Users then generate speech by adjusting key parameters: the speaker conditional path (which selects the reference voice), the input text, temperature, guidance scale, and an output path. The creator focuses on temperature and guidance scale because they strongly influence accent flavor and intelligibility.
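
As a rough illustration of what a notebook cell driving the repo might look like, the sketch below shells out to an assumed sampling script. The script path and flag names are inferred from the parameters named in the video (speaker conditional path, text, temperature, guidance scale, output path) and should be checked against the actual MetaVoice repo.

```python
# Rough sketch of invoking the sampling script from a notebook cell.
# The script path and flag names below are assumptions inferred from the
# parameters the video mentions; verify them against the repo's README.

import subprocess

subprocess.run(
    [
        "python", "fam/llm/sample.py",                # assumed entry point
        "--huggingface_repo_id", "metavoiceio/metavoice-1B-v0.1",
        "--spk_cond_path", "reference_30s.mp3",       # ~30 s reference clip
        "--text", "MetaVoice 1B is an open text-to-speech model.",
        "--temperature", "1.0",                       # accent/style lever
        "--guidance_scale", "3.0",                    # conditioning strength
        "--output_dir", "outputs",                    # assumed flag name
    ],
    check=True,
)
```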

Results are mixed in a specific way. The model often avoids outright hallucinated words, but it can drop content—producing silence or missing segments—especially with certain voices and parameter settings. For example, the Mark Zuckerberg reference voice sometimes sounds decent but can feel rushed, while other settings lead to generated silence or truncated output. The Lex Fridman-style voice can capture “vibe” and accent shifts as temperature changes, yet still shows cases where text disappears and the model outputs silence. A built-in female voice sample appears to work more consistently, and tuning temperature can shift the output between American-leaning and British-leaning pronunciations.

Overall, MetaVoice 1B looks promising for open-source TTS and voice cloning, particularly because it targets emotional prosody and claims to reduce hallucinations. But the hands-on experience suggests it still struggles with completeness and robustness: it may generate silence or omit words even when hallucination is reduced. The most anticipated next step is the release of fine-tuning scripts, which could let users train on longer recordings (beyond the 30-second reference) to improve fidelity—potentially narrowing the gap with leading proprietary systems, though the current results still fall short of top-tier commercial voice models like SoundStorm and OpenAI’s offerings.

Cornell Notes

MetaVoice 1B is an Apache-licensed, 1.2B-parameter text-to-speech model released with an open GitHub repo and a notebook workflow for generating speech and testing voice cloning. It’s trained on 100,000 hours of speech data and is marketed for zero-shot cloning of American and British accents using about 30 seconds of reference audio. The model blends causal and non-causal Transformers with a multi-band diffusion process and a “deep filter net” to reduce unwanted artifacts. In hands-on tests, hallucinated words are less common than in some prior TTS systems, but generations often drop content—sometimes producing silence—depending heavily on temperature and guidance scale. Stability and output completeness remain key limitations until fine-tuning tools arrive.

What does MetaVoice 1B claim to deliver for voice cloning, and what data does it require?

MetaVoice 1B is positioned for zero-shot voice cloning of American and British voices using roughly 30 seconds of reference audio. The model is also described as trained on 100,000 hours of speech data, which is presented as a major scale advantage for a startup release. The workflow in the repo supports selecting a “speaker conditional path” that points to a reference voice file (WebM or MP3) and then generating speech from user-provided text.
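
Since the workflow assumes about 30 seconds of reference audio per voice, a small preprocessing helper can standardize uploads before generation. The sketch below uses pydub (which needs ffmpeg installed to decode WebM and MP3); it is a generic utility, not part of the MetaVoice repo.

```python
# Trim an uploaded reference clip to the ~30 seconds the workflow assumes.
# Uses pydub (pip install pydub), which needs ffmpeg on the PATH to decode
# WebM and MP3. This helper is generic, not part of the MetaVoice repo.

from pydub import AudioSegment

def prepare_reference(in_path: str, out_path: str, seconds: int = 30) -> str:
    clip = AudioSegment.from_file(in_path)   # handles .webm and .mp3
    clip = clip[: seconds * 1000]            # pydub slices in milliseconds
    clip.export(out_path, format="mp3")
    return out_path

prepare_reference("upload.webm", "reference_30s.mp3")
```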

How does the model’s architecture combine Transformers and diffusion?

The described architecture uses both causal and non-causal Transformers to produce intermediate representations, which then feed into a multi-band diffusion process. A separate neural component called a “deep filter net” is applied afterward to remove stray non-speech sounds and other unwanted artifacts, aiming to improve output quality and reduce problematic generation behaviors.

Which generation parameters most affect the output quality, and why?

Temperature and guidance scale are treated as the main levers. Temperature changes the model’s style and accent tendencies—at some settings it leans more American, while other settings shift toward British. Guidance scale also affects how strongly the generation follows the conditioning; adjusting both can improve resemblance for some voices, but the same tuning can also cause missing text or silence for others.
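
A simple way to explore these two levers is a small grid sweep over temperature and guidance scale, listening to each output. The `generate` function below is a hypothetical stand-in for whatever entry point the notebook actually exposes.

```python
# Grid sweep over temperature and guidance scale to compare outputs.
# `generate` is a hypothetical stand-in for the notebook's entry point;
# wire it to the actual MetaVoice sampling code before running.

from itertools import product

TEXT = "The quick brown fox jumps over the lazy dog."
REFERENCE = "reference_30s.mp3"

def generate(text: str, spk_cond_path: str, temperature: float,
             guidance_scale: float, output_path: str) -> None:
    ...  # placeholder: call the MetaVoice sampling code here

for temp, guidance in product([0.7, 1.0, 1.4], [2.0, 3.0]):
    out_path = f"outputs/sweep_t{temp}_g{guidance}.wav"
    generate(TEXT, REFERENCE, temperature=temp,
             guidance_scale=guidance, output_path=out_path)
    print(f"wrote {out_path}")
```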

What failure mode shows up even when hallucinated words are reduced?

Instead of inventing plausible words, the model can drop content entirely. In multiple tests, the output becomes silent or omits segments of the intended text. This suggests that while “hallucination” may be less frequent, completeness and robustness still depend on the chosen voice reference and parameter settings.
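
Because the dominant failure mode is missing audio rather than invented words, a cheap automated sanity check on duration and signal energy can flag suspect generations. The heuristic below uses soundfile and numpy with rough, made-up thresholds; it is not from the MetaVoice repo.

```python
# Heuristic check for dropped content: flag outputs that are far shorter
# than the text implies or that are nearly silent. Thresholds are rough
# guesses for illustration, not values from the MetaVoice repo.

import numpy as np
import soundfile as sf

def looks_truncated(wav_path: str, text: str,
                    min_sec_per_word: float = 0.15,
                    silence_rms: float = 1e-3) -> bool:
    audio, sample_rate = sf.read(wav_path)
    if audio.ndim > 1:                  # mix stereo down to mono
        audio = audio.mean(axis=1)
    duration = len(audio) / sample_rate
    expected = len(text.split()) * min_sec_per_word
    rms = float(np.sqrt(np.mean(audio ** 2)))
    return duration < expected or rms < silence_rms

if looks_truncated("outputs/sample.wav", "the full input text goes here"):
    print("Output may be silent or truncated; regenerate with new settings.")
```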

How does reference voice choice influence results?

Different reference voices behave differently under the same workflow. The built-in female voice sample appears to generate more cleanly, while the Zuckerberg-style and Lex Fridman-style references show more variability—sometimes sounding decent but rushed, and other times producing silence or missing audio. The creator’s experiments indicate that some voices may be better aligned with the model’s training or conditioning pipeline.

What’s the practical next step that could improve fidelity?

Fine-tuning scripts are described as “coming soon.” Once available, users could fine-tune the model using longer recordings than the initial 30-second reference—potentially 5, 10, or 20 minutes—to get closer to a real voice. That fine-tuning step is framed as the path to better consistency and higher resemblance than zero-shot cloning alone.

Review Questions

  1. How do temperature and guidance scale interact to change accent and intelligibility in MetaVoice 1B outputs?
  2. What specific generation problem can occur even when hallucinated words are reduced, and how does it appear in the test results?
  3. Why might reference voice choice (built-in vs uploaded celebrity-like samples) lead to different reliability outcomes?

Key Points

  1. MetaVoice 1B is an Apache-licensed, 1.2B-parameter open TTS model released with a GitHub repo and notebook workflow for experimentation.

  2. The model is marketed for zero-shot voice cloning of American and British accents using about 30 seconds of reference audio.

  3. MetaVoice 1B combines causal/non-causal Transformers with a multi-band diffusion stage and a “deep filter net” to reduce unwanted artifacts.

  4. Hands-on generations often avoid hallucinated words, but they can drop content—producing silence or missing segments—depending on settings.

  5. Temperature and guidance scale strongly influence accent character and output completeness, with different reference voices responding differently.

  6. The repo workflow supports WebM or MP3 reference uploads and requires installing dependencies such as Flash Attention for smoother runs.

  7. Upcoming fine-tuning scripts are expected to improve fidelity by allowing training on longer audio than the initial 30-second reference.

Highlights

MetaVoice 1B targets zero-shot voice cloning using ~30 seconds of reference audio, with claims of emotional prosody and fewer hallucinated words.
The architecture described mixes Transformers (causal and non-causal) with multi-band diffusion and a deep filter net for artifact removal.
In practice, the biggest weakness isn’t word invention—it’s missing content, including silence, that varies with temperature, guidance scale, and the chosen reference voice.
Fine-tuning support is positioned as the key upgrade path for higher fidelity, potentially using 5–20 minutes of audio.
