MetaVoice 1B - TTS & Voice Cloning
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
MetaVoice 1B is an Apache-licensed, 1.2B-parameter open TTS model released with a GitHub repo and notebook workflow for experimentation.
Briefing
MetaVoice has released MetaVoice 1B, a 1.2B-parameter, Apache-licensed text-to-speech model aimed at open experimentation, along with a GitHub repo and a Colab-style workflow for generating speech and testing voice cloning. The headline promise is zero-shot voice cloning for American and British accents from roughly 30 seconds of reference audio, with training on 100,000 hours of speech data, an unusually large corpus for a startup. MetaVoice also positions the system around “emotional speech” (rhythm and tone) in English and claims it avoids “hallucinations” such as inventing plausible-sounding words, an issue that earlier TTS models, including Bark in the creator’s prior tests, sometimes exhibited.
Architecturally, MetaVoice 1B blends multiple Transformer components, both causal and non-causal, followed by a multi-band diffusion stage. A separate neural module described as a “deep filter net” then removes unwanted artifacts, including stray non-speech sounds. In practice, the model ships with a demo and is best judged through hands-on generation: the system can produce recognizable celebrity-like voices, but reliability varies. Some generations succeed cleanly; others crash or error out, suggesting that usage load, runtime constraints, or edge-case inputs may affect stability.
The practical test centers on a notebook that installs the required dependencies (including Flash Attention) and lets users upload or point to reference voice files (WebM or MP3, with the workflow assuming 30 seconds per voice). Users then generate speech by adjusting a few key parameters: the speaker-conditioning path (which selects the reference voice), the input text, temperature, guidance scale, and an output path. The creator focuses on temperature and guidance scale because they strongly influence accent flavor and intelligibility.
Results are mixed in a specific way. The model often avoids outright hallucinated words, but it can drop content, producing silence or missing segments, especially with certain voices and parameter settings. For example, the Mark Zuckerberg reference voice sometimes sounds decent but can feel rushed, while other settings lead to generated silence or truncated output. The Lex Fridman-style voice can capture the “vibe” and accent shifts as temperature changes, yet still shows cases where text disappears and the model outputs silence. A built-in female voice sample appears to work more consistently, and tuning temperature can shift the output between American-leaning and British-leaning pronunciations.
Overall, MetaVoice 1B looks promising for open-source TTS and voice cloning, particularly because it targets emotional prosody and claims to reduce hallucinations. But the hands-on experience suggests it still struggles with completeness and robustness: it may generate silence or omit words even when hallucination is reduced. The most anticipated next step is the release of fine-tuning scripts, which could let users train on longer recordings (beyond the 30-second reference) to improve fidelity—potentially narrowing the gap with leading proprietary systems, though the current results still fall short of top-tier commercial voice models like SoundStorm and OpenAI’s offerings.
Cornell Notes
MetaVoice 1B is an Apache-licensed, 1.2B-parameter text-to-speech model released with an open GitHub repo and a notebook workflow for generating speech and testing voice cloning. It’s trained on 100,000 hours of speech data and is marketed for zero-shot cloning of American and British accents using about 30 seconds of reference audio. The model blends causal and non-causal Transformers with a multi-band diffusion process and a “deep filter net” to reduce unwanted artifacts. In hands-on tests, hallucinated words are less common than in some prior TTS systems, but generations often drop content—sometimes producing silence—depending heavily on temperature and guidance scale. Stability and output completeness remain key limitations until fine-tuning tools arrive.
What does MetaVoice 1B claim to deliver for voice cloning, and what data does it require?
How does the model’s architecture combine Transformers and diffusion?
Which generation parameters most affect the output quality, and why?
What failure mode shows up even when hallucinated words are reduced?
How does reference voice choice influence results?
What’s the practical next step that could improve fidelity?
Review Questions
- How do temperature and guidance scale interact to change accent and intelligibility in MetaVoice 1B outputs?
- What specific generation problem can occur even when hallucinated words are reduced, and how does it appear in the test results?
- Why might reference voice choice (built-in vs uploaded celebrity-like samples) lead to different reliability outcomes?
Key Points
1. MetaVoice 1B is an Apache-licensed, 1.2B-parameter open TTS model released with a GitHub repo and notebook workflow for experimentation.
2. The model is marketed for zero-shot voice cloning of American and British accents using about 30 seconds of reference audio.
3. MetaVoice 1B combines causal/non-causal Transformers with a multi-band diffusion stage and a “deep filter net” to reduce unwanted artifacts.
4. Hands-on generations often avoid hallucinated words, but they can drop content—producing silence or missing segments—depending on settings.
5. Temperature and guidance scale strongly influence accent character and output completeness, with different reference voices responding differently.
6. The repo workflow supports WebM or MP3 reference uploads and requires installing dependencies such as Flash Attention for smoother runs.
7. Upcoming fine-tuning scripts are expected to improve fidelity by allowing training on longer audio than the initial 30-second reference.