AI Powered Voice Acting? - The FIRST LLM Designed for TTS
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Octave is positioned as an LLM built specifically for TTS, enabling emotion and acting-style control rather than only word-to-audio conversion.
Briefing
Hume AI’s Octave is being pitched as the first large language model built specifically for text-to-speech—one that doesn’t just read words, but follows acting-style instructions to deliver emotion, tone, and performance cues. In hands-on tests, the model can shift delivery mid-monologue (including sarcasm, frustration, and “voice trembles”) in a way that feels closer to voice acting than traditional TTS, where emotion is often limited to preset styles or post-processing.
The practical difference shows up in how Octave responds to script and direction. Instead of treating input as plain text to be converted into audio, Octave uses natural-language descriptions and “acting instructions” to steer performance. In demos, a beauty-blogger-style prompt produced a lively, expressive delivery that sounded like conversational narration, while a “monster in a cave” scenario only landed when the script and voice description matched closely—suggesting the model is sensitive to semantic alignment between what’s written and how it’s supposed to be performed.
Benchmarks cited by Hume place Octave ahead of ElevenLabs in description-match and audio quality, with naturalness nearly tied. But the more revealing results came from live experimentation: Octave handled emotional transitions well, yet struggled with voice consistency. In a goblin monologue that moved from anger to exhaustion, the voice sometimes shifted pitch and character feel between paragraphs, making it sound like slightly different performers rather than one stable character. The tester tried tightening structure and refining acting prompts (e.g., “voice acting from an animated film”) to improve continuity, with partial improvement.
The workflow also matters. The “playground” encourages quick iteration, but the tester found that inserting stage directions in parentheses inside the script often led to them being treated as literal text. Moving those cues into a dedicated “acting instructions” field produced cleaner results. For longer or more controlled performances, the tester switched to “projects,” splitting the monologue into sections and assigning different acting directions per segment. That approach improved pacing and emotional nuance, though the underlying consistency issue still appeared.
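The lesson about parenthetical cues can be sketched in a few lines of Python. This is a minimal illustration, not Hume's tooling: `split_cues` is a hypothetical helper, the sample script is invented, and the "acting instructions" string it returns is simply text you would paste into the dedicated field by hand.

```python
import re
from typing import Tuple

# Hypothetical helper (not part of any Hume SDK): matches "(...)" cues.
PAREN_CUE = re.compile(r"\(([^()]*)\)")

def split_cues(script: str) -> Tuple[str, str]:
    """Return (spoken_text, acting_instructions) for a script with inline cues."""
    # Collect the cue text found inside parentheses.
    cues = [c.strip() for c in PAREN_CUE.findall(script) if c.strip()]
    # Remove the cues from the spoken text, then tidy up spacing.
    spoken = PAREN_CUE.sub(" ", script)
    spoken = re.sub(r"\s+", " ", spoken)
    spoken = re.sub(r"\s+([,.!?;:])", r"\1", spoken).strip()
    return spoken, "; ".join(cues)

# Invented sample line, for illustration only.
spoken, instructions = split_cues(
    "I told you to leave. (voice trembles) Why are you still here? (sarcastic)"
)
print(spoken)        # text for the script field
print(instructions)  # text for the acting-instructions field
```

Keeping the two strings separate mirrors what worked in the walkthrough: the script field holds only words to be spoken, so nothing in it can be read aloud as literal text.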
A second character test—an old wizard ranting about modern magic—highlighted both strengths and limitations. Octave could sustain high-energy sarcasm and deliver the intended annoyance early on, but the energy sometimes sagged later in the playground version. Sectioning the dialogue into a project restored control and made the performance feel more coherent.
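The sectioning approach used in "projects" can be pictured as pairing each paragraph of a monologue with its own direction. The sketch below is an assumption-laden illustration: `Segment` and `build_segments` are hypothetical names, the monologue text is invented, and nothing here calls the Hume platform; it only prepares the per-section text and direction you would enter there.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    """One section of a longer performance (hypothetical structure)."""
    text: str
    acting_instructions: str

def build_segments(monologue: str, directions: List[str]) -> List[Segment]:
    """Split on blank lines and pair each paragraph with one direction."""
    paragraphs = [p.strip() for p in monologue.split("\n\n") if p.strip()]
    if len(paragraphs) != len(directions):
        raise ValueError(
            f"{len(paragraphs)} paragraphs but {len(directions)} directions"
        )
    return [Segment(p, d) for p, d in zip(paragraphs, directions)]

# Invented two-part monologue moving from anger to exhaustion.
monologue = (
    "You dare enter my tunnels? Get out!\n\n"
    "Fine. Stay. I am too tired to chase you."
)
directions = ["furious, shouting", "exhausted, defeated"]
segments = build_segments(monologue, directions)
for number, segment in enumerate(segments, start=1):
    print(f"Section {number} [{segment.acting_instructions}]: {segment.text}")
```

The length check is a deliberate design choice: a missing direction fails loudly instead of silently leaving a section without guidance, which is the situation where the playground versions tended to lose energy.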
Overall, Octave is positioned as a strong fit for emotion-driven voice acting, character narration, and listen-first experiences where expressiveness matters more than perfectly stable timbre. For tasks where a consistent voice identity is critical, such as straight audiobook-style or textbook reading, traditional TTS may still be preferable. Pricing is framed as competitive, with plans starting at around $3 per month and a limited free credit allotment for early testing. Hume acknowledged the consistency problem and indicated it should improve in the coming weeks without requiring a new model version.
Cornell Notes
Octave by Hume AI is presented as a large language model designed specifically for text-to-speech, with the ability to follow acting-style instructions and deliver emotion more like a human performance. In tests, it can change tone mid-monologue and is sensitive to semantic alignment between the script and the intended character (e.g., beauty blogger vs. monster-in-a-cave). The biggest weakness observed is voice consistency: emotional shifts can come with noticeable pitch/character changes across paragraphs. Using the platform’s acting-instructions field (rather than parenthetical cues in the script) and splitting longer dialogue into sections via “projects” improves control. The model is positioned as best for expressive narration and voice acting, while traditional TTS may still win for consistent voice playback, such as reading school material aloud.
What makes Octave different from conventional text-to-speech systems in this walkthrough?
Why did the “monster in a cave” example work better when the script and voice description matched?
What was the practical lesson about where to put stage directions like “voice trembles”?
What problem most limited the goblin performance, even when emotion transitions were strong?
How did using “projects” instead of the “playground” change the outcome for the wizard rant?
When does the walkthrough suggest Octave is the better choice versus traditional TTS?
Review Questions
- What role do “acting instructions” play in Octave’s output, and how does it differ from placing cues inside the script?
- What specific failure mode did the tester observe regarding voice consistency, and what steps were tried to mitigate it?
- Why might splitting a long monologue into sections (projects) improve results compared with a single continuous script (playground)?
Key Points
1. Octave is positioned as an LLM built specifically for TTS, enabling emotion and acting-style control rather than only word-to-audio conversion.
2. Semantic alignment between the character description and the script content strongly affects whether the performance matches the intended role.
3. Stage directions should go into the dedicated acting-instructions field; parenthetical cues inside the script can be misinterpreted as literal text.
4. Octave can shift emotion mid-monologue effectively, but voice consistency (stable pitch/character identity across paragraphs) is a key weakness observed in testing.
5. Using “projects” to split dialogue into sections with different acting instructions improves pacing and control for longer performances.
6. For emotion-driven voice acting and character narration, Octave is presented as a leading option; for consistent-voice transcription, traditional TTS may still be preferable.
7. Pricing is framed as competitive, with entry-level plans around $3 per month and a limited free credit allotment for early trials.