
AI Powered Voice Acting? - The FIRST LLM Designed for TTS

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Octave is positioned as an LLM built specifically for TTS, enabling emotion and acting-style control rather than only word-to-audio conversion.

Briefing

Hume AI’s Octave is being pitched as the first large language model built specifically for text-to-speech—one that doesn’t just read words, but follows acting-style instructions to deliver emotion, tone, and performance cues. In hands-on tests, the model can shift delivery mid-monologue (including sarcasm, frustration, and “voice trembles”) in a way that feels closer to voice acting than traditional TTS, where emotion is often limited to preset styles or post-processing.

The practical difference shows up in how Octave responds to script and direction. Instead of treating input as plain text to be converted into audio, Octave uses natural-language descriptions and “acting instructions” to steer performance. In demos, a beauty-blogger-style prompt produced a lively, expressive delivery that sounded like conversational narration, while a “monster in a cave” scenario only landed when the script and voice description matched closely—suggesting the model is sensitive to semantic alignment between what’s written and how it’s supposed to be performed.

Benchmarks cited by Hume place Octave ahead of ElevenLabs in description-match and audio quality, with naturalness nearly tied. But the more revealing results came from live experimentation: Octave handled emotional transitions well, yet struggled with voice consistency. In a goblin monologue that moved from anger to exhaustion, the voice sometimes shifted pitch and character feel between paragraphs, making it sound like slightly different performers rather than one stable character. The tester tried tightening structure and refining acting prompts (e.g., “voice acting from an animated film”) to improve continuity, with partial improvement.

The workflow also matters. The “playground” encourages quick iteration, but the tester found that inserting stage directions in parentheses inside the script often led to them being treated as literal text. Moving those cues into a dedicated “acting instructions” field produced cleaner results. For longer or more controlled performances, the tester switched to “projects,” splitting the monologue into sections and assigning different acting directions per segment. That approach improved pacing and emotional nuance, though the underlying consistency issue still appeared.

A second character test—an old wizard ranting about modern magic—highlighted both strengths and limitations. Octave could sustain high-energy sarcasm and deliver the intended annoyance early on, but the energy sometimes sagged later in the playground version. Sectioning the dialogue into a project restored control and made the performance feel more coherent.

Overall, Octave is positioned as a strong fit for emotion-driven voice acting, character narration, and listen-first experiences where expressiveness matters more than perfectly stable timbre. For tasks like audiobook-style narration or textbook reading, where a consistent voice identity is critical, traditional TTS may still be preferable. Pricing is framed as competitive, with plans starting around $3 per month plus a limited free credit allotment for early testing. Hume acknowledged the consistency problem and indicated it should improve in the coming weeks without requiring a new model version.

Cornell Notes

Octave by Hume AI is presented as a large language model designed specifically for text-to-speech, with the ability to follow acting-style instructions and deliver emotion more like a human performance. In tests, it can change tone mid-sentence and respond to semantic alignment between the script and the intended character (e.g., beauty blogger vs. monster-in-a-cave). The biggest weakness observed is voice consistency: emotional shifts can come with noticeable pitch/character changes across paragraphs. Using the platform’s acting-instructions field (rather than parenthetical cues in the script) and splitting longer dialogue into sections via “projects” improves control. The model is positioned as best for expressive narration and voice acting, while traditional TTS may still win for consistent-voice playback, such as reading school materials aloud.

What makes Octave different from conventional text-to-speech systems in this walkthrough?

Octave is built as an LLM for TTS, so it can use meaning and acting-style direction to shape delivery—not just convert text to audio. The tester shows that natural-language descriptions of a voice (e.g., “charismatic,” “slightly nasal,” “expressive”) and semantic character framing influence how the model performs. It also supports “acting instructions” that control emotion and performance cues such as sarcasm, whispering, and “voice trembles,” enabling mid-monologue shifts that feel closer to acting than standard TTS.

Why did the “monster in a cave” example work better when the script and voice description matched?

When the voice description and the script content aligned, the delivery sounded closer to the intended character. A mismatch—where the script didn’t naturally fit the “monster” framing—produced a more ordinary-sounding read. The walkthrough suggests Octave’s performance depends on semantic steering: the model seems to interpret both what’s written and how it should be acted, so character cues work best when the text itself supports them.

What was the practical lesson about where to put stage directions like “voice trembles”?

Parenthetical cues embedded in the script were treated as literal text rather than performance direction (the tester saw “voice trembles” read as part of the line instead of shaping the delivery). Moving those cues into the dedicated “acting instructions” field produced the intended effect. The platform’s structure—base voice + acting instructions + speech text—matters for getting clean results.
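The separation described above can be sketched as a request payload. This is a minimal illustration only: the endpoint URL, header name, and field names below are assumptions, not Hume’s documented API shape. The point is structural—stage directions live in their own field, so they can never leak into the spoken text.

```python
import json
import os

# Assumed endpoint for an Octave-style TTS API (illustrative, not documented).
API_URL = "https://api.hume.ai/v0/tts"

def build_tts_payload(voice_description, acting_instructions, text):
    """Keep stage directions out of `text`; give them their own field."""
    return {
        "voice": {"description": voice_description},   # base character/timbre
        "utterances": [
            {
                "text": text,                          # only the words to speak
                "acting_instructions": acting_instructions,  # performance cues
            }
        ],
    }

payload = build_tts_payload(
    voice_description="a gravelly goblin, slightly nasal, expressive",
    acting_instructions="voice trembles with exhaustion, near a whisper",
    text="I have carried these rocks for a hundred years.",
)

api_key = os.environ.get("HUME_API_KEY")
if api_key:
    import requests  # only needed for a live call
    resp = requests.post(API_URL, json=payload,
                         headers={"X-Hume-Api-Key": api_key})
    print(resp.status_code)
else:
    # Without a key, just show the payload we would send.
    print(json.dumps(payload, indent=2))
```

Compare this with stuffing “(voice trembles)” into the text field: in the tester’s runs, that parenthetical risked being spoken aloud, whereas a dedicated field keeps direction and dialogue cleanly apart.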

What problem most limited the goblin performance, even when emotion transitions were strong?

Emotion control came with reduced voice consistency. As the goblin monologue progressed, the voice sometimes changed pitch and sounded like a different goblin performer every other paragraph. The tester tried consolidating structure and refining acting prompts (e.g., “voice acting from an animated film”) to stabilize the character, with only partial improvement.

How did using “projects” instead of the “playground” change the outcome for the wizard rant?

In the playground, the rant started strong but energy sometimes faded later. In projects, the tester split the dialogue into sections and assigned different acting instructions per segment (e.g., angry high energy, then reminiscing frustration, then insulting/scolding). That sectioning improved pacing and made the performance feel more controlled and coherent across the full monologue.
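The sectioning idea behind “projects” can be sketched in a few lines. The segment directions below mirror the wizard-rant example; the data shape and render step are illustrative assumptions, not Hume’s actual project format.

```python
# One long monologue split into sections, each with its own acting direction,
# instead of a single continuous script. Direction "resets" at each segment.
wizard_rant = [
    ("angry, high energy",      "Wands with touchscreens? Touchscreens!"),
    ("reminiscing, frustrated", "In my day, a spell took decades to master."),
    ("insulting, scolding",     "And you lot can barely hold a wand straight."),
]

def render_sections(sections):
    """Return one request per section so each segment gets fresh direction."""
    return [
        {"acting_instructions": direction, "text": text}
        for direction, text in sections
    ]

requests_out = render_sections(wizard_rant)
for r in requests_out:
    print(f"[{r['acting_instructions']}] {r['text']}")
```

Per the walkthrough, this per-segment structure is what kept the later parts of the rant from losing energy: each section carries its own explicit direction rather than inheriting a fading read from one long prompt.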

When does the walkthrough suggest Octave is the better choice versus traditional TTS?

Octave is recommended when emotion, natural acting, and instructable delivery are the priority—such as character narration and voice-acting-style content. Traditional TTS is suggested for tasks like reading textbooks aloud for school, or any listening use where consistent voice identity matters more than expressive performance.

Review Questions

  1. What role do “acting instructions” play in Octave’s output, and how does that differ from placing cues directly inside the script?
  2. What specific failure mode did the tester observe regarding voice consistency, and what steps were tried to mitigate it?
  3. Why might splitting a long monologue into sections (projects) improve results compared with a single continuous script (playground)?

Key Points

  1. Octave is positioned as an LLM built specifically for TTS, enabling emotion and acting-style control rather than only word-to-audio conversion.

  2. Semantic alignment between the character description and the script content strongly affects whether the performance matches the intended role.

  3. Stage directions should go into the dedicated acting-instructions field; parenthetical cues inside the script can be misinterpreted as literal text.

  4. Octave can shift emotion mid-monologue effectively, but voice consistency (stable pitch/character identity across paragraphs) is a key weakness observed in testing.

  5. Using “projects” to split dialogue into sections with different acting instructions improves pacing and control for longer performances.

  6. For emotion-driven voice acting and character narration, Octave is presented as a leading option; for consistent-voice transcription, traditional TTS may still be preferable.

  7. Pricing is framed as competitive, with entry-level plans around $3 per month and a limited free credit allotment for early trials.

Highlights

  • Octave can follow acting-style prompts and change emotional delivery mid-monologue, producing results that feel closer to voice acting than standard TTS.
  • The model’s biggest observed drawback is voice consistency—emotional shifts can cause pitch/character changes that make the same character sound different across sections.
  • Putting cues like “voice trembles” in acting instructions (not in parentheses inside the script) is crucial for getting the intended performance behavior.
  • Splitting long dialogue into sections inside “projects” improves control and prevents energy from sagging later in the performance.

Topics

  • Octave TTS
  • LLM Voice Acting
  • Acting Instructions
  • Voice Consistency
  • Prompt To Voice
