
They Beat OpenAI to the Punch... But at What Cost?

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Moshi is presented as a native multimodal voice model that can listen and speak in real time, with a cited latency of about 200 ms.

Briefing

A new open-access multimodal voice model called “Moshi” is being positioned as a fast follow to OpenAI’s GPT-4o voice demo, offering real-time listening and speech in a single model, but the live tests show it is far from reliably intelligent. The big draw isn’t that it matches GPT-4o today; it’s that the system is expected to become meaningfully better once researchers and developers can read the paper and download and modify the code and models.

The transcript frames Moshi as a “native multimodal foundation model” that can both understand and generate audio, with claimed emotion/tonality awareness. It is described as built from joint pretraining on mixed text and synthetic audio data, then fine-tuned on roughly 100,000 synthetic “oral style” conversations rendered to audio with text-to-speech. In practice, the model’s voice sounds decent and the latency is cited at about 200 milliseconds, but the conversation repeatedly derails: it cuts out, talks over the user, and struggles with tasks that require stable reasoning or consistent behavior.
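To make the “listens and speaks at the same time” claim concrete, here is a minimal sketch of a full-duplex voice loop and why the ~200 ms figure matters. Everything in it is an assumption for illustration (the 80 ms frame size and the Frame, ToyVoiceModel, and duplex_loop names are invented here, not Moshi’s actual code): a native model advances one step per incoming audio frame and emits a reply frame in the same pass, so perceived latency stays near a frame or two of buffering as long as each step finishes inside the frame interval.

```python
# Illustrative sketch only: hypothetical names, not Moshi's real code or API.
import time
from dataclasses import dataclass

FRAME_MS = 80  # assumed frame size: the model consumes/emits 80 ms audio chunks

@dataclass
class Frame:
    pcm: bytes  # raw audio samples for one frame

class ToyVoiceModel:
    """Stand-in for a native multimodal model: audio in, audio out, one pass."""
    def step(self, incoming: Frame) -> Frame:
        # A real model would encode the frame to audio tokens, run one
        # transformer step, and decode predicted tokens back to waveform.
        return Frame(pcm=b"\x00" * len(incoming.pcm))

def duplex_loop(mic_frames, model, play):
    """Listen and speak simultaneously: one model step per incoming frame.

    Perceived latency is roughly one frame of buffering plus per-step compute,
    so each step must finish inside FRAME_MS or the conversation falls behind.
    """
    for frame in mic_frames:
        t0 = time.perf_counter()
        reply = model.step(frame)   # listening and generating in the same pass
        play(reply)                 # start playing the reply frame immediately
        step_ms = (time.perf_counter() - t0) * 1000
        assert step_ms < FRAME_MS, "model is slower than real time"

if __name__ == "__main__":
    # Ten frames of silence at an assumed 16 kHz, 8-bit mono (1280 samples/frame).
    silence = [Frame(pcm=b"\x00" * 1280) for _ in range(10)]
    duplex_loop(silence, ToyVoiceModel(), play=lambda f: None)
```

A cascaded assistant cannot structure its loop this way, since it has to wait for a complete utterance before transcribing it; that structural difference, not raw voice quality, is what the low-latency claim is about.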

During a series of demonstrations, Moshi is pushed to prove its emotion sensing and expressive capability. It initially responds with confident-sounding guesses about the user’s tone, then fails to deliver on requests like singing on command. When asked to sing a butterfly song, it produces lyrics that are sometimes passable, but the interaction becomes chaotic: interruptions, repeated restarts, and the user having to manage the model’s timing and output. In another exchange, Moshi claims it can help with a choking incident by suggesting water and fresh air, but the broader conversation still shows limited competence and frequent breakdowns.

The transcript also compares Moshi with other voice-adjacent assistants. Pi AI is contrasted as a system that can generate a more polished “butterflies” song, though it cannot sing out loud directly and relies on separate components (text generation plus text-to-speech). ChatGPT is mentioned as having voice features and memory, but without the same direct multimodal voice setup in this test. The overall takeaway is that Moshi’s voice experience feels closer to GPT-4o’s “talk like a person” promise, yet its intelligence and controllability lag behind.

Still, the transcript treats open sourcing as the decisive factor. Even with a rough demo today, releasing the model and code could let the community scale it up, improve training, and reduce the failure modes seen in the live session. The creator’s conclusion is blunt: Moshi isn’t a GPT-4o competitor right now, but it may become one over time, especially as open-source developers iterate on the multimodal voice systems expected to dominate the next wave of AI products.

Cornell Notes

Moshi is presented as an open-access, native multimodal voice model that can listen and speak in real time, with a cited latency of about 200 ms and a voice that sounds fairly competitive. Built from joint text-audio pretraining and fine-tuned on large amounts of synthetic “oral style” conversation data, it is marketed as capable of emotion/tonality understanding. In live tests, it often cuts off, talks over the user, and struggles with tasks like sustained singing or consistent reasoning. The transcript’s main point is that today’s demo looks rough, but an open-source release could let the community scale and improve it into something more usable. It is framed as a follow-up to GPT-4o’s voice experience rather than a match for its current intelligence.

What makes Moshi stand out compared with typical voice assistants?

Moshi is described as a “native multimodal foundation model” that listens and generates audio/speech in real time within one system, aiming for the same kind of conversational, voice-first interaction associated with GPT-4o. The transcript also claims emotion/tonality awareness and cites about 200 milliseconds of latency, which supports the “talk like a person” feel even when the model’s reasoning is weak.

Why does the transcript treat open sourcing as the most important part of the story?

The demo is repeatedly shown as unreliable—cutting off, looping, and failing at tasks like singing on command. The transcript argues that the real leverage comes from access to the paper, code, and models: once the community can download and modify them, scaling and training improvements could reduce the rough edges and make the system meaningfully better over time.

How did Moshi perform on emotion-sensing tests?

Moshi sometimes produced plausible-sounding guesses about the user’s tone (e.g., nervousness, tension, sadness), but the interaction also devolved into loops and accusations of “cheating” when the user supplied the correct label. That suggests the emotion feature may be inconsistent, overly sensitive to context, or not robust enough for dependable interpretation.

How did Moshi handle singing requests, and what does that imply?

When asked to sing a butterfly song, Moshi generated lyrics and attempted to continue, but the session included interruptions, restarts, and the user having to push repeatedly for a performance. The transcript contrasts this with Pi AI, whose text-to-speech rendition of a butterfly song was judged smoother, implying Moshi’s voice generation may be less controllable even if it sounds decent.

What comparisons were made to other AI voice experiences?

Pi AI is described as using separate components (text generation plus text-to-speech), producing a better butterfly song but lacking direct emotion sensing and the ability to “sing out loud.” ChatGPT is mentioned as having voice features and memory, but in this test it is treated as not offering the same multimodal voice setup as Moshi. The comparisons reinforce that Moshi feels closer to GPT-4o’s voice interaction while still lagging in intelligence and reliability.
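The architectural split behind that comparison is easy to see in pseudocode. The sketch below contrasts a Pi-style cascaded pipeline with a native audio-to-audio call in the spirit of what the transcript describes; every function name is a hypothetical stub, not either product’s real API, and a real system would call actual ASR/LLM/TTS services.

```python
# Hypothetical stubs so the sketch runs; not any product's real API.
def speech_to_text(audio: bytes) -> str: return "hello"
def language_model(text: str) -> str: return f"you said: {text}"
def text_to_speech(text: str) -> bytes: return b"\x01\x02"
def voice_model(audio: bytes) -> bytes: return audio

def cascaded_reply(audio: bytes) -> bytes:
    """Pi-style: three separate stages chained together.

    Tone and emotion are discarded at the transcription step, and total
    latency is the sum of all three stages.
    """
    text_in = speech_to_text(audio)     # 1) transcribe (prosody lost here)
    text_out = language_model(text_in)  # 2) generate a text reply
    return text_to_speech(text_out)     # 3) synthesize speech

def native_reply(audio: bytes) -> bytes:
    """Moshi-style, as described: one model maps audio directly to audio,
    so it can in principle hear tone and respond without a text bottleneck."""
    return voice_model(audio)           # single pass; latency = one model call
```

The trade-off is visible in the structure: the cascaded version is easier to assemble from mature off-the-shelf parts, while the native version keeps prosody and can overlap listening with speaking, which is exactly the behavior, and the failure surface, seen in the demo.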

Review Questions

  1. What training ingredients (pretraining and fine-tuning data) are cited as powering Moshi’s multimodal voice behavior?
  2. Which failure modes appear most often in the transcript’s Moshi demonstrations (e.g., cutoffs, talking over the user, looping, inability to follow requests)?
  3. Why does open-source access matter more than the current demo quality in the transcript’s overall assessment?

Key Points

  1. Moshi is presented as a native multimodal voice model that can listen and speak in real time, with a cited latency of about 200 ms.
  2. The model’s current demo shows weak reliability: it cuts off, talks over the user, and struggles with sustained or precise tasks.
  3. Emotion/tonality sensing is claimed, but live tests show inconsistent results and conversational loops.
  4. Moshi’s voice quality is described as decent, yet singing and controllability fall short compared with text-to-speech-based alternatives like Pi AI.
  5. The transcript’s central thesis is that an open-source release (paper, code, and models) could let the community scale and improve the system substantially.
  6. Comparisons suggest Moshi is closer to GPT-4o’s “voice-first” feel than other assistants, but not a direct match for GPT-4o’s intelligence today.

Highlights

Moshi’s most compelling promise is open-source access: even a rough demo could improve quickly once developers can iterate on the model.
Live emotion-sensing tests often sound confident but can drift into loops or misinterpretations when the user challenges the output.
The singing demo illustrates the gap between “real-time voice” and “reliable, controllable performance.”
Pi AI’s butterfly song is judged smoother despite lacking the same direct multimodal emotion sensing, underscoring that voice quality and conversational intelligence are separate problems.

Topics

Mentioned

  • Philip Schmid
  • AGI
  • LLM
  • M1
  • GPT
  • GPT-2
  • GPT-3
  • GPT-4
  • TTS