OpenAI Realtime API - The NEW ERA of Speech to Speech? - TESTED
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s Realtime API can deliver genuinely interactive speech-to-speech experiences over a single persistent WebSocket, while also supporting fast text streaming and even simultaneous voice and text. In a hands-on build, the developer demonstrates a voice persona that responds in real time, a text version that types back instantly over the same WebSocket connection, and a combined interface where voice and text run side by side. The practical takeaway is that low-latency, interruptible conversational UX is achievable, but the engineering details (especially interruption during voice) still need work.
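The build itself is not shown line by line, but a minimal connection sketch (Python, using the third-party websockets package) illustrates the mechanism. The endpoint, beta header, model name, and session.update event here follow the preview-era documentation as commonly described and may have changed, so treat this as an assumption-laden illustration rather than the video's code:

```python
import asyncio
import json
import os

import websockets

# Preview-era endpoint; the model name in the query string is an assumption.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def connect() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # One long-lived connection carries the whole conversation; that is what keeps
    # per-turn latency low compared with opening a fresh HTTP request per message.
    # Note: newer releases of the websockets package rename extra_headers to additional_headers.
    async with websockets.connect(REALTIME_URL, extra_headers=headers) as ws:
        # Configure the session once: modalities, system instructions, voice.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful assistant.",
                "voice": "alloy",
            },
        }))
        # Every later turn (user input, model deltas, interruptions) flows as JSON
        # events over this same socket.
        async for raw in ws:
            print(json.loads(raw).get("type"))

if __name__ == "__main__":
    asyncio.run(connect())
```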
The demo starts with a voice app that uses configurable “personas” and a small knowledge base embedded into the prompt. One persona is instructed to behave like a fast robot; another is given an Irish traveler identity with a strong dialect and characteristic phrasing. When prompted, the system answers with content that reflects both the persona instructions and the supplied channel-related context (including details about the “All About AI” YouTube channel). The developer notes the setup is somewhat painful, describing the code as “spaghetti,” but the core interaction works.
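As a rough sketch of how a persona plus a small knowledge base could be folded into the session instructions over the same connection (the persona texts, channel facts, and helper names below are hypothetical, not taken from the video):

```python
import json

# Hypothetical persona texts and channel facts, standing in for the video's prompt content.
PERSONAS = {
    "fast_robot": "You are a fast, terse robot. Answer in short, clipped sentences.",
    "irish_traveler": "You are an Irish traveler. Use a strong dialect and characteristic phrasing.",
}

KNOWLEDGE_BASE = (
    "Channel facts: 'All About AI' is a YouTube channel covering AI tools, APIs, "
    "and hands-on builds."
)

def build_instructions(persona_key: str) -> str:
    # The persona steers tone and dialect; the knowledge base supplies grounded content.
    return f"{PERSONAS[persona_key]}\n\nUse this background when relevant:\n{KNOWLEDGE_BASE}"

def persona_session_update(persona_key: str) -> str:
    # JSON event to send over the already-open WebSocket when switching personas.
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": build_instructions(persona_key)},
    })

print(persona_session_update("irish_traveler"))
```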
Switching to text makes the benefit of the same real-time architecture even clearer: messages stream quickly because the WebSocket stays open. The developer tests a long response and then interrupts it by pressing “stop,” and the model halts mid-generation and acknowledges the interruption. That interruption behavior is presented as a standout capability for text, even if it is not yet fully solved for the voice version.
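Building on the connection sketch above, here is a hedged example of how text streaming and a stop action could be wired. The event names (conversation.item.create, response.create, response.text.delta, response.cancel, response.done) follow the preview docs as I understand them and should be verified against current documentation:

```python
import asyncio
import json

async def stream_text_with_stop(ws, user_text: str, stop_event: asyncio.Event) -> None:
    # Add the user message to the conversation, then request a text-only response.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": user_text}],
        },
    }))
    await ws.send(json.dumps({"type": "response.create",
                              "response": {"modalities": ["text"]}}))

    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") == "response.text.delta":
            print(event["delta"], end="", flush=True)   # stream tokens as they arrive
        elif event.get("type") == "response.done":
            break                                        # finished or cancelled
        # A "stop" button in the UI sets stop_event; cancel the in-flight response.
        if stop_event.is_set():
            await ws.send(json.dumps({"type": "response.cancel"}))
            stop_event.clear()
```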
A combined mode then runs voice and text simultaneously. The system generates text output (including code-like responses) while voice provides spoken explanations. The developer confirms the combined setup is fast, but flags a missing feature: interruption during the voice portion isn’t working yet. They plan to revisit interruption logic over the weekend.
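A hedged sketch of a combined handler that routes text deltas to a transcript and audio deltas to playback; the base64-encoded PCM16 audio format reflects the preview docs as I understand them, and play_pcm16 is a hypothetical stand-in for whatever audio player the app uses:

```python
import base64
import json

def handle_event(raw: str, transcript: list[str], audio_chunks: list[bytes]) -> None:
    """Route one server event: text deltas to the transcript, audio deltas to playback."""
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "response.text.delta":
        transcript.append(event["delta"])                       # code-like / written output
    elif etype == "response.audio.delta":
        audio_chunks.append(base64.b64decode(event["delta"]))   # spoken explanation (PCM16)
    elif etype == "response.done":
        print("".join(transcript))
        # play_pcm16(b"".join(audio_chunks))  # hypothetical hand-off to an audio player

# Tiny usage example with a fake text delta:
transcript: list[str] = []
audio: list[bytes] = []
handle_event(json.dumps({"type": "response.text.delta", "delta": "Hello"}), transcript, audio)
```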
Cost becomes the other major headline. While experimenting, the developer reports spending roughly $15 on real-time usage during testing, with the bill later reaching about $38—suggesting that Realtime API usage can ramp quickly even without a full product. They recommend having a strong use case before building something larger, and caution that experimentation can get expensive.
Beyond the core demo, the developer experiments with instruction-following by forcing the model to respond with rhyming words and then builds a simple letter-chain game in which each new name must begin with the last letter of the previous one. The model performs “mostly” as required, sometimes substituting synonyms, and the game ends when a response violates the rule. Finally, they outline future directions: a voice-controlled phone-call attempt to a real company (with function calling and safety constraints), deeper work on AI agents with tools, and waiting on upcoming multimodal features (vision plus voice), since early experimentation may not be worth the cost until those modalities mature.
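For clarity, the letter-chain rule as summarized can be checked with a few lines of Python; this is a reconstruction of the rule, not the video's code:

```python
def follows_chain(previous: str, candidate: str) -> bool:
    """True if the new name starts with the last letter of the previous name."""
    prev, cand = previous.strip().lower(), candidate.strip().lower()
    return bool(prev) and bool(cand) and cand[0] == prev[-1]

# "Oslo" -> "Odessa" keeps the chain; "Oslo" -> "Madrid" breaks it and ends the game.
assert follows_chain("Oslo", "Odessa")
assert not follows_chain("Oslo", "Madrid")
```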
Overall, the Realtime API looks promising for low-latency conversational interfaces, but the demo underscores two realities: interruption control is still uneven between voice and text, and real-time usage costs can accumulate fast without careful throttling and product-level justification.
Cornell Notes
OpenAI’s Realtime API supports low-latency, interactive chat experiences by keeping a WebSocket open, enabling both voice and text streaming. In the build shown, text responses stream quickly and can be interrupted mid-generation with a “stop” action, while voice interruption is not fully implemented yet. The developer also demonstrates persona switching (e.g., “robot” vs. Irish traveler) and prompt-based knowledge injection to steer responses. A combined voice+text mode runs both channels at once, but interruption remains the main gap. Cost is a major constraint: testing alone reportedly drove a bill from about $15 to roughly $38, so experimentation needs tight use-case planning.
- What technical mechanism makes the text experience feel “instant” in the demo?
- How does the demo steer the model’s behavior beyond plain conversation?
- What interruption behavior works reliably, and what still doesn’t?
- What evidence is given about instruction-following quality?
- Why is cost treated as a central limitation rather than a footnote?
- What future directions are proposed, and what’s driving the sequencing?
Review Questions
- Which parts of the demo demonstrate WebSocket-driven streaming, and how does that differ from interruption behavior?
- How do persona instructions and embedded knowledge base content change the model’s responses in the voice and text modes?
- What constraints—technical and financial—limit what can be built immediately, and what roadmap is suggested to address them?
Key Points
1. A persistent WebSocket connection is the backbone for low-latency streaming in both voice and text modes.
2. Text generation can be interrupted mid-stream using a stop action, but voice interruption is not yet reliably implemented.
3. Persona switching (e.g., robot vs. Irish traveler) and prompt-embedded knowledge can meaningfully shape tone, dialect, and content.
4. Running voice and text simultaneously is feasible and remains fast, but shared UX features like interruption need additional engineering.
5. Realtime API experimentation can become expensive quickly; testing reportedly drove a bill from about $15 to roughly $38.
6. Instruction-following improves under tight constraints (rhymes, one-word outputs, letter-chain rules) but still shows occasional deviations.
7. Planned next steps include function-calling phone-call attempts, tool-based voice agents, and waiting for multimodal (vision+voice) capabilities to mature.