
OpenAI Realtime API - The NEW ERA of Speech to Speech? - TESTED

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

A persistent WebSocket connection is the backbone for low-latency streaming in both voice and text modes.

Briefing

OpenAI’s Realtime API can deliver genuinely interactive speech-to-speech experiences over a single persistent WebSocket connection, while also supporting fast text streaming and even simultaneous voice and text. In a hands-on build, the developer demonstrates a voice persona that responds in real time, a text version that types back instantly over the same WebSocket connection, and a combined interface where voice and text run side by side. The practical takeaway is that low-latency, interruptible conversational UX is achievable, but engineering details (especially interruption during voice) still need work.

The demo starts with a voice app that uses configurable “personas” and a small knowledge base embedded into the prompt. One persona is instructed to behave like a fast robot; another is given an Irish traveler identity with a strong dialect and phrase style. When prompted, the system answers with content that reflects both the persona instructions and the supplied channel-related context (including details about the “All About AI” YouTube channel). The developer notes the setup is somewhat painful—code is described as “spaghetti”—but the core interaction works.

Switching to text, the same real-time architecture becomes even clearer: messages stream quickly because the WebSocket stays open. The developer tests a long response and then interrupts it by pressing “stop,” with the model halting mid-generation and acknowledging the interruption. That interruption behavior is presented as a standout capability for text, even if it’s not yet fully solved for the voice version.

A combined mode then runs voice and text simultaneously. The system generates text output (including code-like responses) while voice provides spoken explanations. The developer confirms the combined setup is fast, but flags a missing feature: interruption during the voice portion isn’t working yet. They plan to revisit interruption logic over the weekend.

Cost becomes the other major headline. While experimenting, the developer reports spending roughly $15 on real-time usage during testing, with the bill later reaching about $38—suggesting that Realtime API usage can ramp quickly even without a full product. They recommend having a strong use case before building something larger, and caution that experimentation can get expensive.

Beyond the core demo, the developer experiments with instruction-following by forcing the model to respond with rhyming words and then builds a simple letter-chain game where each new name must begin with the last letter of the previous name. The model mostly plays as required (sometimes substituting synonyms), and the game ends when a response violates the rule. Finally, they outline future directions: a voice-controlled phone-call attempt to a real company (with function calling and safety constraints), deeper work on AI agents with tools, and waiting on upcoming multimodal features (vision plus voice) because early experimentation may not be worth the cost until modalities mature.

Overall, the Realtime API looks promising for low-latency conversational interfaces, but the demo underscores two realities: interruption control is still uneven across voice vs. text, and real-time usage costs can accumulate fast without careful throttling and product-level justification.

Cornell Notes

OpenAI’s Realtime API supports low-latency, interactive chat experiences by keeping a WebSocket open, enabling both voice and text streaming. In the build shown, text responses stream quickly and can be interrupted mid-generation with a “stop” action, while voice interruption is not fully implemented yet. The developer also demonstrates persona switching (e.g., “robot” vs. Irish traveler) and prompt-based knowledge injection to steer responses. A combined voice+text mode runs both channels at once, but interruption remains the main gap. Cost is a major constraint: testing alone reportedly drove a bill from about $15 to roughly $38, so experimentation needs tight use-case planning.

What technical mechanism makes the text experience feel “instant” in the demo?

The text version stays connected via a persistent WebSocket, so tokens/outputs can stream back without repeated request setup. That’s why the system can type quickly after each user message and continue generating until a stop signal is triggered.
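The flow can be sketched as follows. This is a minimal, illustrative sketch: the event names (`response.create`, `response.text.delta`, `response.done`) and the endpoint URL follow OpenAI's Realtime API documentation at the time of writing and should be verified against current docs; the accumulator simply models how streamed deltas become the final reply.

```python
import json

# Assumed endpoint per OpenAI's Realtime API docs; verify model name and URL
# against current documentation before use.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

def make_text_request(prompt: str) -> str:
    """Build a response.create event asking for a text-only response.

    Over the persistent socket, the client just sends this JSON string;
    there is no per-message HTTP handshake.
    """
    return json.dumps({
        "type": "response.create",
        "response": {
            "modalities": ["text"],
            "instructions": prompt,
        },
    })

def accumulate_text(events: list[dict]) -> str:
    """Collect streamed response.text.delta events into the full reply.

    Each delta arrives as soon as the model produces it, which is why the
    text version appears to 'type' back instantly.
    """
    parts = []
    for event in events:
        if event.get("type") == "response.text.delta":
            parts.append(event.get("delta", ""))
        elif event.get("type") == "response.done":
            break
    return "".join(parts)
```

In a real client these events would be sent and received over a library such as `websockets`, with the connection opened once and reused for the whole conversation.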

How does the demo steer the model’s behavior beyond plain conversation?

It uses persona instructions and a small knowledge base embedded into the prompt. Personas include a “robot” style (fast, hyperspeed) and an “Irish traveler” style (strong dialect and Irish phrases). The knowledge base provides channel-related context so answers about the “All About AI” YouTube channel can reflect that supplied information.
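A persona plus knowledge base setup like the one described could be expressed as a `session.update` event. The event shape follows OpenAI's Realtime API docs, but the persona strings, the prompt layout, and the `voice` value here are illustrative assumptions, not the exact ones from the video.

```python
import json

# Hypothetical persona strings, modeled loosely on the two shown in the demo.
ROBOT = "You are a fast robot. Answer at hyperspeed, in short clipped sentences."
IRISH = "You are an Irish traveler. Use a strong dialect and Irish phrases."

def make_session_update(persona: str, knowledge: str) -> str:
    """Build a session.update event embedding persona instructions and a
    small knowledge base into the session's system instructions."""
    instructions = (
        f"{persona}\n\n"
        "Use the following background when it is relevant:\n"
        f"{knowledge}"
    )
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": "alloy",  # assumed choice; the API offers several voices
        },
    })
```

Switching personas is then just a matter of sending a new `session.update` with different instruction text over the same open socket.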

What interruption behavior works reliably, and what still doesn’t?

Text interruption works: during a long streamed response, pressing stop halts generation immediately and the system acknowledges the interruption. Voice interruption is the missing piece—interrupting the voice portion isn’t working in the current version, and the developer plans to implement it later.
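The stop action can be sketched as the client sending a cancel event mid-stream. The `response.cancel` event name follows OpenAI's Realtime API docs (an assumption to verify); the loop below is a simplified model of a client that stops consuming deltas when the user presses stop.

```python
import json

def make_cancel() -> str:
    """Build the response.cancel event a client sends when the user presses
    stop; the server then halts generation of the in-progress response."""
    return json.dumps({"type": "response.cancel"})

def stream_until_stopped(events: list[dict], stop_after: int) -> tuple[str, bool]:
    """Accumulate text deltas, simulating the user pressing stop after
    `stop_after` deltas. Returns the partial text and whether the stream
    was interrupted; a real client would send make_cancel() at that point."""
    parts = []
    for event in events:
        if event.get("type") != "response.text.delta":
            continue
        parts.append(event["delta"])
        if len(parts) >= stop_after:
            return "".join(parts), True
    return "".join(parts), False
```

For voice, the same cancel event is not enough on its own: the client also has to cut off audio that is already buffered for playback, which is part of what makes voice interruption harder to get right.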

What evidence is given about instruction-following quality?

The developer tests constrained outputs: first, “only answer with a word that rhymes with the last word from the user,” and the model mostly follows by returning one-word rhymes, though it sometimes uses synonyms rather than perfect rhymes. Then a letter-chain game requires responding with a name that begins with the last letter of the previous name; the model plays along but eventually fails a turn, ending the game.
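The letter-chain rule is simple enough to check programmatically; a validator like the following (an illustrative helper, not from the video) is how an app could detect the losing turn automatically.

```python
def follows_chain(previous: str, candidate: str) -> bool:
    """Check the letter-chain rule from the demo's game: each new name must
    begin with the last letter of the previous name (case-insensitive)."""
    prev = previous.strip()
    cand = candidate.strip()
    if not prev or not cand:
        return False
    return cand[0].lower() == prev[-1].lower()
```

Running each model response through such a check would end the game on the first rule violation rather than relying on the human to notice it.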

Why is cost treated as a central limitation rather than a footnote?

The developer reports spending about $15 during testing and later seeing the bill rise to around $38, with a significant portion attributed to the real-time API. That experience leads to a warning: Realtime usage can become expensive quickly, so building a full app needs a strong, justified use case and careful experimentation.

What future directions are proposed, and what’s driving the sequencing?

Planned work includes (1) attempting a quick phone call to a real company using system instructions plus function calling, (2) building voice-controlled AI agents with tools, and (3) waiting on multimodal features (vision + voice) because the next Realtime API expansions are expected to add modalities over time and early multimodal experiments may be too costly.
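The phone-call idea maps to registering a tool in the session configuration. The `tools` shape follows OpenAI's Realtime API function-calling docs, but the tool name `place_phone_call`, its parameters, and the guardrail instruction are hypothetical stand-ins for whatever the developer ends up building.

```python
import json

def make_call_tool_session() -> str:
    """Build a session.update event registering a hypothetical
    place_phone_call tool with a safety instruction in the system prompt."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "You may place phone calls only to the company the user names. "
                "Always confirm with the user before dialing."
            ),
            "tools": [{
                "type": "function",
                "name": "place_phone_call",
                "description": "Dial a phone number and open an audio bridge.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "number": {"type": "string"},
                        "reason": {"type": "string"},
                    },
                    "required": ["number", "reason"],
                },
            }],
        },
    })
```

When the model decides to call the tool, the client receives a function-call event with these arguments and is responsible for actually placing (or refusing) the call, which is where the safety constraints live.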

Review Questions

  1. Which parts of the demo demonstrate WebSocket-driven streaming, and how does that differ from interruption behavior?
  2. How do persona instructions and embedded knowledge base content change the model’s responses in the voice and text modes?
  3. What constraints—technical and financial—limit what can be built immediately, and what roadmap is suggested to address them?

Key Points

  1. A persistent WebSocket connection is the backbone for low-latency streaming in both voice and text modes.

  2. Text generation can be interrupted mid-stream using a stop action, but voice interruption is not yet reliably implemented.

  3. Persona switching (e.g., robot vs. Irish traveler) and prompt-embedded knowledge can meaningfully shape tone, dialect, and content.

  4. Running voice and text simultaneously is feasible and remains fast, but shared UX features like interruption need additional engineering.

  5. Realtime API experimentation can become expensive quickly; testing reportedly drove a bill from about $15 to roughly $38.

  6. Instruction-following improves under tight constraints (rhymes, one-word outputs, letter-chain rules) but still shows occasional deviations.

  7. Planned next steps include function-calling phone-call attempts, tool-based voice agents, and waiting for multimodal (vision+voice) capabilities to mature.

Highlights

Text streaming feels “super fast” because the system keeps a WebSocket open for continuous interaction.
Pressing stop halts a long text response immediately—an interruption capability that works in text but not yet in voice.
Cost escalates quickly during testing: roughly $15 spent on real-time usage and a later bill around $38.
Persona prompts can switch speaking style and dialect, including Irish phrasing and a “robot” hyperspeed mode.
Constrained games (rhymes, last-letter chaining) show the model can follow rules most of the time, but not perfectly.
