OpenAI DevDay 2024 | Multimodal apps with the Realtime API
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
The Realtime API is designed for low-latency speech-in, speech-out by using a model that natively understands and generates speech.
Briefing
OpenAI’s Realtime API is built to deliver natural, low-latency “speech-in, speech-out” experiences through a single interface—removing the multi-step glue work that previously made voice assistants feel slow and brittle. Instead of stitching together separate systems for speech capture, transcription, text generation, and text-to-speech, the Realtime API uses a model that natively understands and generates speech. That design enables faster turn-taking, more fluid interruption behavior, and richer audio nuance because the system can process audio directly rather than converting everything into text first.
The session contrasts the “old way” with the new approach. Traditional pipelines typically require detecting the end of speech (or a push-to-talk gesture), then transcription with OpenAI’s Whisper model, then a text response from a language model (GPT-4o in the example), and finally a third model to synthesize speech. Each stage adds delay and can degrade conversational flow—especially when users interrupt mid-response. The Realtime API collapses those steps into a single real-time stream, leveraging the same underlying speech capability that powers advanced voice mode in ChatGPT.
A key technical shift is how the API maintains responsiveness: apps open a stateful WebSocket connection to the Realtime endpoint (/v1/realtime). Audio input can be streamed as it is spoken, while model output audio is streamed back as soon as it is produced. Messages are JSON events, including audio deltas for playback and events that signal when the user starts speaking. That event support is what makes interruptions practical: when a user cuts in, the client stops the currently playing audio and sends interruption metadata (such as the played offset and context identifiers) so the model can align its next output with what was actually heard.
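The client-side half of that event flow can be sketched as plain message builders. The event types below (input_audio_buffer.append, conversation.item.truncate) follow the Realtime API beta docs, but the helper names are ours and exact field shapes should be checked against the current reference; this is a sketch, not a complete client.

```javascript
// Sketch of the client-side Realtime events; send each with ws.send(JSON.stringify(event)).

// Wrap a chunk of PCM16 microphone audio in an append event.
// (Buffer is Node's base64 encoder; a browser client would use btoa instead.)
function makeAppendEvent(pcm16Chunk) {
  return {
    type: "input_audio_buffer.append",
    audio: Buffer.from(pcm16Chunk.buffer).toString("base64"),
  };
}

// When the server signals the user started speaking, the client stops local
// playback and tells the model how much of its audio was actually heard,
// so the next response aligns with what the user perceived.
function makeTruncateEvent(itemId, playedMs) {
  return {
    type: "conversation.item.truncate",
    item_id: itemId,       // identifier of the assistant message being cut off
    content_index: 0,
    audio_end_ms: playedMs, // playback offset reached before the interruption
  };
}

const append = makeAppendEvent(new Int16Array([0, 1024, -1024]));
const truncate = makeTruncateEvent("item_123", 2350);
```

In a real client, makeAppendEvent would be called continuously from the microphone capture callback, and makeTruncateEvent in response to a speech-started event from the server.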
The session also highlights practical developer building blocks. A live coding walkthrough shows a browser-based client built on the Web Audio API: it decodes the PCM16 audio (24 kHz) returned by the API, plays it, and records microphone audio to send back as input_audio_buffer.append messages. The API supports other audio codecs (including G.711), and it can handle function calling alongside speech.
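The decode/encode step mentioned above comes down to sample-format conversion: the API streams raw PCM16, while the Web Audio API works with Float32 samples in [-1, 1]. A minimal sketch of both directions (helper names are ours; the conversion math is the standard one):

```javascript
// Decode PCM16 samples from the API into Float32 for a Web Audio AudioBuffer.
function pcm16ToFloat32(int16Samples) {
  const out = new Float32Array(int16Samples.length);
  for (let i = 0; i < int16Samples.length; i++) {
    out[i] = int16Samples[i] / 32768; // map [-32768, 32767] onto ~[-1, 1]
  }
  return out;
}

// Encode microphone Float32 samples back to PCM16 before appending them
// to the input audio buffer.
function float32ToPcm16(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to valid range
    out[i] = s < 0 ? s * 32768 : s * 32767; // asymmetric Int16 range
  }
  return out;
}

const floats = pcm16ToFloat32(new Int16Array([-32768, 0, 32767]));
```

In the browser, the Float32 samples would then be copied into an AudioBuffer (created at 24000 Hz to match the API's output rate) and played through an AudioContext.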
To demonstrate real-world integration, a 3D solar-system tutoring app uses tool calls to trigger visual changes when users ask about planets and to display charts when answers depend on structured data. The app also fetches live information—such as the International Space Station’s current latitude/longitude from a real-time endpoint—and feeds that data back into the conversation via a tool workflow. The model can stream tool-related information progressively, such as showing Pluto’s moons while the tool call expands.
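A tool workflow like the ISS lookup has two halves: advertising the tool to the model, and feeding the result back into the conversation. The sketch below follows the beta event shapes, but the tool name get_iss_position is hypothetical and the exact schema fields should be checked against the current docs.

```javascript
// Tool definition advertised to the model (e.g., via a session.update event)
// so it can request live ISS coordinates. The name and description are
// illustrative, not from the session's actual code.
const issTool = {
  type: "function",
  name: "get_iss_position",
  description: "Return the current latitude/longitude of the International Space Station.",
  parameters: { type: "object", properties: {}, required: [] },
};

// After the app runs the tool (e.g., fetches the live endpoint), it feeds
// the result back as a function_call_output item and asks for a new
// response that incorporates the data.
function makeToolResultEvents(callId, position) {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId, // echoes the id from the model's function-call event
        output: JSON.stringify(position),
      },
    },
    { type: "response.create" }, // prompt the model to speak using the result
  ];
}

const events = makeToolResultEvents("call_1", { latitude: 12.3, longitude: -45.6 });
```

The same pattern drives the 3D navigation and chart tools: the model emits a function call, the app performs the side effect, and the output item closes the loop.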
Finally, the session addresses cost and rollout. The Realtime API is in public beta, currently supporting speech, text, and function calling, with more modalities planned. It also introduces prompt caching for text and audio inputs: cached text inputs cost 50% less, cached audio inputs cost 80% less, and a typical 15-minute conversation is expected to cost about 30% less than at launch. The overall message is that speech-to-speech apps can be built with lower latency, better interruption handling, and deeper app integration—setting the stage for more natively multimodal capabilities in future frontier-model releases.
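The caching discounts translate into simple arithmetic: cached text input is billed at 50% of the base rate and cached audio input at 20% (an 80% discount). A back-of-the-envelope helper, using placeholder per-token rates rather than real prices:

```javascript
// Estimate input cost with prompt caching. Rates here are placeholders;
// only the 50% / 80% discount factors come from the announcement.
function cachedInputCost(
  { textTokens, audioTokens, cachedTextTokens, cachedAudioTokens },
  { textRate, audioRate },
) {
  const text =
    (textTokens - cachedTextTokens) * textRate +
    cachedTextTokens * textRate * 0.5; // cached text billed at half rate
  const audio =
    (audioTokens - cachedAudioTokens) * audioRate +
    cachedAudioTokens * audioRate * 0.2; // cached audio billed at one fifth
  return text + audio;
}

// Example with placeholder rates (text = 1 unit/token, audio = 10 units/token):
const cost = cachedInputCost(
  { textTokens: 1000, audioTokens: 1000, cachedTextTokens: 500, cachedAudioTokens: 500 },
  { textRate: 1, audioRate: 10 },
);
// Without caching this example would cost 11000 units; with it, 6750.
```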
Cornell Notes
The Realtime API enables low-latency, natural voice experiences by using a model that natively understands and generates speech. That removes the need to stitch together separate transcription, text generation, and text-to-speech steps, which previously added delay and made interruptions harder. A stateful WebSocket connection streams audio in real time and streams audio back as it’s generated, with explicit events to detect when the user starts speaking so playback can be interrupted cleanly. The API also supports function calling, letting voice interactions trigger app actions like 3D navigation, charts, and live data lookups (e.g., International Space Station position). Prompt caching for text and audio inputs reduces costs for repeated inputs.
Why did voice assistants feel awkward before the Realtime API, and what changes with native speech understanding?
How does the Realtime API keep interactions responsive in practice?
What makes interruptions work, and what information must the client send?
What does a minimal browser implementation need to do?
How do tool calls extend voice into interactive apps?
What cost reductions were announced via prompt caching?
Review Questions
- How does streaming audio in both directions (WebSocket + audio deltas) change the user experience compared with waiting for a full response?
- What specific event and metadata are needed to implement natural interruption behavior in a speech-to-speech client?
- How do tool calls (function calling) turn a voice assistant from a Q&A system into an interactive application?
Key Points
1. The Realtime API is designed for low-latency speech-in, speech-out by using a model that natively understands and generates speech.
2. The API avoids multi-model pipelines (Whisper transcription → language model → text-to-speech), which previously increased delay and reduced conversational fluidity.
3. A stateful WebSocket connection (/v1/realtime) streams audio input to the server and streams audio output back as it is produced.
4. Interruption support relies on detecting when the user starts speaking and sending interruption metadata (including the played offset and context identifiers) so the model can respond coherently.
5. Function calling enables voice interactions to trigger app actions like 3D navigation, chart rendering, and live data fetches.
6. Prompt caching lowers costs: cached text inputs cost 50% less, cached audio inputs cost 80% less, with an estimated ~30% reduction for a typical 15-minute conversation.
7. The public beta currently supports speech, text, and function calling, with additional modalities planned.