OpenAI DevDay 2024 | Multimodal apps with the Realtime API
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
The Realtime API is designed for low-latency speech-in, speech-out by using a model that natively understands and generates speech.
Briefing
OpenAI’s Realtime API is built to deliver natural, low-latency “speech-in, speech-out” experiences through a single interface—removing the multi-step glue work that previously made voice assistants feel slow and brittle. Instead of stitching together separate systems for speech capture, transcription, text generation, and text-to-speech, the Realtime API uses a model that natively understands and generates speech. That design enables faster turn-taking, more fluid interruption behavior, and richer audio nuance because the system can process audio directly rather than converting everything into text first.
The session contrasts the “old way” with the new approach. Traditional pipelines typically require detecting the end of speech (or a push-to-talk gesture), then transcription with OpenAI’s Whisper model, then a text response from a language model (GPT-4o in the example), and finally a third model to synthesize speech. Each stage adds delay and can degrade conversational flow—especially when users interrupt mid-response. The Realtime API collapses those steps into a single real-time stream, leveraging the same underlying speech capability that powers advanced voice mode in ChatGPT.
A key technical shift is how the API maintains responsiveness: apps open a stateful WebSocket connection to the Realtime endpoint (/v1/realtime). Audio input can be streamed as it is spoken, while model output audio is streamed back as soon as it is produced. Messages are JSON events, including audio deltas for playback and events that signal when the user starts speaking. That event support is what makes interruptions practical: when a user cuts in, the client stops the currently playing audio and sends interruption metadata (such as the played offset and context identifiers) so the model can align its next output with what was actually heard.
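The client-side half of that event flow can be sketched as plain message builders. The event types below (input_audio_buffer.append, conversation.item.truncate) follow the Realtime API beta docs, but the helper names are ours and exact field shapes should be checked against the current reference; this is a sketch, not a complete client.

```javascript
// Sketch of the client-side Realtime events; send each with ws.send(JSON.stringify(event)).

// Wrap a chunk of PCM16 microphone audio in an append event.
// (Buffer is Node's base64 encoder; a browser client would use btoa instead.)
function makeAppendEvent(pcm16Chunk) {
  return {
    type: "input_audio_buffer.append",
    audio: Buffer.from(pcm16Chunk.buffer).toString("base64"),
  };
}

// When the server signals the user started speaking, the client stops local
// playback and tells the model how much of its audio was actually heard,
// so the next response aligns with what the user perceived.
function makeTruncateEvent(itemId, playedMs) {
  return {
    type: "conversation.item.truncate",
    item_id: itemId,       // identifier of the assistant message being cut off
    content_index: 0,
    audio_end_ms: playedMs, // playback offset reached before the interruption
  };
}

const append = makeAppendEvent(new Int16Array([0, 1024, -1024]));
const truncate = makeTruncateEvent("item_123", 2350);
```

In a real client, makeAppendEvent would be called continuously from the microphone capture callback, and makeTruncateEvent in response to a speech-started event from the server.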
The session also highlights practical developer building blocks. A live coding walkthrough shows a browser-based client built on the Web Audio API: it decodes the PCM16 audio (24 kHz) returned by the API, plays it, and records microphone audio to send back as input_audio_buffer.append messages. The API supports other audio codecs (including G.711), and it can handle function calling alongside speech.
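The decode/encode step mentioned above comes down to sample-format conversion: the API streams raw PCM16, while the Web Audio API works with Float32 samples in [-1, 1]. A minimal sketch of both directions (helper names are ours; the conversion math is the standard one):

```javascript
// Decode PCM16 samples from the API into Float32 for a Web Audio AudioBuffer.
function pcm16ToFloat32(int16Samples) {
  const out = new Float32Array(int16Samples.length);
  for (let i = 0; i < int16Samples.length; i++) {
    out[i] = int16Samples[i] / 32768; // map [-32768, 32767] onto ~[-1, 1]
  }
  return out;
}

// Encode microphone Float32 samples back to PCM16 before appending them
// to the input audio buffer.
function float32ToPcm16(float32Samples) {
  const out = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp to valid range
    out[i] = s < 0 ? s * 32768 : s * 32767; // asymmetric Int16 range
  }
  return out;
}

const floats = pcm16ToFloat32(new Int16Array([-32768, 0, 32767]));
```

In the browser, the Float32 samples would then be copied into an AudioBuffer (created at 24000 Hz to match the API's output rate) and played through an AudioContext.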
To demonstrate real-world integration, a 3D solar-system tutoring app uses tool calls to trigger visual changes when users ask about planets and to display charts when answers depend on structured data. The app also fetches live information—such as the International Space Station’s current latitude/longitude from a real-time endpoint—and feeds that data back into the conversation via a tool workflow. The model can stream tool-related information progressively, such as showing Pluto’s moons while the tool call expands.
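A tool workflow like the ISS lookup has two halves: advertising the tool to the model, and feeding the result back into the conversation. The sketch below follows the beta event shapes, but the tool name get_iss_position is hypothetical and the exact schema fields should be checked against the current docs.

```javascript
// Tool definition advertised to the model (e.g., via a session.update event)
// so it can request live ISS coordinates. The name and description are
// illustrative, not from the session's actual code.
const issTool = {
  type: "function",
  name: "get_iss_position",
  description: "Return the current latitude/longitude of the International Space Station.",
  parameters: { type: "object", properties: {}, required: [] },
};

// After the app runs the tool (e.g., fetches the live endpoint), it feeds
// the result back as a function_call_output item and asks for a new
// response that incorporates the data.
function makeToolResultEvents(callId, position) {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId, // echoes the id from the model's function-call event
        output: JSON.stringify(position),
      },
    },
    { type: "response.create" }, // prompt the model to speak using the result
  ];
}

const events = makeToolResultEvents("call_1", { latitude: 12.3, longitude: -45.6 });
```

The same pattern drives the 3D navigation and chart tools: the model emits a function call, the app performs the side effect, and the output item closes the loop.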
Finally, the session addresses cost and rollout. The Realtime API is in public beta, currently supporting speech, text, and function calling, with more modalities planned. It also introduces prompt caching for text and audio inputs: cached text inputs cost 50% less, cached audio inputs cost 80% less, and a typical 15-minute conversation is expected to cost about 30% less than at launch. The overall message is that speech-to-speech apps can be built with lower latency, better interruption handling, and deeper app integration—setting the stage for more natively multimodal capabilities in future frontier-model releases.
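The caching discounts translate into simple arithmetic: cached text input is billed at 50% of the base rate and cached audio input at 20% (an 80% discount). A back-of-the-envelope helper, using placeholder per-token rates rather than real prices:

```javascript
// Estimate input cost with prompt caching. Rates here are placeholders;
// only the 50% / 80% discount factors come from the announcement.
function cachedInputCost(
  { textTokens, audioTokens, cachedTextTokens, cachedAudioTokens },
  { textRate, audioRate },
) {
  const text =
    (textTokens - cachedTextTokens) * textRate +
    cachedTextTokens * textRate * 0.5; // cached text billed at half rate
  const audio =
    (audioTokens - cachedAudioTokens) * audioRate +
    cachedAudioTokens * audioRate * 0.2; // cached audio billed at one fifth
  return text + audio;
}

// Example with placeholder rates (text = 1 unit/token, audio = 10 units/token):
const cost = cachedInputCost(
  { textTokens: 1000, audioTokens: 1000, cachedTextTokens: 500, cachedAudioTokens: 500 },
  { textRate: 1, audioRate: 10 },
);
// Without caching this example would cost 11000 units; with it, 6750.
```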
Cornell Notes
The Realtime API enables low-latency, natural voice experiences by using a model that natively understands and generates speech. That removes the need to stitch together separate transcription, text generation, and text-to-speech steps, which previously added delay and made interruptions harder. A stateful WebSocket connection streams audio in real time and streams audio back as it’s generated, with explicit events to detect when the user starts speaking so playback can be interrupted cleanly. The API also supports function calling, letting voice interactions trigger app actions like 3D navigation, charts, and live data lookups (e.g., International Space Station position). Prompt caching for text and audio inputs reduces costs for repeated inputs.
Why did voice assistants feel awkward before the Realtime API, and what changes with native speech understanding?
How does the Realtime API keep interactions responsive in practice?
What makes interruptions work, and what information must the client send?
What does a minimal browser implementation need to do?
How do tool calls extend voice into interactive apps?
What cost reductions were announced via prompt caching?
Review Questions
- How does streaming audio in both directions (WebSocket + audio deltas) change the user experience compared with waiting for a full response?
- What specific event and metadata are needed to implement natural interruption behavior in a speech-to-speech client?
- How do tool calls (function calling) turn a voice assistant from a Q&A system into an interactive application?
Key Points
1. The Realtime API is designed for low-latency speech-in, speech-out by using a model that natively understands and generates speech.
2. The API avoids multi-model pipelines (Whisper transcription → language model → text-to-speech), which previously increased delay and reduced conversational fluidity.
3. A stateful WebSocket connection (/v1/realtime) streams audio input to the server and streams audio output back as it is produced.
4. Interruption support relies on detecting when the user starts speaking and sending interruption metadata (including the played offset and context identifiers) so the model can respond coherently.
5. Function calling enables voice interactions to trigger app actions like 3D navigation, chart rendering, and live data fetches.
6. Prompt caching lowers costs: cached text inputs cost 50% less, cached audio inputs cost 80% less, with an estimated ~30% reduction for a typical 15-minute conversation.
7. The public beta currently supports speech, text, and function calling, with additional modalities planned.