OpenAI DevDay | Realtime Speech to Speech API + Image Fine-tuning TESTED
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s DevDay announcements center on a new Realtime Speech-to-Speech API aimed at letting developers build voice experiences with low latency—without the usual multi-step pipeline that converts speech to text and back again. Instead of stitching together transcription and synthesis, the API supports native speech-to-speech output, keeps nuance like tone and inflection, and can handle function calling. The tradeoff is cost: audio output is priced at roughly $200 per 1 million audio output tokens (the transcript also notes about “$0.25 per minute” of output), making experimentation feasible but production deployments something developers will need to budget carefully.
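To make the budgeting point concrete, here is a minimal cost sketch using only the two figures quoted above; the call-volume numbers in the example are made-up assumptions for illustration, not figures from the video.

```python
# Minimal cost sketch using the transcript's pricing figures.
# The call volumes in the example are illustrative assumptions.

PRICE_PER_M_OUTPUT_AUDIO_TOKENS = 200.00  # USD per 1M audio output tokens (transcript)
APPROX_PRICE_PER_OUTPUT_MINUTE = 0.25     # USD per minute of spoken output (transcript)


def token_cost(output_audio_tokens: int) -> float:
    """Cost of one response given its audio output token count."""
    return output_audio_tokens / 1_000_000 * PRICE_PER_M_OUTPUT_AUDIO_TOKENS


def monthly_minutes_cost(minutes_per_call: float, calls_per_day: float, days: int = 30) -> float:
    """Rough monthly spend estimated from spoken-output minutes alone."""
    return minutes_per_call * calls_per_day * days * APPROX_PRICE_PER_OUTPUT_MINUTE


if __name__ == "__main__":
    print(f"single 5k-token reply: ${token_cost(5_000):.2f}")                        # $1.00
    print(f"2 min/call, 500 calls/day: ${monthly_minutes_cost(2, 500):,.2f}/month")  # $7,500.00
```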
A key motivation for the Realtime API is latency and emotional fidelity. The older approach—speech-to-text (e.g., Whisper), then text-to-speech—tends to strip away emphasis, emotion, and accent because the model only “sees” text in the middle. With Realtime, the system can produce more natural, nuanced speech directly from audio or from text prompts, and it can accept different input modes: sending user text, sending user audio, or streaming user audio over a websocket connection. The transcript also highlights that this design can reduce engineering complexity compared with keeping a separate connection open for other components, while still enabling fast, interactive behavior.
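As a rough sketch of the websocket flow, the snippet below sends a session configuration and one text turn, then reads events until the response finishes. The endpoint URL, headers, and event names (session.update, conversation.item.create, response.create) reflect my reading of OpenAI's Realtime API rather than anything shown in the video, so treat them as assumptions.

```python
# Sketch of the "send user text over a websocket" input mode described above.
# URL, headers, and event names are assumptions based on OpenAI's Realtime API
# docs and may differ from the version used in the video.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}


async def main() -> None:
    # Note: newer websockets releases rename extra_headers to additional_headers.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Configure the session: voice output plus any persona instructions.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"modalities": ["text", "audio"], "voice": "alloy"},
        }))
        # Input mode 1: send user text and ask the model to respond with speech.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user",
                     "content": [{"type": "input_text", "text": "Tell me a short joke."}]},
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        # Stream server events; audio arrives as base64 chunks in response.audio.delta.
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.done":
                break


asyncio.run(main())
```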
Alongside speech, DevDay introduced Vision fine-tuning for GPT-4o, plus two other API efficiency features: prompt caching and model distillation. Prompt caching is described as applied automatically to GPT-4o mini (the transcript refers to the “latest version” of GPT-4o mini as the one that gets caching), reducing repeated work when the same prompt prefix recurs. Model distillation is positioned as a way to create more cost-efficient models, particularly useful when building datasets and optimizing for price.
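A minimal sketch of how a developer might arrange requests to benefit from that caching: keep the long, unchanging instructions as a stable prefix and append only the per-request input. The usage-field lookup for cached tokens is based on my understanding of the current API and may not match the exact version discussed in the video.

```python
# Structure requests so repeated calls share a cacheable prefix: the long,
# static system prompt comes first, the per-request user input comes last.
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = (
    "You are a support assistant for ExampleCo.\n"
    "Policy section 1: ...\n"
    "Policy section 2: ...\n"
    # ...several hundred more tokens of fixed policy text...
)


def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # stable, cacheable prefix
            {"role": "user", "content": question},                # varies per call
        ],
    )
    # cached_tokens reporting is an assumption about the current usage payload.
    details = getattr(response.usage, "prompt_tokens_details", None)
    if details is not None:
        print("cached prompt tokens:", getattr(details, "cached_tokens", 0))
    return response.choices[0].message.content
```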
The transcript then shifts from announcements to hands-on testing. For Realtime, the tester runs instruction-following prompts through a LiveKit playground, including a “comedian” persona that laughs loudly and a keyword-extraction task that returns only the most important terms from a spoken rant. A roleplay prompt (“act like a parrot,” mimic words, and laugh) demonstrates that the system can follow style and behavioral constraints in speech. The results are presented as promising—especially for interactive, voice-first applications—while the cost and access limitations (a 403 error when trying to run a local Node setup with the API key) remain practical blockers.
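The persona prompts below are paraphrased from the tests described above and packaged as Realtime session updates; the payload shape follows the same assumptions as the earlier websocket sketch, and the wording is illustrative rather than the tester's exact prompts.

```python
# Illustrative persona instructions like those tested in the video, expressed
# as session.update events (payload shape assumed, prompts paraphrased).
import json

PERSONAS = {
    "comedian": "You are a stand-up comedian. Answer everything as a joke and laugh loudly at your own punchlines.",
    "keyword_extractor": "Listen to the user's rant and reply with only the most important keywords, nothing else.",
    "parrot": "Act like a parrot: mimic a few of the user's words back, then laugh.",
}


def session_update(persona: str) -> str:
    """Build a session.update event that applies one of the test personas."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": PERSONAS[persona], "voice": "alloy"},
    })
```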
For Vision fine-tuning, the tester builds a small dataset from Reddit screenshots, aiming to extract structured fields (title and Reddit upvotes/uploads) while ignoring the surrounding “fluff.” The workflow uses a JSONL-style training file that pairs an image URL and a question prompt with a CSV-like target answer, then fine-tunes GPT-4o. Multiple training attempts fail due to moderation or image-accessibility issues, but a successful run produces a fine-tuned “Reddit extractor” model that returns the expected structured output in the playground. The tester reports slow inference (around 20 seconds) but strong format adherence and correct numeric extraction on follow-up examples.
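A sketch of what that training data and fine-tune call might look like: the chat-style JSONL record layout and the gpt-4o snapshot name come from OpenAI's vision fine-tuning documentation as I understand it, and the screenshot URL and answer values are hypothetical placeholders, not the tester's actual data.

```python
# Sketch of the dataset + fine-tune step: each JSONL line pairs a Reddit
# screenshot URL and a question with the CSV-style answer we want back.
import json

from openai import OpenAI

client = OpenAI()

example = {
    "messages": [
        {"role": "system", "content": "Extract the post title and upvote count as CSV: title,upvotes"},
        {"role": "user", "content": [
            {"type": "text", "text": "Extract the fields from this Reddit screenshot."},
            {"type": "image_url", "image_url": {"url": "https://example.com/reddit_post_001.png"}},  # hypothetical URL
        ]},
        {"role": "assistant", "content": "\"My cat learned to open doors\",4213"},  # hypothetical target
    ]
}

with open("reddit_train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")  # repeat for every screenshot in the set

training_file = client.files.create(file=open("reddit_train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-4o-2024-08-06")
print(job.id, job.status)
```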
Overall, the transcript frames Realtime Speech-to-Speech as the most immediately transformative capability for building fast, natural voice agents, while Vision fine-tuning is shown as viable for narrow, high-precision extraction tasks—provided dataset curation and moderation hurdles are managed. Prompt caching and distillation are treated as complementary tools for lowering latency and cost as developers scale.
Cornell Notes
OpenAI’s DevDay introduces a Realtime Speech-to-Speech API designed to deliver low-latency, natural-sounding voice output without the usual speech→text→speech pipeline. The transcript emphasizes that this approach can preserve nuance like tone and inflection and supports interactive features such as function calling. Developers also get Vision fine-tuning for GPT-4o, plus prompt caching (automatically applied for GPT-4o mini) and model distillation for cost-efficient models. Hands-on testing shows Realtime can follow spoken-style instructions (comedian, keyword extractor, parrot roleplay), while Vision fine-tuning can reliably extract structured fields (e.g., Reddit title and upvotes) from screenshots after overcoming dataset moderation/access issues.
Why does a Realtime Speech-to-Speech API matter compared with a speech→text→speech pipeline?
What practical cost and engineering constraints show up when using the Realtime API?
How does prompt caching reduce compute, and where is it applied?
What does Vision fine-tuning enable in the transcript’s example use case?
What went wrong during dataset preparation for Vision fine-tuning, and what succeeded anyway?
What performance characteristics were observed for the fine-tuned Vision model?
Review Questions
- What specific limitations of the speech→text→speech approach does the Realtime API aim to overcome, and how does that affect emotional fidelity?
- In the Vision fine-tuning workflow described, what is the structure of the training example (inputs and expected outputs), and why does the tester prefer a CSV-like target format?
- What kinds of failures can occur during Vision fine-tuning dataset ingestion (e.g., moderation/accessibility), and how did the tester recover to reach a successful fine-tune?
Key Points
1. OpenAI’s Realtime Speech-to-Speech API targets low-latency voice experiences by avoiding a speech→text→speech middle step that can strip emotion and emphasis.
2. Realtime supports multiple input modes (text, audio, streamed audio) and enables interactive behaviors like function calling, but it often requires websocket-based streaming for audio.
3. Realtime usage can be expensive, with audio output pricing around $200 per 1 million audio output tokens, so production plans need cost controls.
4. Prompt caching is positioned as automatic for GPT-4o mini, reducing repeated computation when prompts recur.
5. Vision fine-tuning for GPT-4o can be used for narrow, high-precision extraction tasks (e.g., pulling Reddit title and upvote/upload counts from screenshots).
6. Dataset preparation for Vision fine-tuning can fail due to moderation or accessibility issues; repeated retries and corrected links may be necessary.
7. Even when fine-tuning succeeds, inference latency in the playground can be noticeable (the transcript cites ~20 seconds for an image query); see the timing sketch after this list.
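For reference, a minimal timing sketch for querying a fine-tuned vision model from the API, analogous to the playground test described above; the fine-tuned model ID and image URL are placeholders, not the actual artifacts from the video.

```python
# Time one playground-style call against the fine-tuned extractor, mirroring
# the ~20 s latency the transcript mentions. Model ID and URL are placeholders.
import time

from openai import OpenAI

client = OpenAI()

FINE_TUNED_MODEL = "ft:gpt-4o-2024-08-06:your-org:reddit-extractor:abc123"  # hypothetical ID

start = time.perf_counter()
response = client.chat.completions.create(
    model=FINE_TUNED_MODEL,
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract the fields from this Reddit screenshot."},
        {"type": "image_url", "image_url": {"url": "https://example.com/reddit_post_test.png"}},
    ]}],
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.1f}s -> {response.choices[0].message.content}")
```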