BIG UPDATE: AI Agent Now Calls and Books Appointments - OpenAI Realtime API
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
OpenAI Realtime API updates add speech-to-speech calling with five new expressive, steerable voices.
Briefing
A new OpenAI Realtime API update is making AI phone agents more practical and more natural at booking appointments—now with speech-to-speech calling, more expressive steerable voices, and lower operating costs via prompt caching. In live examples, the agent calls dental offices, asks about availability for a specific day, and then extracts structured details like the business name and whether appointments exist, even when callers don’t provide all the context the agent later references.
One recorded call with Lumia Dental shows the agent successfully navigating a typical scheduling flow: it asks for Friday availability, clarifies the visit type (“cleaning and a checkup”), and offers specific time slots (9:45 or 12:30). The system then captures the outcome—an appointment is available—along with the company name. Across multiple attempts, results were strong: of eight conversations run, six completed successfully and two failed, a completion rate the creator described as high.
The update’s technical headline is speech-to-speech capability powered by five new voices. Compared with earlier voice options, these voices are described as more expressive and easier to steer, producing speech that sounds more natural during the back-and-forth of a phone call. Cost is the other major lever: prompt caching reduces pricing for cached text inputs by 50% and for cached audio inputs by 80%, which the creator frames as a key step toward making autonomous calling financially viable.
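As a rough illustration of how those discounts change per-call economics, here is a back-of-envelope sketch. The per-token price is a placeholder, not OpenAI's actual rate; only the 50% cached-text discount mentioned above is applied.

```python
# Back-of-envelope cost model for prompt caching. The per-token price is
# a placeholder, NOT OpenAI's actual rate; the 50% discount is the
# cached-text figure cited above.
PRICE_PER_TOKEN = 0.000005   # hypothetical $ per input token
CACHED_TEXT_DISCOUNT = 0.50  # 50% off cached text inputs

def call_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Cost of one request: discounted cached prefix + full-price new tokens."""
    cached = cached_tokens * PRICE_PER_TOKEN * (1 - CACHED_TEXT_DISCOUNT)
    fresh = fresh_tokens * PRICE_PER_TOKEN
    return cached + fresh

# A reused 2,000-token system prompt (cache hit) plus a 200-token new turn,
# versus the same request with no cache hit:
print(call_cost(2000, 200))  # ~0.006
print(call_cost(0, 2200))    # ~0.011
```

Because a calling agent repeats the same long system prompt on every call, most input tokens hit the cache, so the discount compounds across a batch of calls.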
The setup relies on a pipeline that turns audio into text and then pulls out only the information needed. Calls are transcribed using Whisper, and structured outputs from OpenAI are used to extract fields such as the company name, whether an appointment is available, and the relevant day. The agent’s behavior is guided by a system message and simple instructions—introduce itself, ask for appointments on Friday, and handle follow-ups like “call back to confirm.”
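The extraction step described above can be sketched as follows. The field names are assumptions based on this summary (the video does not show the exact schema), and the Whisper and chat calls are indicated in comments rather than executed.

```python
import json

# Hypothetical schema for the structured-output step; field names are
# assumed from the summary, not taken from the actual project.
SCHEMA_FIELDS = {"company_name": str, "appointment_available": bool, "day": str}

def extract_outcome(raw_json: str) -> dict:
    """Validate the JSON that a structured-output call would return.

    In the full pipeline this string would come from an OpenAI chat
    completion run over a Whisper transcript of the call, e.g.:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file).text
    followed by a chat request constrained to this schema.
    """
    data = json.loads(raw_json)
    for field, expected_type in SCHEMA_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

# Example payload shaped like the Lumia Dental outcome:
result = extract_outcome(
    '{"company_name": "Lumia Dental", "appointment_available": true, "day": "Friday"}'
)
print(result["appointment_available"])  # True
```

Validating the model's output against a fixed schema is what makes the downstream step ("is an appointment available, and where?") reliable even when the conversation itself wanders.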
In practice, the agent can also handle partial or unexpected prompts. Studio Smiles Dental Office responded that Friday was fully booked; the agent remained polite, and the extracted data still captured the key fact: no appointments available. A second “best” example with Expert Dental was more nuanced. The caller never explicitly said it was a new patient or specified “Friday the 8th,” yet the agent still asked about being a new patient and later confirmed “Friday the 8th,” then offered times (9:00 a.m., 10:00 a.m., or 2:00 p.m.). The creator flags an ethical unease—whether this kind of assumption or improvisation is acceptable—but argues it may become normal as these systems spread.
Overall, the update positions AI calling as a near-term automation tool for appointment scheduling: it can conduct realistic phone conversations, extract structured scheduling data reliably, and do so with improved voice quality and reduced cost—while raising new questions about consent, transparency, and how much the agent should infer beyond what the caller provides.
Cornell Notes
OpenAI’s Realtime API update enables AI agents to handle speech-to-speech phone calls with five new voices that are more expressive and steerable. In appointment-booking trials, the agent calls dental offices, asks about Friday availability, and uses Whisper transcription plus OpenAI structured outputs to extract key details like company name and whether appointments exist. The system also benefits from prompt caching, cutting costs for cached text inputs (50%) and cached audio inputs (80%), making autonomous calling more viable. Examples show the agent can succeed even when callers omit details, though that improvisation raises ethical questions about assumptions during real phone interactions.
- What changed in the OpenAI Realtime API that improves real phone-calling agents?
- How does the agent turn a phone call into structured appointment data?
- What did the Lumia Dental call demonstrate about the agent’s scheduling workflow?
- How did the agent handle a fully booked office?
- Why did the Expert Dental example raise an ethical concern?
Review Questions
- What roles do Whisper transcription and OpenAI structured outputs play in the appointment-calling pipeline?
- How do prompt caching discounts (50% and 80%) change the economics of running an autonomous calling agent?
- In the examples, what kinds of missing or implied user details did the agent successfully infer—and why might that be ethically sensitive?
Key Points
1. OpenAI Realtime API updates add speech-to-speech calling with five new expressive, steerable voices.
2. Prompt caching reduces costs: cached text inputs are discounted 50% and cached audio inputs are discounted 80%.
3. The calling system uses Whisper to transcribe audio and OpenAI structured outputs to extract scheduling fields like company name and appointment availability.
4. In trial calls, the agent captured outcomes such as available time slots (e.g., Lumia Dental) and “fully booked” status (e.g., Studio Smiles).
5. The agent can sometimes infer missing details during conversation, which can improve outcomes but raises ethical questions about assumptions.
6. Across eight conversations, six completed successfully, with two failures reported.