
BIG UPDATE: AI Agent Now Calls And Book Appointments - OpenAI Realtime API

All About AI · 4 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI Realtime API updates add speech-to-speech calling with five new expressive, steerable voices.

Briefing

A new OpenAI Realtime API update is making AI phone agents more practical and more natural at booking appointments—now with speech-to-speech calling, more expressive steerable voices, and lower operating costs via prompt caching. In live examples, the agent calls dental offices, asks about availability for a specific day, and then extracts structured details like the business name and whether appointments exist, even when callers don’t provide all the context the agent later references.

One recorded call with Lumia Dental shows the agent successfully navigating a typical scheduling flow: it asks for Friday availability, clarifies the visit type (“cleaning and a checkup”), and offers specific time slots (9:45 or 12:30). The system then captures the outcome—an appointment is available—along with the company name. Across eight conversations, six completed successfully and two failed; the creator described the agent’s completion rate as high.

The update’s technical headline is speech-to-speech capability powered by five new voices. Compared with earlier voice options, these voices are described as more expressive and easier to steer, producing speech that sounds more natural during the back-and-forth of a phone call. Cost is the other major lever: prompt caching reduces pricing for cached text inputs by 50% and for other inputs by 80%, which the creator frames as a key step toward making autonomous calling financially viable.

The setup relies on a pipeline that turns audio into text and then pulls out only the information needed. Calls are transcribed using Whisper, and structured outputs from OpenAI are used to extract fields such as the company name, whether an appointment is available, and the relevant day. The agent’s behavior is guided by a system message and simple instructions—introduce itself, ask for appointments on Friday, and handle follow-ups like “call back to confirm.”
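The extraction step described above can be sketched with OpenAI's structured-outputs JSON schema format. The field names (`company_name`, `appointment_available`, `day`) follow the fields mentioned in the video, but the schema shape and helper function are illustrative assumptions, not the creator's actual code:

```python
import json

# JSON schema for the structured-outputs step. Field names mirror the
# fields described in the video; the exact schema is an illustrative guess.
CALL_RESULT_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "call_result",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "company_name": {"type": "string"},
                "appointment_available": {"type": "boolean"},
                "day": {"type": "string"},
            },
            "required": ["company_name", "appointment_available", "day"],
            "additionalProperties": False,
        },
    },
}

def parse_call_result(raw: str) -> dict:
    """Parse the model's structured-output JSON into a plain dict.

    In the full pipeline, `raw` would come from a chat completion called
    with `response_format=CALL_RESULT_SCHEMA` on a transcript produced by
    Whisper (client.audio.transcriptions.create(model="whisper-1", ...)).
    """
    result = json.loads(raw)
    missing = {"company_name", "appointment_available", "day"} - result.keys()
    if missing:
        raise ValueError(f"structured output missing fields: {missing}")
    return result

# What a successful Lumia Dental call might produce:
example = parse_call_result(
    '{"company_name": "Lumia Dental",'
    ' "appointment_available": true, "day": "Friday"}'
)
print(example["appointment_available"])  # True
```

With `strict: True`, the model is constrained to emit exactly these fields, which is why the post-call review step can rely on them being present.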

In practice, the agent can also handle partial or unexpected prompts. Studio Smiles Dental Office responded that Friday was fully booked; the agent remained polite, and the extracted data still captured the key fact: no appointments available. A second “best” example with Expert Dental was more nuanced. The caller never explicitly said it was a new patient or specified “Friday the 8th,” yet the agent still asked about being a new patient and later confirmed “Friday the 8th,” then offered times (9:00 a.m., 10:00 a.m., or 2:00 p.m.). The creator flags an ethical unease—whether this kind of assumption or improvisation is acceptable—but argues it may become normal as these systems spread.

Overall, the update positions AI calling as a near-term automation tool for appointment scheduling: it can conduct realistic phone conversations, extract structured scheduling data reliably, and do so with improved voice quality and reduced cost—while raising new questions about consent, transparency, and how much the agent should infer beyond what the caller provides.

Cornell Notes

OpenAI’s Realtime API update enables AI agents to handle speech-to-speech phone calls with five new voices that are more expressive and steerable. In appointment-booking trials, the agent calls dental offices, asks about Friday availability, and uses Whisper transcription plus OpenAI structured outputs to extract key details like company name and whether appointments exist. The system also benefits from prompt caching, cutting costs for cached text inputs (50%) and other inputs (80%), making autonomous calling more viable. Examples show the agent can succeed even when callers omit details, though that improvisation raises ethical questions about assumptions during real phone interactions.

What changed in the OpenAI Realtime API that improves real phone-calling agents?

The update adds speech-to-speech experiences using five new voices. These voices are described as more expressive and more steerable than earlier options, producing speech that sounds more natural during live conversations. It also introduces prompt caching price reductions: cached text inputs are discounted 50%, and other inputs are discounted 80%, which lowers the cost of running the agent.
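To make the discounts concrete, here is a back-of-envelope sketch. The 50% and 80% figures come from the video; the base rate and token count below are invented purely for illustration:

```python
# Effect of prompt caching on per-call input cost. The 50% / 80% discounts
# are the figures from the update; the base rate and token count are
# hypothetical, chosen only to show the relative savings.
def input_cost(tokens: int, rate_per_token: float, discount: float) -> float:
    """Cost of `tokens` input tokens after a fractional caching discount."""
    return tokens * rate_per_token * (1 - discount)

BASE_RATE = 0.00001   # hypothetical dollars per input token
PROMPT_TOKENS = 3000  # hypothetical reused system prompt + instructions

uncached = input_cost(PROMPT_TOKENS, BASE_RATE, 0.0)
cached_text = input_cost(PROMPT_TOKENS, BASE_RATE, 0.5)   # 50% off
cached_other = input_cost(PROMPT_TOKENS, BASE_RATE, 0.8)  # 80% off

print(f"uncached: ${uncached:.4f}, cached text: ${cached_text:.4f}, "
      f"other cached: ${cached_other:.4f}")
```

Because an autonomous calling agent resends the same system prompt and instructions on every call, the discounted portion dominates at scale, which is why the creator frames caching as the key economic change.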

How does the agent turn a phone call into structured appointment data?

The pipeline transcribes the call audio into text using Whisper. From that transcription, OpenAI structured outputs extract only the needed fields—such as the company name, whether an appointment is available, and the relevant day/time information. The extracted results are then stored and reviewed after calls.
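The "stored and reviewed" step could be as simple as appending each call's extracted fields to a JSONL log. This is a minimal sketch under that assumption, not the creator's actual storage code:

```python
import json
import tempfile
from pathlib import Path

def log_call_result(log_path: Path, result: dict) -> None:
    """Append one call's extracted fields as a JSON line for later review."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")

def load_call_results(log_path: Path) -> list[dict]:
    """Read back every logged call for post-run review."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a temporary file and outcomes like those in the video:
log = Path(tempfile.mkdtemp()) / "calls.jsonl"
log_call_result(log, {"company_name": "Lumia Dental",
                      "appointment_available": True, "day": "Friday"})
log_call_result(log, {"company_name": "Studio Smiles",
                      "appointment_available": False, "day": "Friday"})
results = load_call_results(log)
print(len(results))  # 2
```

One JSON object per line keeps the log append-only during calls while remaining easy to load into a spreadsheet or dataframe afterward.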

What did the Lumia Dental call demonstrate about the agent’s scheduling workflow?

The agent asked for Friday availability and clarified the visit type as a cleaning and checkup. It then offered specific times (9:45 or 12:30). The captured outcome indicated that an appointment was available, and the company name was recorded as part of the structured output.

How did the agent handle a fully booked office?

In the Studio Smiles Dental Office call, the office said Friday was fully booked. Even so, the agent remained polite and the structured extraction still captured the key scheduling fact: no appointments available.

Why did the Expert Dental example raise an ethical concern?

The caller did not explicitly mention being a new patient or specify “Friday the 8th,” but the agent still asked whether the caller was a current patient and later confirmed “Friday the 8th.” It then offered times (9:00 a.m., 10:00 a.m., or 2:00 p.m.). The concern is whether the agent’s assumptions or improvisation are appropriate in real-world phone interactions.

Review Questions

  1. What roles do Whisper transcription and OpenAI structured outputs play in the appointment-calling pipeline?
  2. How do prompt caching discounts (50% and 80%) change the economics of running an autonomous calling agent?
  3. In the examples, what kinds of missing or implied user details did the agent successfully infer—and why might that be ethically sensitive?

Key Points

  1. OpenAI Realtime API updates add speech-to-speech calling with five new expressive, steerable voices.

  2. Prompt caching reduces costs: cached text inputs are discounted 50% and other inputs are discounted 80%.

  3. The calling system uses Whisper to transcribe audio and OpenAI structured outputs to extract scheduling fields like company name and appointment availability.

  4. In trial calls, the agent captured outcomes such as available time slots (e.g., Lumia Dental) and “fully booked” status (e.g., Studio Smiles).

  5. The agent can sometimes infer missing details during conversation, which can improve outcomes but raises ethical questions about assumptions.

  6. Across eight conversations, six completed successfully, with two failures reported.

Highlights

The agent successfully booked a dental appointment by asking about Friday availability, confirming the visit type, and offering specific times (9:45 or 12:30).
Prompt caching is positioned as a major cost breakthrough, with 50% and 80% discounts depending on input type.
Even when an office was fully booked, the agent still extracted the key scheduling result while staying polite.
An Expert Dental call showed the agent asking about “new patient” status and confirming “Friday the 8th” without those details being explicitly provided.
