How to automate inbound phone calls | Voice AI Agent · n8n · Twilio · Ultravox

Alex, PhD AI · 5 min read

Based on Alex, PhD AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Create the Ultravox agent first with a business-specific system prompt, formal/professional voice settings, and conversation scripts for lead qualification and follow-up.

Briefing

A practical voice-AI pipeline links inbound phone calls to an Ultravox conversational agent, with n8n acting as the glue and Twilio handling the telephony. When a customer calls a Twilio number, Twilio sends a webhook request to an n8n workflow; n8n then calls Ultravox to create a live media stream and returns the stream URL to Twilio. From that point, Twilio streams the caller’s audio directly to Ultravox, so a scripted agent can speak with the caller in real time, backed by prompts, tools, and retrieval-augmented generation (RAG).

The build starts inside Ultravox by creating an agent tailored to a specific business. Using a real-estate investment agency as the example, the agent is configured with a formal, professional male voice, English language output with a German accent, and a clear call purpose: qualify inbound leads, answer general questions, book appointments, explain investment options, and collect contact details. The agent is also instructed to transfer to a human when requested and to respond only to inbound calls. A system prompt is generated using a “prompt builder” approach, then refined with business facts such as portfolio size and entry investment thresholds (e.g., luxury vacation property co-ownership starting around €169,000, plus details like total portfolio value and property count).

To make the agent knowledgeable about the company, the workflow adds RAG sourced from the agency’s website. A new source is created from the domain, with the crawl depth set to include sublinks, and the crawled pages are then processed into text chunks and vector embeddings. In the example run, the website crawl produced 182 pages, split into 350 text chunks and converted into 1,213 vectors. After the RAG collection finishes processing, the agent is updated to use it, so answers can draw on the company’s actual materials rather than generic training data.
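
The “pages to chunks” step can be illustrated with a small sketch. The video does not describe how Ultravox splits pages internally, so the function below is a hypothetical stand-in: it assumes fixed-size word windows with a small overlap, which is one common way to chunk text before embedding. The name chunk_words and its parameters are illustrative only.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows.

    Hypothetical chunking scheme for illustration; Ultravox's actual
    chunking strategy is not described in the source.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


page = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(page, size=4, overlap=1):
    print(chunk)
```

Each chunk would then be passed to an embedding model to produce the vectors that the agent searches at answer time.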

The second half of the system is the n8n workflow. It begins with a webhook node designed to receive Twilio’s production webhook when a call arrives. The workflow then uses an HTTP Request node to invoke Ultravox’s “create agent call” API endpoint, passing the agent ID and authentication headers (including an Ultravox API key stored in n8n). A key implementation detail is a JavaScript step that constructs TwiML (Twilio’s XML-based call instruction format), embedding the Ultravox-generated media stream URL along with identifiers like the Call SID and caller number. If Ultravox is unreachable, the workflow falls back to a hard-coded apology message.
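
As a rough sketch of what the HTTP Request node sends, the function below assembles the “create agent call” request without sending it. The base URL, the path shape, the “X-API-Key” header name, and the medium body that asks for a Twilio-compatible stream are assumptions based on Ultravox’s public API conventions rather than values shown in the video; check them against the current Ultravox reference. The agent ID and key are placeholders.

```python
import json

# Assumed base URL; verify against the Ultravox API reference.
ULTRAVOX_API_BASE = "https://api.ultravox.ai/api"


def build_create_call_request(agent_id: str, api_key: str) -> dict:
    """Describe the HTTP request that asks Ultravox to start an agent call."""
    if not agent_id or not api_key:
        raise ValueError("agent_id and api_key are required")
    return {
        "method": "POST",
        "url": f"{ULTRAVOX_API_BASE}/agents/{agent_id}/calls",
        "headers": {
            "X-API-Key": api_key,  # header-based authentication
            "Content-Type": "application/json",
        },
        # Assumed body: request a stream that Twilio can connect to.
        "body": json.dumps({"medium": {"twilio": {}}}),
    }


request = build_create_call_request("YOUR_AGENT_ID", "YOUR_API_KEY")
print(request["method"], request["url"])
```

In n8n the same values map onto the HTTP Request node’s method, URL, header, and JSON body fields, with the key pulled from a stored credential instead of being written inline.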

Once the workflow is activated, a test call demonstrates the end-to-end behavior: the agent greets the caller, asks about investment goals, probes for budget and timeline, and then offers options aligned to the caller’s stated €50,000 budget—suggesting alternatives such as real estate funds or diversified property portfolios. It also requests personal contact information for follow-up and appointment booking, showing how lead qualification can be automated without sacrificing the conversational flow.

Cornell Notes

The system automates inbound phone calls by routing Twilio audio into an Ultravox voice agent, with n8n orchestrating the handoff. A Twilio webhook triggers an n8n workflow, which calls Ultravox to create a media stream and returns the stream URL back to Twilio via TwiML. In Ultravox, the agent is configured with a business-specific system prompt, a tool (e.g., a hang up action), and RAG built from the company website. In the example, the RAG pipeline processed 182 pages into 350 chunks and 1,213 vectors, enabling the agent to answer property and investment questions grounded in the source material. The result is a scripted, real-time lead-qualifying phone conversation that can collect contact details and book next steps.

How does the call actually move from Twilio to Ultravox?

Twilio receives an inbound call and hits an n8n webhook. n8n then calls Ultravox’s “create agent call” HTTP API to generate a media stream URL. n8n returns TwiML to Twilio that includes that media stream URL (plus call identifiers like Call SID and caller number). After Twilio receives the TwiML, it streams the caller’s audio directly to Ultravox for live conversation.
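
The first hop can be sketched as follows. Twilio delivers voice webhooks as form-encoded fields, including CallSid, From, and To, so the workflow can read the call identifiers straight from the request body. The helper name parse_twilio_webhook and the sample values are illustrative; inside n8n the webhook node exposes these fields without any manual parsing.

```python
from urllib.parse import parse_qs


def parse_twilio_webhook(body: str) -> dict:
    """Extract the call identifiers from a form-encoded Twilio webhook body."""
    fields = parse_qs(body, keep_blank_values=True)

    def first(name: str):
        values = fields.get(name)
        return values[0] if values else None

    return {
        "call_sid": first("CallSid"),  # Twilio's unique ID for this call
        "caller": first("From"),       # caller's phone number
        "called": first("To"),         # the Twilio number that was dialed
    }


sample_body = "CallSid=CA123&From=%2B4915112345678&To=%2B15005550006"
print(parse_twilio_webhook(sample_body))
```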

What must be configured in Ultravox to make the agent business-ready?

Ultravox needs an agent with (1) role/personality settings (formal/professional tone, male voice, English with German accent), (2) a system prompt that defines the agent’s responsibilities (qualify leads, answer general questions, explain investment options, collect contact info, transfer to a human when asked), and (3) scripts that guide the conversation flow (ask about budget/timeline, discuss property-specific options, direct qualified leads to sales, offer appointment booking and email follow-up).

How does RAG get added, and what does “processed” mean in this setup?

A RAG source is created from the company website domain with depth set so sublinks are included. The collection then processes into text chunks and vector embeddings. In the example run, the website produced 182 pages, 350 chunks, and 1,213 vectors; the agent is then linked to that RAG collection so answers can be grounded in the website content.

What role does n8n play beyond triggering the Ultravox API call?

n8n not only triggers Ultravox via an HTTP Request node, but also builds the TwiML response using a JavaScript step. That code inserts the Ultravox media stream URL and Twilio identifiers (like Call SID and caller number) into the TwiML object. It also includes a fallback response if the Ultravox call fails, so callers hear a graceful “assistant unavailable” message.

What does the TwiML construction need to include for the integration to work?

The TwiML payload needs the join/media stream URL returned by the Ultravox “create agent call” step, plus call context such as the caller number and the Call SID. The workflow checks that the URL is not null or empty before constructing the TwiML; otherwise it returns a fallback response.
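
A minimal sketch of that logic is shown below. The workflow in the video uses a JavaScript step; this is a Python equivalent, not the author’s code. Connect, Stream, Parameter, and Say are standard TwiML elements, while the parameter names callSid and callerNumber and the wording of the fallback message are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical wording; the video only says a hard-coded apology is played.
FALLBACK_MESSAGE = "Sorry, our assistant is unavailable right now. Please try again later."


def build_twiml(join_url, call_sid=None, caller=None) -> str:
    """Return TwiML that streams the call to join_url, or a spoken fallback."""
    response = ET.Element("Response")
    if join_url and join_url.strip():
        connect = ET.SubElement(response, "Connect")
        stream = ET.SubElement(connect, "Stream", {"url": join_url.strip()})
        # Pass call context along as custom stream parameters (names assumed).
        for name, value in (("callSid", call_sid), ("callerNumber", caller)):
            if value:
                ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    else:
        # Missing or empty URL: the Ultravox call failed, so apologize instead.
        ET.SubElement(response, "Say").text = FALLBACK_MESSAGE
    declaration = '<?xml version="1.0" encoding="UTF-8"?>'
    return declaration + ET.tostring(response, encoding="unicode")


print(build_twiml("wss://example.invalid/join", "CA123", "+4915112345678"))
print(build_twiml(None))
```

Building the XML with an element tree rather than string concatenation keeps attribute values escaped correctly, which matters because join URLs often contain characters such as “&”.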

How does the agent behave during the sample call?

The agent greets the caller, asks about investment goals, and then probes for budget and time frame. When the caller mentions a €50,000 budget, the agent responds with options that fit that range (e.g., real estate funds for diversification or property portfolios for income), then requests full name, phone number, and email for follow-up and appointment booking.

Review Questions

  1. What sequence of requests and responses connects an inbound Twilio call to an Ultravox media stream?
  2. Which Ultravox settings (prompt, tools, RAG) are essential for the agent to qualify leads and answer property questions accurately?
  3. In the n8n workflow, what is the purpose of the JavaScript step that generates TwiML, and what happens when the Ultravox API call fails?

Key Points

  1. Create the Ultravox agent first with a business-specific system prompt, formal/professional voice settings, and conversation scripts for lead qualification and follow-up.
  2. Add RAG by ingesting the company website domain (including sublinks) and wait for the collection to finish processing into chunks and vectors.
  3. Use n8n as the orchestrator: a webhook receives Twilio’s inbound call event, then an HTTP Request node calls Ultravox’s “create agent call” API with the agent ID.
  4. Return TwiML to Twilio that embeds the Ultravox-generated media stream URL, along with call identifiers like Call SID and caller number, so Twilio can stream audio to Ultravox.
  5. Store the Ultravox API key in n8n and send it via header authentication; make sure the header name matches the “X-API-Key” convention exactly.
  6. Implement a fallback response in the n8n JavaScript logic so callers receive a clear message if Ultravox is unavailable.
  7. Test end-to-end with a real inbound call to confirm the agent can ask budget/timeline questions, propose investment options, and collect contact details.

Highlights

Twilio → n8n webhook → Ultravox media stream URL → TwiML back to Twilio is the core handoff that enables real-time voice conversation.
RAG grounded in the company website is built into the agent, turning 182 pages into 350 chunks and 1,213 vectors in the example run.
The n8n JavaScript step generates TwiML that includes the Ultravox join/media stream URL; without it, Twilio can’t connect to the agent.
The sample conversation shows lead qualification in action: budget and time frame questions followed by tailored investment options and contact-data capture.

Topics

  • Voice AI Agents
  • Twilio Webhooks
  • n8n Workflows
  • Ultravox API
  • RAG Knowledge Base

Mentioned

  • n8n
  • Twilio
  • Ultravox
  • Flex Funds
  • Blackthornne Academy
  • Alex, PhD AI
  • API
  • RAG