
The Gemini Interactions API

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini Interactions API is built for agent-style workflows, not just stateless prompt/response calls.

Briefing

Google’s new Gemini Interactions API reframes how developers build with Gemini models by shifting from simple, stateless “prompt in, text out” calls toward agent-ready, multimodal, tool-using workflows—while also adding optional server-side conversation state to cut token costs.

The API’s design reflects how LLM usage has changed since early completions endpoints. Back then, developers sent a prompt and received a completion with no built-in conversation memory. Chat-style APIs later introduced roles (system, user, assistant) to better manage context, still largely stateless but more structured. Function calling then forced additional API changes as models began triggering external actions. By 2025, the center of gravity has moved again: “agents” now run multi-step loops, call tools, execute code, and often coordinate multiple model calls—meaning the old conversational scaffolding is no longer enough.

The Gemini Interactions API targets that agent reality directly. One of its first upgrades is optional server-side history. Instead of resending the entire conversation every time, developers can reference a prior interaction ID to persist context on Google's side. That persistence enables implicit caching for long prompts, improving token efficiency and reducing cost during multi-turn interactions. The transcript demonstrates this with a multi-turn exchange where the model remembers a user's name across turns, and then shows token usage rising more slowly once prompts exceed roughly a thousand tokens—consistent with backend caching behavior.
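
The flow looks roughly like the sketch below, shown as plain HTTP with requests. The endpoint path, the previous_interaction_id field, and the response shape are illustrative assumptions rather than the documented wire format; consult the official docs for the real names.

```python
import os
import requests

API_KEY = os.environ["GEMINI_API_KEY"]
# Hypothetical endpoint path; the real path may differ.
URL = "https://generativelanguage.googleapis.com/v1beta/interactions"

def interact(text, previous_id=None):
    """Send one turn; pass a prior interaction ID instead of full history."""
    body = {"model": "gemini-2.5-flash", "input": text}  # illustrative model name
    if previous_id:
        # Assumed field name: referencing the previous interaction lets the
        # server restore context (and cache long prompts) on its side.
        body["previous_interaction_id"] = previous_id
    resp = requests.post(URL, params={"key": API_KEY}, json=body)
    resp.raise_for_status()
    return resp.json()

first = interact("Hi, my name is Sam.")
second = interact("What is my name?", previous_id=first["id"])  # "id" field assumed
print(second)
```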

The API also expands beyond “model calls” into “agent calls.” Developers can invoke Gemini research agents—specifically the Gemini deep research pro preview agent (December 2025). The agent runs server-side, supports background execution for long-running tasks, and returns an interaction ID immediately. Clients can poll for completion or failure and then fetch the final output once the agent finishes. In the example, the deep research agent produces a structured markdown history of Jax and includes citations, illustrating the agent’s ability to do multi-step research work.
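
A sketch of that start/poll/fetch loop follows; the agent identifier, background flag, and status values are assumptions for illustration.

```python
import os
import time
import requests

API_KEY = os.environ["GEMINI_API_KEY"]
URL = "https://generativelanguage.googleapis.com/v1beta/interactions"  # assumed path

# Start a long-running agent call; it returns immediately with an ID.
start = requests.post(URL, params={"key": API_KEY}, json={
    "agent": "deep-research-pro-preview",   # illustrative agent identifier
    "input": "Write a structured history of Jax with citations.",
    "background": True,                     # assumed flag for background execution
})
start.raise_for_status()
interaction_id = start.json()["id"]         # assumed response field

# Poll until the server reports a terminal state, then fetch the output.
while True:
    status = requests.get(f"{URL}/{interaction_id}", params={"key": API_KEY}).json()
    if status.get("status") in ("completed", "failed"):  # assumed status values
        break
    time.sleep(15)

if status["status"] == "completed":
    print(status.get("output"))             # assumed output field
```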

Multimodality is treated as both input and output. On the input side, the API accepts images, audio, video, and PDFs either via file handling or by base64-encoding content. On the output side, developers can request generated artifacts such as images (and similarly audio/video) by specifying response modalities. The transcript highlights an image-generation example using the Nano Banana Pro model.
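
Both directions might look like the following; field names such as response_modalities and the inline-image shape are assumptions, and only the base64 encoding step is standard.

```python
import base64
import os
import requests

API_KEY = os.environ["GEMINI_API_KEY"]
URL = "https://generativelanguage.googleapis.com/v1beta/interactions"  # assumed path

# Input side: inline an image as base64 alongside a text prompt.
with open("chart.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

understand = {
    "model": "gemini-2.5-flash",            # illustrative model name
    "input": [
        {"type": "text", "text": "Summarize this chart."},
        {"type": "image", "mime_type": "image/png", "data": img_b64},
    ],
}

# Output side: ask for a generated image by declaring response modalities.
generate = {
    "model": "nano-banana-pro",             # illustrative name from the video
    "input": "A banana wearing a lab coat",
    "response_modalities": ["IMAGE"],       # assumed field name
}

for body in (understand, generate):
    resp = requests.post(URL, params={"key": API_KEY}, json=body)
    resp.raise_for_status()
```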

For structured outputs, the API leans into schema-first responses using Pydantic model definitions and JSON schema validation, making it easier to reliably parse model outputs into typed data structures.
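
A minimal sketch of the pattern using real Pydantic calls; only the step of handing the schema to the API varies by SDK and is omitted here.

```python
from pydantic import BaseModel

class CategoryScore(BaseModel):
    category: str
    score: float

class ModerationResult(BaseModel):
    flagged: bool
    categories: list[CategoryScore]   # nested model, as in the transcript's example

# The JSON schema derived from the model is what you hand to the API
# as the response schema (the exact request field name varies by SDK).
schema = ModerationResult.model_json_schema()

# Validating the raw JSON response into typed objects replaces fragile
# free-text parsing; a malformed shape raises a ValidationError instead.
raw = '{"flagged": true, "categories": [{"category": "spam", "score": 0.91}]}'
result = ModerationResult.model_validate_json(raw)
print(result.flagged, result.categories[0].score)
```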

Finally, the Interactions API unifies tool use. It supports standard function calling patterns, built-in tools like Google Search, code execution, and URL context retrieval, plus remote MCP servers for integrating external services (e.g., a weather MCP). Background execution, server-side state, and tool orchestration together position the API as a foundation for more complex agent systems—such as “computer use” agents—built to run in sandboxes and handle multi-step tasks without keeping a client connection open.
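
A function-calling sketch of that pattern: the tool declaration mirrors common JSON-schema conventions, but the envelope field names here are illustrative, not the documented ones.

```python
import json

# A local function the client executes when the model requests it.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}   # stub implementation

# JSON-schema tool declaration sent along with the request.
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# Dispatch sketch: when a response contains a tool call, run it locally
# and send the serialized result back as the next turn's input.
def handle_tool_call(call: dict) -> str:
    if call["name"] == "get_weather":
        return json.dumps(get_weather(**call["arguments"]))
    raise ValueError(f"unknown tool: {call['name']}")
```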

Cornell Notes

Gemini Interactions API updates Gemini development from stateless chat calls toward agent-ready workflows. It adds optional server-side conversation history via interaction IDs, enabling implicit caching that can reduce token costs in long, multi-turn sessions. The API also supports calling agents (including the Gemini deep research pro preview agent) with background execution and polling for completion. Multimodal input/output is built in (images, audio, PDFs, and generated images), and structured outputs are supported through schema validation using Pydantic. Tool use is unified through function calling, built-in tools (search, code execution, URL context), and remote MCP servers.

What problem does server-side conversation history solve, and how does it affect token usage?

Instead of resending the full conversation each request, developers can pass an interaction ID from a previous call so the backend persists context. That persistence enables implicit caching: when prompts grow large (the transcript notes behavior once prompts exceed roughly a thousand tokens), token consumption rises more slowly because cached tokens can be reused. The example shows a multi-turn chat where the model remembers the user’s name across turns without re-supplying all prior text.

How does the API support long-running agent tasks without holding a client connection open?

Agent calls can run with background execution. The client receives an interaction ID immediately, then polls for status (completed/failed/in progress) at intervals. Once finished, the client fetches the final result. The deep research pro preview agent example demonstrates this by producing a Jax history after several polling cycles.

What’s different about multimodal handling in the interactions API?

Multimodal support works for both understanding and generation. For inputs, developers can send images/audio/PDFs either through file APIs or by base64-encoding the content. For outputs, developers specify response modalities (e.g., request an image output), and the API returns generated artifacts that can be saved and displayed. The transcript includes an image generation example using the Nano Banana Pro model.

How do structured outputs work, and why does schema validation matter?

Structured outputs are driven by defining typed models (using Pydantic) that map to a JSON schema. The API returns data that can be validated against that schema, making downstream parsing more reliable than extracting free-form text. The transcript describes nesting models (e.g., a moderation result containing other fields) and then validating the JSON into the defined classes.

What tool-related capabilities does the API provide beyond basic function calling?

Beyond standard tool/function calling, it includes built-in tools such as Google Search (returning URLs and citations), code execution (running generated Python server-side), and URL context retrieval (fetching content from a provided URL, with some sites potentially blocked). It also supports remote MCP servers, where the developer provides an MCP server endpoint (e.g., a weather service) and the model can call it as a tool, with tool call inputs/outputs visible in interaction outputs.
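
Wiring in a remote MCP server might look like the request body below; the mcp_server tool type and url field are assumed names, and the endpoint is hypothetical.

```python
# Pointing the model at a remote MCP server: the server advertises its
# tools, the model invokes them like ordinary functions, and the call
# inputs/outputs surface in the interaction's output for inspection.
body = {
    "model": "gemini-2.5-flash",                  # illustrative model name
    "input": "What's the weather in London right now?",
    "tools": [{
        "type": "mcp_server",                     # assumed tool type
        "url": "https://weather.example.com/mcp", # hypothetical MCP endpoint
    }],
}
```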

What limitation appears with citations/links from the deep research agent?

Citations are provided, but the transcript notes that the raw clickable URLs aren’t reliably preserved for reuse across sessions. Instead, citations may resolve through Vertex AI search-style redirection URLs, which can break or become non-clickable when exported (e.g., into a PDF). A workaround using request-based URL conversion is mentioned, though it may raise terms-of-service concerns.
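
That workaround amounts to following the redirect chain, for example with requests; as the transcript cautions, automating this may raise terms-of-service concerns.

```python
import requests

def resolve_citation(redirect_url: str) -> str:
    """Follow the redirect chain to recover the final destination URL.

    Treat this as a last-resort cleanup step for exported reports, given
    the terms-of-service caveat noted above."""
    resp = requests.get(redirect_url, allow_redirects=True, timeout=10)
    return resp.url
```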

Review Questions

  1. How does referencing an interaction ID change the way developers manage conversation context and cost compared with resending full chat history?
  2. Describe the background execution flow for an agent call, including what the client receives immediately and how results are retrieved.
  3. What combination of features (multimodal I/O, structured outputs, and tool calling) would you use to build a reliable research-and-reporting agent?

Key Points

  1. Gemini Interactions API is built for agent-style workflows, not just stateless prompt/response calls.

  2. Optional server-side history via interaction IDs enables multi-turn memory and can trigger implicit caching for long prompts.

  3. Agents can run in the background, returning an interaction ID first and requiring polling to fetch results later.

  4. Multimodal support includes both understanding (image/audio/PDF inputs) and generation (requesting image/audio/video outputs via response modalities).

  5. Structured outputs are schema-driven using Pydantic and JSON schema validation for more dependable parsing.

  6. Tool orchestration is unified through function calling, built-in tools (search, code execution, URL context), and remote MCP servers.

  7. Deep research agent citations may not preserve raw clickable URLs for export, which can complicate reporting workflows.

Highlights

Server-side interaction history lets developers persist conversation context without resending everything, and the transcript ties token savings to implicit caching once prompts get large.
Background execution turns long research or code-heavy agent runs into asynchronous jobs: start, poll, then retrieve the final output.
Multimodal generation is requestable: specifying response modalities can produce generated images (e.g., with Nano Banana Pro) as part of the interaction output.
Remote MCP servers extend Gemini tools beyond built-ins, letting the model call external services like a weather endpoint through a standardized interface.
Schema-first structured outputs using Pydantic shift reliability from “read the text” to “validate the JSON.”
