The Gemini Interactions API
Based on Sam Witteveen's video on YouTube. If you enjoy this content, support the original creator by watching, liking, and subscribing.
Gemini Interactions API is built for agent-style workflows, not just stateless prompt/response calls.
Briefing
Google’s new Gemini Interactions API reframes how developers build with Gemini models, shifting from simple, stateless “prompt in, text out” calls toward agent-ready, multimodal, tool-using workflows, while also adding optional server-side conversation state to cut token costs.
The API’s design reflects how LLM usage has changed since the early completions endpoints. Back then, developers sent a prompt and received a completion with no built-in conversation memory. Chat-style APIs later introduced roles (system, user, assistant) to better manage context; they were still largely stateless but more structured. Function calling then forced additional API changes as models began triggering external actions. By 2025, the center of gravity has moved again: “agents” now run multi-step loops, call tools, execute code, and often coordinate multiple model calls, which means the old conversational scaffolding is no longer enough.
The Gemini Interactions API targets that agent reality directly. One of its first upgrades is optional server-side history. Instead of resending the entire conversation every time, developers can reference a prior interaction ID to persist context on Google’s side. That persistence enables implicit caching for long prompts, improving token efficiency and reducing cost during multi-turn interactions. The transcript demonstrates this with a multi-turn exchange where the model remembers a user’s name across turns, and then shows token usage rising more slowly once prompts exceed roughly a thousand tokens, consistent with backend caching behavior.
The API also expands beyond “model calls” into “agent calls.” Developers can invoke Gemini research agents—specifically the Gemini deep research pro preview agent (December 2025). The agent runs server-side, supports background execution for long-running tasks, and returns an interaction ID immediately. Clients can poll for completion or failure and then fetch the final output once the agent finishes. In the example, the deep research agent produces a structured markdown history of JAX and includes citations, illustrating the agent’s ability to do multi-step research work.
Multimodality is treated as both input and output. On the input side, the API accepts images, audio, video, and PDFs either via file handling or by base64-encoding content. On the output side, developers can request generated artifacts such as images (and similarly audio/video) by specifying response modalities. The transcript highlights an image-generation example using the Nano Banana Pro model.
For structured outputs, the API leans into schema-first responses using Pydantic model definitions and JSON schema validation, making it easier to reliably parse model outputs into typed data structures.
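A minimal sketch of that schema-first pattern, assuming Pydantic v2: the model class defines the expected shape, and the JSON text a schema-constrained call would return is validated into typed objects. The field names (`topic`, `key_points`, `citations`) are invented for this example, not taken from the video.

```python
import json
from pydantic import BaseModel

# Illustrative schema for a research-style structured response.
class Citation(BaseModel):
    title: str
    url: str

class ResearchSummary(BaseModel):
    topic: str
    key_points: list[str]
    citations: list[Citation]

# Stand-in for the JSON text a schema-constrained model call would return.
raw = json.dumps({
    "topic": "JAX",
    "key_points": ["autodiff", "XLA compilation"],
    "citations": [{"title": "JAX docs", "url": "https://jax.readthedocs.io"}],
})

# Validation parses the text into typed data or raises a clear error.
summary = ResearchSummary.model_validate_json(raw)
print(summary.topic)            # JAX
print(len(summary.key_points))  # 2
```

The payoff is in the failure mode: a missing or mistyped field raises a validation error at the boundary instead of surfacing as a `KeyError` deep in downstream code.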
Finally, the interactions API unifies tool use. It supports standard function calling patterns, built-in tools like Google Search, code execution, and URL context retrieval, plus remote MCP servers for integrating external services (e.g., a weather MCP). Background execution, server-side state, and tool orchestration together position the API as a foundation for more complex agent systems—such as “computer use” agents—built to run in sandboxes and handle multi-step tasks without keeping a client connection open.
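The client side of standard function calling reduces to a small dispatch loop, sketched below. The weather tool, the `function_call` / `function_response` dict shapes, and the field names are all illustrative stand-ins, not the exact wire format.

```python
# Sketch of function-calling dispatch: the model emits a structured call
# (name + args), the client runs a local function, and the result is sent
# back as the next turn. The weather tool is a made-up example.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}  # stand-in for a real lookup or MCP call

TOOLS = {"get_weather": get_weather}

# What a model response containing a function call might look like.
model_turn = {"function_call": {"name": "get_weather", "args": {"city": "London"}}}

call = model_turn["function_call"]
result = TOOLS[call["name"]](**call["args"])

# The client wraps the result as a function-response part for the next request.
function_response = {"function_response": {"name": call["name"], "response": result}}
print(function_response["function_response"]["response"]["temp_c"])  # 21
```

Built-in tools and remote MCP servers slot into the same loop; the difference is that their execution happens server-side rather than in the client's dispatch table.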
Cornell Notes
Gemini Interactions API updates Gemini development from stateless chat calls toward agent-ready workflows. It adds optional server-side conversation history via interaction IDs, enabling implicit caching that can reduce token costs in long, multi-turn sessions. The API also supports calling agents (including the Gemini deep research pro preview agent) with background execution and polling for completion. Multimodal input/output is built in (images, audio, PDFs, and generated images), and structured outputs are supported through schema validation using Pydantic. Tool use is unified through function calling, built-in tools (search, code execution, URL context), and remote MCP servers.
- What problem does server-side conversation history solve, and how does it affect token usage?
- How does the API support long-running agent tasks without holding a client connection open?
- What’s different about multimodal handling in the interactions API?
- How do structured outputs work, and why does schema validation matter?
- What tool-related capabilities does the API provide beyond basic function calling?
- What limitation appears with citations/links from the deep research agent?
Review Questions
- How does referencing an interaction ID change the way developers manage conversation context and cost compared with resending full chat history?
- Describe the background execution flow for an agent call, including what the client receives immediately and how results are retrieved.
- What combination of features (multimodal I/O, structured outputs, and tool calling) would you use to build a reliable research-and-reporting agent?
Key Points
1. Gemini Interactions API is built for agent-style workflows, not just stateless prompt/response calls.
2. Optional server-side history via interaction IDs enables multi-turn memory and can trigger implicit caching for long prompts.
3. Agents can run in the background, returning an interaction ID first and requiring polling to fetch results later.
4. Multimodal support includes both understanding (image/audio/PDF inputs) and generation (requesting image/audio/video outputs via response modalities).
5. Structured outputs are schema-driven using Pydantic and JSON schema validation for more dependable parsing.
6. Tool orchestration is unified through function calling, built-in tools (search, code execution, URL context), and remote MCP servers.
7. Deep research agent citations may not preserve raw clickable URLs for export, which can complicate reporting workflows.