
Gemini 2.0 - How to use the Live Bidirectional API

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Start in AI Studio by selecting “stream real time” and choosing the Gemini 2.0 Flash model for the live bidirectional experience.

Briefing

Gemini 2.0’s Live Bidirectional API is built for real-time, two-way multimodal interaction—letting users talk back and forth with voice, stream audio/video input, and receive spoken or text responses instantly. The practical payoff is a conversational assistant that can respond while simultaneously processing what’s happening in front of the user, whether that’s a screen, an app workflow, or a live camera feed.
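To make the “two-way” part concrete, the sketch below opens a live session from Python and exchanges a single text turn. This is a minimal illustration only: it assumes the google-genai SDK and the experimental "gemini-2.0-flash-exp" model name from around the Gemini 2.0 launch, and send/receive method names have shifted across SDK releases.

```python
# Minimal sketch: open a Live API session and exchange one text turn.
# Assumes `pip install google-genai`; the model name and the send/receive
# signatures reflect the SDK around the Gemini 2.0 launch and may have changed.
import asyncio
from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")   # replace with a real key
MODEL = "gemini-2.0-flash-exp"
CONFIG = {"response_modalities": ["TEXT"]}        # or ["AUDIO"] for spoken replies

async def main() -> None:
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send(
            input="Hello! What can you do in real time?", end_of_turn=True
        )
        async for response in session.receive():   # stream the model's turn back
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```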

Getting started begins in AI Studio by selecting the “stream real time” option and choosing one of three interaction modes. The first mode is straightforward voice or text conversation with the model. Users can switch to the Gemini 2.0 Flash model, pick an output format (audio or text), and select a voice. They can also add tools and, crucially, write custom system instructions to steer behavior. In the demo, a system prompt turns the assistant into a Westworld-style character (“Sarah Winters”), and the conversation follows that persona: it answers questions about recent guest interactions, describes maintenance requirements on a schedule, and agrees to power down when asked. The key point is that system instructions meaningfully shape the assistant’s tone and role, making the experience more than a generic chat.
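As a rough illustration of that configuration step, the dictionary below mirrors the AI Studio choices described above: audio output, a prebuilt voice, and a system instruction that sets the persona. Field names follow the google-genai Live API config as documented around launch and may differ in newer releases; the voice name and persona text are placeholders rather than values from the video.

```python
# Hedged sketch of a live-session config: audio output, a chosen prebuilt voice,
# and a custom system instruction that sets the assistant's persona.
# Field names may differ in newer SDK versions; the persona wording is illustrative.
PERSONA = (
    "You are Sarah Winters, a Westworld-style host. Stay in character and "
    "answer questions about guest interactions and your maintenance schedule."
)

live_config = {
    "response_modalities": ["AUDIO"],   # switch to ["TEXT"] for typed replies
    "speech_config": {
        "voice_config": {"prebuilt_voice_config": {"voice_name": "Aoede"}}
    },
    "system_instruction": {"parts": [{"text": PERSONA}]},
}
# Pass `live_config` as the `config` argument of client.aio.live.connect(...).
```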

The second mode demonstrates screen understanding. With the model connected to a chosen window, the assistant can identify what’s on-screen (the demo recognizes Figma), then guide a user through unfamiliar tasks using keyboard shortcuts and interaction tips. When the user asks how to zoom and how to drag/pan, the assistant provides concrete steps, like Command and plus (Mac) or Control and plus (Windows) for zooming in, and holding the space bar to temporarily switch to the hand tool for panning. It then helps improve a specific design card by asking clarifying questions about the “April 1st” element and suggesting the card’s content might need adjustment depending on whether it’s an event, blog post, or other use case. This mode highlights a hands-on workflow: the model can interpret UI context, answer “how do I do this?” questions, and iterate on design decisions.
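A rough sketch of how this screen mode could be driven from code is shown below: capture the shared window as a JPEG frame and send it into the live session alongside a question. The capture libraries (mss, Pillow) and the shape of the send call are assumptions based on the public Live API examples, not steps taken in the video.

```python
# Hedged sketch: capture the screen as a JPEG and ask the live session about it.
# mss/Pillow are illustrative choices; `session` comes from
# client.aio.live.connect(...) as in the earlier sketch.
import io
import mss
from PIL import Image

def grab_screen_jpeg() -> bytes:
    with mss.mss() as screen:
        shot = screen.grab(screen.monitors[1])             # primary monitor
        img = Image.frombytes("RGB", shot.size, shot.rgb)
        img.thumbnail((1024, 1024))                        # keep frames small
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
        return buf.getvalue()

async def ask_about_screen(session) -> None:
    # One frame plus one question; a real app would stream frames continuously.
    await session.send(input={"data": grab_screen_jpeg(), "mime_type": "image/jpeg"})
    await session.send(input="How do I zoom and pan in this app?", end_of_turn=True)
    async for response in session.receive():
        if response.text:
            print(response.text, end="")
```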

The third mode brings live video into the loop. With camera access enabled, the assistant describes the scene in detail, identifies objects (including a Funko pop figurine of Bernard Lowe from Westworld), and answers follow-up questions about what it sees—such as interpreting the number of fingers held up. The interaction stays conversational while the model continues to observe, effectively combining real-time dialogue with ongoing visual description.
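A comparable sketch for the camera mode is below, using OpenCV to read webcam frames and stream them into the same kind of session while the conversation continues. The frame rate, resolution handling, and send call shape are illustrative assumptions, not details from the video.

```python
# Hedged sketch: stream webcam frames into a live session at ~1 frame per second.
# OpenCV is an illustrative choice; `session` is the live session from earlier.
import asyncio
import cv2

async def stream_camera(session, seconds: float = 10.0, fps: float = 1.0) -> None:
    cap = cv2.VideoCapture(0)                       # default camera
    try:
        for _ in range(int(seconds * fps)):
            ok, frame = cap.read()
            if not ok:
                break
            ok, jpeg = cv2.imencode(".jpg", frame)  # encode the frame as JPEG
            if ok:
                await session.send(
                    input={"data": jpeg.tobytes(), "mime_type": "image/jpeg"}
                )
            await asyncio.sleep(1.0 / fps)          # pace the frames
    finally:
        cap.release()
```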

Beyond the live multimodal modes, the setup supports standard prompting and optional tool features like structured outputs, code execution, and function calling. The transcript also points to a unified SDK and a Gemini 2 cookbook with examples ranging from basic text prompting to the live bidirectional multimodal API. The overall message: developers can prototype quickly in AI Studio, then move the same capabilities into apps using the SDK, with Gemini 2.0 Flash as the recommended starting model for experimentation.
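As one hedged example of wiring those tools in, the config below attaches built-in code execution plus a single function declaration to the same kind of live connection. The get_room_temperature function is hypothetical and only shows the shape of a declaration; the field names follow the google-genai SDK and may change between releases.

```python
# Hedged sketch: attach optional tools (code execution + one function declaration)
# to the live session config. `get_room_temperature` is a hypothetical function
# used only to illustrate the declaration format.
TOOLS = [
    {"code_execution": {}},
    {
        "function_declarations": [
            {
                "name": "get_room_temperature",
                "description": "Read the current temperature of a named room.",
                "parameters": {
                    "type": "OBJECT",
                    "properties": {"room": {"type": "STRING"}},
                    "required": ["room"],
                },
            }
        ]
    },
]

live_config = {"response_modalities": ["TEXT"], "tools": TOOLS}
# When the model decides to call the function, a tool-call message appears in the
# response stream; your code runs the function and sends the result back.
```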

Cornell Notes

Gemini 2.0’s Live Bidirectional API enables real-time back-and-forth interaction that can include voice, text, and multimodal inputs like screens and live camera feeds. Users start in AI Studio’s “stream real time,” choose the Gemini 2.0 Flash model, set the output format (audio or text), and optionally add system instructions to control persona and behavior. In the screen mode, the model can recognize apps like Figma and give step-by-step help using keyboard shortcuts (e.g., zoom with Command or Control plus the plus/minus keys, and pan by holding the space bar). In live video mode, it can describe what it sees and answer questions about objects and actions while the conversation continues. This makes it useful for guided troubleshooting, tutoring, and interactive assistance in real-world workflows.

What makes Gemini 2.0’s Live Bidirectional API different from a typical chat interface?

It supports real-time, two-way multimodal streaming: voice can go back and forth instantly, and the system can accept voice and/or video (and also screen content) while returning spoken or text responses in real time. That means the assistant can keep a live conversation while simultaneously processing what the user is showing—like a window on a desktop or a camera view.
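A minimal sketch of that receive side is below: while the user keeps sending turns, a single loop drains whatever the model streams back, whether text or raw audio bytes. The attribute names (response.text, response.data) and the 24 kHz PCM audio format follow the public Live API documentation and are assumptions here, not details from the video.

```python
# Hedged sketch of the receive loop: handle either text or raw audio chunks.
# `session` is a live session as in the earlier sketches; `audio_out` is any
# writable sink (e.g. a wave file opened for 16-bit, 24 kHz mono PCM).
async def receive_turn(session, audio_out) -> None:
    async for response in session.receive():
        if response.text:             # text responses stream in as deltas
            print(response.text, end="")
        if response.data:             # audio responses arrive as raw PCM bytes
            audio_out.write(response.data)
```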

How do system instructions change the assistant’s behavior in the live conversation mode?

System instructions override the default behavior and steer the assistant into a specific role. In the demo, adding a Westworld character prompt makes the assistant respond as “Sarah Winters,” then follow through with role-consistent answers—describing guest interactions, maintenance schedules (e.g., diagnostic scans every 72 hours, weekly behavioral calibration, monthly narrative updates, quarterly core system optimization), and agreeing to power down when requested.

How does the screen-based example use multimodal understanding to help with an unfamiliar app?

After the user shares a window, the assistant identifies the app as Figma and then answers practical UI questions. It provides key commands for zooming (Command and plus on Mac / Control and plus on Windows to zoom in; Command and minus / Control and minus to zoom out) and explains how to pan by holding the space bar to temporarily switch to the hand tool. It then helps improve a specific card by asking what the “April 1st” element represents and what content should appear there.

What kinds of questions work well in live video mode?

Questions that depend on visual context. The assistant describes the scene (clothing, bookshelf, decorative items), identifies a Funko pop figurine of Bernard Lowe from Westworld, and answers action-based queries like how many fingers the user is holding up—first five, then ten—while maintaining conversational flow.

What additional capabilities are available beyond multimodal streaming?

The transcript notes tool options such as structured outputs, code execution, and function calling. It also emphasizes that the same capabilities demonstrated in AI Studio can be implemented in apps via a unified SDK, with a Gemini 2 cookbook offering examples from basic prompting to the live bidirectional multimodal API.
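For the structured-outputs option specifically, the sketch below uses the unified SDK’s standard (non-live) prompting path with a response schema. The Pydantic model and the Figma-themed prompt are illustrative; the response_mime_type / response_schema config and the response.parsed accessor follow the google-genai SDK documentation, and the experimental model name may have changed since.

```python
# Hedged sketch: structured output via the unified SDK's standard prompting path.
# The ShortcutTip model and the prompt are illustrative, not from the video.
from pydantic import BaseModel
from google import genai

class ShortcutTip(BaseModel):
    action: str            # e.g. "zoom in"
    mac_shortcut: str
    windows_shortcut: str

client = genai.Client(api_key="GEMINI_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="List the basic zoom and pan shortcuts in Figma.",
    config={
        "response_mime_type": "application/json",
        "response_schema": list[ShortcutTip],   # parse into typed objects
    },
)
print(response.parsed)      # a list of ShortcutTip instances
```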

Review Questions

  1. In what three interaction modes does the setup allow users to communicate with Gemini 2.0, and what input/output types are involved in each?
  2. How do system instructions influence the assistant’s responses in the Westworld persona example?
  3. Describe two concrete UI actions the assistant helps with in the Figma screen demo, including the specific keyboard shortcuts mentioned.

Key Points

  1. Start in AI Studio by selecting “stream real time” and choosing the Gemini 2.0 Flash model for the live bidirectional experience.
  2. Configure the output format (audio or text) and select a voice when using real-time voice interaction.
  3. Use system instructions to control persona and behavior; the demo shows a Westworld-style character responding consistently to role-based questions.
  4. In screen mode, the model can identify apps like Figma and provide actionable guidance using keyboard shortcuts and interaction techniques (e.g., holding the space bar for hand-tool panning).
  5. In live video mode, the model can describe the scene and answer questions about objects and actions while the conversation continues.
  6. Plan to move from AI Studio prototypes to app development using the unified SDK and the Gemini 2 cookbook examples, including the live bidirectional multimodal API.

Highlights

Gemini 2.0’s Live Bidirectional API supports real-time two-way multimodal interaction, combining back-and-forth voice conversation with simultaneous processing of streamed screen or camera input.
System instructions can transform the assistant into a specific character persona, with responses staying consistent across multiple turns.
The Figma demo turns visual context into step-by-step help, including exact keyboard shortcuts for zooming and panning.
Live video mode pairs ongoing visual description with conversational Q&A, including identifying a Funko pop figurine and counting fingers.
A unified SDK and Gemini 2 cookbook provide a path from interactive testing to building the same capabilities into apps.

Topics

  • Live Bidirectional API
  • Multimodal Streaming
  • AI Studio Setup
  • Screen Assistance
  • Live Video Conversation
