
Gemini Embedding 2 - Audio, Text, Images, Docs, Videos

Sam Witteveen · 6 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini Embedding 2 is presented as a single natively multimodal embedding model that embeds text, images, audio, PDFs, and videos (up to 2 minutes) into one shared vector space.

Briefing

Gemini Embedding 2 is positioned as a single, natively multimodal embedding model that collapses what used to require many separate pipelines—one model per modality, multiple vector stores, and extra fusion or reranking logic. Instead of converting audio to text or images to captions and then searching across separate indexes, Gemini Embedding 2 can embed text, images, up to 2-minute videos, audio files, and PDFs directly into a shared vector space. The practical payoff is straightforward: one API call, one index, and one query can retrieve semantically similar content across modalities—text-to-image, image-to-video, audio-to-video, and more—without stitching together five different systems.

At the core is the embedding concept: the model turns any supported input—whether a sentence, a photo, a clip, a recording, or a PDF—into a high-dimensional vector that captures semantic meaning. Similar vectors land close together, so similarity search becomes a cross-modal retrieval mechanism. The transcript emphasizes that these representations live in thousands of dimensions (not just a small 3D intuition), enabling fine-grained matching such as recognizing not only “cats,” but also visual attributes like black-and-white patterns and specific facial characteristics.
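The similarity search described above reduces to comparing vectors, most commonly with cosine similarity. The sketch below uses tiny toy vectors as stand-ins for the model's real 3,072-dimensional output; the vector values are illustrative, not actual model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model output.
cat_photo   = [0.9, 0.1, 0.0, 0.1]   # hypothetical image embedding
cat_text    = [0.8, 0.2, 0.1, 0.0]   # hypothetical text embedding
stock_chart = [0.0, 0.1, 0.9, 0.4]   # hypothetical unrelated embedding

print(cosine_similarity(cat_photo, cat_text))     # high: related meaning
print(cosine_similarity(cat_photo, stock_chart))  # low: unrelated
```

Because all modalities share one space, the same comparison works whether the two vectors came from a sentence, a photo, or a clip.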

The model’s multimodal capability goes beyond “choose one modality.” It can accept multiple modalities in a single request—for example, combining an image and a text description to produce one embedding representing their joint meaning. That enables workflows like searching for watch videos using an embedding derived from “a watch description plus a watch-band image,” even when the exact target image isn’t available. The demo described in the transcript illustrates this behavior: image queries return both matching images and relevant videos, and speech queries return images and videos containing the referenced subject.

For long-form content, the transcript notes a practical strategy: chunking. While the model accepts videos up to 2 minutes, longer videos (hours) can be split into smaller segments (e.g., 15–30 seconds) so text queries can retrieve more precise timestamps. A similar idea applies to building searchable university course libraries: encode lesson video, audio, and slide PDFs, then ask questions like which lessons covered a specific topic.
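The chunking strategy above can be sketched as a small helper that splits a long video into timestamped spans, each under the model's per-input limit. The function name and defaults are illustrative, not from the video.

```python
def chunk_spans(total_seconds, chunk_seconds=30, max_chunk=120):
    """Split a long video into (start, end) second spans that each fit
    the model's 2-minute per-input limit. Each span would be embedded
    separately, so a text query can point at a precise timestamp."""
    assert chunk_seconds <= max_chunk
    spans = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans

# A 95-second clip in 30-second chunks:
print(chunk_spans(95))  # [(0, 30), (30, 60), (60, 90), (90, 95)]
```

For an hours-long lecture, the same call with 15-second chunks yields finer-grained retrieval at the cost of more vectors to store.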

The transcript also details constraints and implementation choices. Text inputs support up to 8,000 tokens, and the model can take up to six images per request. For PDFs, embeddings can be generated either via a files API or directly from bytes using the correct MIME type (application/pdf). When multiple pieces of content are involved—such as a social post with text plus an image—there’s a choice between producing one aggregated embedding for the whole post or generating separate embeddings per part.

On performance, Google’s published benchmarks are referenced without a deep walkthrough: Gemini Embedding 2 is said to improve text-to-text similarity versus gemini-embedding-001 and to outperform other multimodal baselines for image-to-text and text-to-image retrieval. A key technical feature is Matryoshka representation learning, which allows smaller embedding sizes (e.g., half or a quarter of the full 3,072 dimensions) when full dimensionality isn’t needed—trading some fine-grained semantics for lower storage and faster search.
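Matryoshka-trained embeddings can be shortened by keeping only a leading prefix of dimensions. The sketch below shows the client-side version of that idea on a toy vector; re-normalizing after truncation is the usual step so cosine similarity remains meaningful (the helper name and vector values are illustrative).

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style reduction: keep the first `dim` components,
    then re-normalize to unit length so cosine similarity still works."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim vector standing in for a full 3,072-dim embedding.
full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
half = truncate_embedding(full, 4)
print(half)  # a unit-length 4-dimensional vector
```

Storing the 4-dim (here) or 1,536-dim (in practice) prefix halves index size and speeds up lookups, at the cost of some semantic resolution.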

Finally, the transcript highlights ecosystem readiness: Gemini Embedding 2 is available through the Gemini API (in AI Studio) and Vertex AI, with day-zero support mentioned for agentic frameworks like LangChain and LlamaIndex and vector databases such as Chroma and Qdrant. A notebook walkthrough reinforces the mechanics—using the Google Gen AI SDK, embedding content via client.models, and computing similarity scores across modalities—while the overall message stays consistent: multimodal search becomes simpler, faster to maintain, and more unified when everything lands in one embedding space.

Cornell Notes

Gemini Embedding 2 is described as a single multimodal embedding model that can embed text, images, audio, PDFs, and videos (up to 2 minutes) directly into one shared vector space. That design replaces earlier setups that required separate embedding models, multiple vector stores, and extra fusion/reranking to search across modalities. The transcript explains embeddings as high-dimensional semantic “addresses,” where similarity search retrieves content with related meaning—so a text query can return images, videos, audio, or PDFs, and an image or audio query can return text and other media. It also supports multi-part inputs (e.g., image + text in one request) and offers Matryoshka representation learning to output smaller embeddings for speed and storage tradeoffs. The practical result is one index and one query for cross-modal retrieval.

Why does a single multimodal embedding space matter for search systems?

Earlier multimodal search pipelines typically used separate embedding models per modality (text, image, audio) and separate vector stores, then relied on fusion or reranking to combine results. Gemini Embedding 2 aims to collapse that into one shared vector space where text, images, audio, video, and PDFs land together. That means one index can support cross-modal retrieval—e.g., a text query can retrieve semantically similar images, videos, audio, and documents—without converting audio to text or images to captions first.
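The "one index" idea above can be sketched as a single in-memory list holding vectors from every modality side by side, queried with one similarity pass. The item names and vector values are toy stand-ins, not real model output.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# One index for every modality; vectors are toy 3-dim stand-ins.
index = [
    ("cat photo (image)", [0.9, 0.1, 0.0]),
    ("cat clip (video)",  [0.8, 0.2, 0.1]),
    ("meow.wav (audio)",  [0.7, 0.3, 0.0]),
    ("tax form (pdf)",    [0.0, 0.1, 0.9]),
]

query = [0.85, 0.15, 0.05]  # e.g., the text "a small cat", embedded
ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
print([name for name, _ in ranked[:3]])  # the three cat-related items
```

Contrast this with the older design: three separate indexes, one per modality, plus fusion logic to merge three ranked lists into one.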

How do embeddings turn different media into something searchable?

The transcript frames embeddings as vectors in a high-dimensional space (thousands of dimensions). The model converts each input—sentence, photo, audio clip, video, or PDF—into a list of numbers that encode semantic information. Similar content produces vectors that sit near each other, so similarity lookup becomes the retrieval mechanism. It also stresses that the common 3D visualization is only an intuition aid; the actual representation is large enough to capture fine details such as specific visual attributes.

What does “natively multimodal” enable beyond basic text-to-image search?

The model can embed multiple modalities directly, including video up to 2 minutes and audio without transcription. It can also accept multiple modalities in a single request (e.g., an image plus text) to produce an embedding representing their combined meaning. That enables joint queries such as using an image of a watch band together with a watch description to find related watch videos or images even when the exact target image isn’t present.

How should long videos be handled for more precise retrieval?

Although the model accepts videos up to 2 minutes, the transcript suggests chunking longer videos into smaller segments (for example, 15–30 seconds). Each chunk gets embedded, then a text query can retrieve the most relevant segments. Smaller chunks improve specificity—such as narrowing down when a particular event (e.g., a woman in a red dress) appears in the video.

What input limits and embedding-generation options are mentioned?

Text inputs support up to 8,000 tokens, and up to six images can be passed at once. Video inputs are limited to 2 minutes. For PDFs, embeddings can be created either through a files API or directly from bytes, provided the correct MIME type is supplied (application/pdf).

When building RAG indexes, should a system store one embedding per item or multiple embeddings per part?

The transcript describes two approaches. For a single combined item like a social post with text plus an image, an aggregated embedding can represent the whole post by averaging or combining parts into one embedding. Alternatively, separate embeddings can be stored for each part (e.g., multiple images) to allow more granular retrieval. The choice affects how RAG results map to user intent and how specific the matching can be.
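One common way to realize the aggregated approach is a component-wise mean of the part embeddings, re-normalized to unit length. The transcript mentions averaging/combining without specifying a method, so the helper below is an illustrative sketch, not the model's own aggregation.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def average_embeddings(vectors):
    """One aggregated embedding for a multi-part item (e.g., a post's
    text plus its image): component-wise mean, then re-normalized."""
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return normalize(mean)

text_vec  = normalize([0.9, 0.1, 0.0])  # toy embedding of the post text
image_vec = normalize([0.1, 0.9, 0.0])  # toy embedding of the post image

post_vec = average_embeddings([text_vec, image_vec])
print(post_vec)  # sits "between" the two parts
```

The alternative—storing `text_vec` and `image_vec` as separate index entries pointing at the same post—costs more storage but lets a query match one part precisely.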

Review Questions

  1. What problems do separate modality-specific embedding models and multiple vector stores create, and how does a shared embedding space address them?
  2. How would you design an index for a 25-hour university course using Gemini Embedding 2, given the video length limit and the need for precise answers?
  3. What tradeoffs come with using Matryoshka representation learning to output smaller embeddings instead of the full embedding size?

Key Points

  1. Gemini Embedding 2 is presented as a single natively multimodal embedding model that embeds text, images, audio, PDFs, and videos (up to 2 minutes) into one shared vector space.
  2. Cross-modal retrieval becomes simpler: one index and one query can return semantically similar results across modalities without transcription or format conversion steps.
  3. The model supports multi-modal requests (e.g., image + text together) to embed combined meaning, enabling joint-query use cases like “describe + show” retrieval.
  4. Long videos can be handled by chunking into smaller segments (e.g., 15–30 seconds) to improve timestamp-level specificity for text queries.
  5. Input constraints include up to 8,000 tokens for text and up to six images per request, with video limited to 2 minutes per input.
  6. PDF embeddings can be generated via a files API or from bytes using the correct MIME type (application/pdf).
  7. Matryoshka representation learning allows smaller embedding sizes (half or a quarter of 3,072) to trade fine-grained semantics for faster lookup and reduced storage.

Highlights

Gemini Embedding 2 aims to replace multi-model, multi-index multimodal search pipelines with one API call, one index, and one shared embedding space.
Audio retrieval can work without transcription, and video retrieval can work without converting clips to another format (up to 2 minutes per input).
Matryoshka representation learning offers smaller embeddings to speed similarity search when full 3,072-dimensional vectors aren’t necessary.
Chunking longer videos into 15–30 second segments is suggested to make text queries return more precise moments in the clip.

Topics

  • Multimodal Embeddings
  • Cross-Modal Search
  • RAG Indexing
  • Video Chunking
  • Matryoshka Representation Learning