Gemini Embedding 2 - Audio, Text, Images, Docs, Videos
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Gemini Embedding 2 is presented as a single natively multimodal embedding model that embeds text, images, audio, PDFs, and videos (up to 2 minutes) into one shared vector space.
Briefing
Gemini Embedding 2 is positioned as a single, natively multimodal embedding model that collapses what used to require many separate pipelines—one model per modality, multiple vector stores, and extra fusion or reranking logic. Instead of converting audio to text or images to captions and then searching across separate indexes, Gemini Embedding 2 can embed text, images, up to 2-minute videos, audio files, and PDFs directly into a shared vector space. The practical payoff is straightforward: one API call, one index, and one query can retrieve semantically similar content across modalities—text-to-image, image-to-video, audio-to-video, and more—without stitching together five different systems.
At the core is the embedding concept: the model turns any supported input—whether a sentence, a photo, a clip, a recording, or a PDF—into a high-dimensional vector that captures semantic meaning. Similar vectors land close together, so similarity search becomes a cross-modal retrieval mechanism. The transcript emphasizes that these representations live in thousands of dimensions (far beyond the simple 3D intuition often used to visualize them), enabling fine-grained matching such as recognizing not only “cats,” but also visual attributes like black-and-white patterns and specific facial characteristics.
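To make the similarity idea concrete, here is a minimal sketch of the math behind "similar vectors land close together," independent of any particular SDK. The tiny 4-dimensional vectors are invented stand-ins for real embeddings, which have thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means very similar meaning, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for real embeddings (real vectors have thousands of dimensions,
# e.g. 3,072 for the full-size output described in the transcript).
query_vec = np.array([0.12, 0.80, 0.05, 0.31])
image_vec = np.array([0.10, 0.78, 0.07, 0.29])   # close in meaning to the query
audio_vec = np.array([0.90, 0.02, 0.65, 0.01])   # unrelated content

print(cosine_similarity(query_vec, image_vec))  # high score -> likely match
print(cosine_similarity(query_vec, audio_vec))  # low score  -> likely not a match
```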
The model’s multimodal capability goes beyond “choose one modality.” It can accept multiple modalities in a single request—for example, combining an image and a text description to produce one embedding representing their joint meaning. That enables workflows like searching for watch videos using an embedding derived from “a watch description plus a watch-band image,” even when the exact target image isn’t available. The demo described in the transcript illustrates this behavior: image queries return both matching images and relevant videos, and speech queries return images and videos containing the referenced subject.
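A hedged sketch of what such a combined request might look like with the Google Gen AI SDK is below. The model id, the file name, and the assumption that embed_content accepts an image Part alongside text are illustrative guesses based on the transcript's description, not confirmed API behavior.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Hypothetical model id; substitute the actual Gemini Embedding 2 identifier.
MODEL_ID = "gemini-embedding-2"

with open("watch_band.jpg", "rb") as f:  # hypothetical local image
    image_bytes = f.read()

# One request combining a text description with an image, so the returned
# vector represents their joint meaning ("describe + show" retrieval).
result = client.models.embed_content(
    model=MODEL_ID,
    contents=[
        "A stainless-steel dive watch with a black rubber band",
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
    ],
)
joint_vector = result.embeddings[0].values
```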
For long-form content, the transcript notes a practical strategy: chunking. While the model accepts videos up to 2 minutes, longer videos (hours) can be split into smaller segments (e.g., 15–30 seconds) so text queries can retrieve more precise timestamps. A similar idea applies to building searchable university course libraries: encode lesson video, audio, and slide PDFs, then ask questions like which lessons covered a specific topic.
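As a rough sketch of that chunking strategy (the actual clipping and embedding calls are omitted, and all names here are hypothetical), the helper below splits a long video's timeline into fixed windows and keeps the start/end timestamps that later make retrieval results precise.

```python
from dataclasses import dataclass

@dataclass
class VideoChunk:
    source: str
    start_s: float   # chunk start timestamp in seconds
    end_s: float     # chunk end timestamp in seconds

def chunk_timestamps(duration_s: float, window_s: float = 30.0,
                     source: str = "lecture_01.mp4") -> list[VideoChunk]:
    """Split a long video's timeline into fixed windows for per-chunk embedding."""
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        chunks.append(VideoChunk(source=source, start_s=start, end_s=end))
        start = end
    return chunks

# A 90-minute lecture becomes 180 thirty-second chunks. Each chunk would then be
# clipped (e.g., with ffmpeg) and embedded separately; storing start_s/end_s next
# to each vector lets a text query resolve to a precise timestamp, not a whole file.
for chunk in chunk_timestamps(90 * 60)[:3]:
    print(chunk)
```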
The transcript also details constraints and implementation choices. Text inputs support up to 8,000 tokens, and the model can take up to six images per request. For PDFs, embeddings can be generated either via the Files API or directly from bytes using the correct MIME type (application/pdf). When multiple pieces of content are involved—such as a social post with text plus an image—there’s a choice between producing one aggregated embedding for the whole post or generating separate embeddings per part.
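For the bytes-based PDF path, a minimal sketch might look like the following. The model id is hypothetical, and whether embed_content accepts a PDF Part this way is assumed from the transcript's description rather than verified.

```python
from google import genai
from google.genai import types

client = genai.Client()

with open("lecture_slides.pdf", "rb") as f:  # hypothetical local file
    pdf_bytes = f.read()

# Embedding a PDF directly from bytes; the MIME type must be application/pdf.
# (The alternative mentioned in the transcript is uploading via the Files API first.)
result = client.models.embed_content(
    model="gemini-embedding-2",  # hypothetical id
    contents=types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
)
pdf_vector = result.embeddings[0].values
```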
On performance, Google’s published benchmarks are referenced without a deep walkthrough: Gemini Embedding 2 is said to improve text-to-text similarity versus gemini-embedding-001 and to outperform other multimodal baselines for image-to-text and text-to-image retrieval. A key technical feature is Matryoshka Representation Learning, which allows smaller embedding sizes (e.g., half or a quarter of the full 3,072 dimensions) when full dimensionality isn’t needed—trading storage and speed for potentially less fine-grained semantics.
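A short sketch of requesting a reduced embedding size follows. The output_dimensionality option is the knob exposed for gemini-embedding-001 in the Google Gen AI SDK; it is assumed here that Gemini Embedding 2 exposes the same option, and the model id is illustrative.

```python
from google import genai
from google.genai import types

client = genai.Client()

# Request a reduced embedding size (half of 3,072) instead of the full vector.
result = client.models.embed_content(
    model="gemini-embedding-2",  # hypothetical id
    contents="Which lessons covered backpropagation?",
    config=types.EmbedContentConfig(output_dimensionality=1536),
)
small_vector = result.embeddings[0].values
print(len(small_vector))  # 1536 instead of the full 3,072
```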
Finally, the transcript highlights ecosystem readiness: Gemini Embedding 2 is available through the Gemini API in AI Studio and on Vertex AI, with day-zero support mentioned for agentic frameworks like LangChain and LlamaIndex and for vector databases such as Chroma and Qdrant. A notebook walkthrough reinforces the mechanics—using the Google Gen AI SDK, embedding content via client.models, and computing similarity scores across modalities—while the overall message stays consistent: multimodal search becomes simpler, easier to maintain, and more unified when everything lands in one embedding space.
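To round out the notebook mechanics, the toy ranking sketch below illustrates the "one index, one query" pattern: every stored vector, whatever modality it came from, is scored against a single query vector. The random vectors are placeholders for real embeddings returned by the SDK.

```python
import numpy as np

def top_k(query_vec: np.ndarray, index: dict[str, np.ndarray], k: int = 3):
    """Rank every stored item (image, video chunk, audio clip, PDF page, text)
    in one shared index against a single query vector."""
    def score(v: np.ndarray) -> float:
        return float(np.dot(query_vec, v) /
                     (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    return sorted(((item_id, score(v)) for item_id, v in index.items()),
                  key=lambda pair: pair[1], reverse=True)[:k]

# Placeholder vectors standing in for embeddings produced by client.models.embed_content;
# in practice these come from images, 15-30 s video chunks, audio, PDFs, and text alike.
rng = np.random.default_rng(0)
index = {f"item_{i}": rng.normal(size=8) for i in range(5)}
query = rng.normal(size=8)
print(top_k(query, index))
```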
Cornell Notes
Gemini Embedding 2 is described as a single multimodal embedding model that can embed text, images, audio, PDFs, and videos (up to 2 minutes) directly into one shared vector space. That design replaces earlier setups that required separate embedding models, multiple vector stores, and extra fusion/reranking to search across modalities. The transcript explains embeddings as high-dimensional semantic “addresses,” where similarity search retrieves content with related meaning—so a text query can return images, videos, audio, or PDFs, and an image or audio query can return text and other media. It also supports multi-part inputs (e.g., image + text in one request) and offers Matryoshka Representation Learning to output smaller embeddings for speed and storage tradeoffs. The practical result is one index and one query for cross-modal retrieval.
Why does a single multimodal embedding space matter for search systems?
How do embeddings turn different media into something searchable?
What does “natively multimodal” enable beyond basic text-to-image search?
How should long videos be handled for more precise retrieval?
What input limits and embedding-generation options are mentioned?
When building RAG indexes, should a system store one embedding per item or multiple embeddings per part?
Review Questions
- What problems do separate modality-specific embedding models and multiple vector stores create, and how does a shared embedding space address them?
- How would you design an index for a 25-hour university course using Gemini Embedding 2, given the video length limit and the need for precise answers?
- What tradeoffs come with using Matryoshka Representation Learning to output smaller embeddings instead of the full embedding size?
Key Points
1. Gemini Embedding 2 is presented as a single natively multimodal embedding model that embeds text, images, audio, PDFs, and videos (up to 2 minutes) into one shared vector space.
2. Cross-modal retrieval becomes simpler: one index and one query can return semantically similar results across modalities without transcription or format conversion steps.
3. The model supports multi-modal requests (e.g., image + text together) to embed combined meaning, enabling joint-query use cases like "describe + show" retrieval.
4. Long videos can be handled by chunking into smaller segments (e.g., 15–30 seconds) to improve timestamp-level specificity for text queries.
5. Input constraints include up to 8,000 tokens for text and up to six images per request, with video limited to 2 minutes per input.
6. PDF embeddings can be generated via the Files API or from bytes using the correct MIME type (application/pdf).
7. Matryoshka Representation Learning allows smaller embedding sizes (half or a quarter of 3,072) to trade fine-grained semantics for faster lookup and reduced storage.