EmbeddingGemma - Micro Embeddings for Mobile Devices
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
EmbeddingGemma is built for on-device text embeddings, enabling semantic search and micro RAG without internet access.
Briefing
EmbeddingGemma is a family of tiny, text-only embedding models designed to run on-device, enabling retrieval, semantic search, clustering, and “micro RAG” workflows on phones, Raspberry Pi-class hardware, or in the browser without relying on internet access. The core pitch is practical: small models that still deliver strong embedding quality, letting developers generate useful vector representations with minimal compute and memory.
The model lineup sits within the broader Gemma push toward two tracks: on-device variants and research-oriented variants. On the on-device side, Gemma 3n arrived with tooling for mobile and edge use, including deployment guidance for developers. On the research side, the Gemma ecosystem includes T5Gemma variants and very small, long-trained models such as Gemma 3 270M, which was highlighted as being trained on 6 trillion tokens, an example of what sustained training can do for compact models. EmbeddingGemma builds on T5Gemma initialization and targets embedding generation specifically.
EmbeddingGemma accepts text up to 2,048 tokens and outputs “Matryoshka embeddings,” meaning multiple embedding sizes are available from the same model: up to 768 dimensions at the largest setting down to 128 at the smallest. The release includes multiple variants, including quantized options and QAT (quantization-aware training) versions, aimed at fitting different hardware constraints. The models are very small, around 300M parameters in the referenced setup, making them suitable for edge deployment where VRAM is limited.
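A minimal sketch of picking a smaller Matryoshka dimension with sentence-transformers; the Hugging Face model id and the 256-dimension choice are assumptions, not the exact setup from the video:

```python
from sentence_transformers import SentenceTransformer

# Ask sentence-transformers to truncate the Matryoshka embedding to 256 dimensions
# (768, 512, 256, and 128 are the typical options). Model id below is assumed;
# substitute whichever EmbeddingGemma checkpoint you actually use.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

vec = model.encode("Which planet is known as the red planet?")
print(vec.shape)  # (256,) -- a smaller vector that can still be compared with cosine similarity
```

Smaller dimensions trade a little quality for less storage and faster similarity search, which matters most on phones and other constrained devices.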
Quality comparisons are a major part of the case. Using MTEB-style evaluation, EmbeddingGemma is presented as outperforming a larger Qwen embedding model despite being smaller, and it holds up well against other models in similar or even larger size ranges. The takeaway is that developers can trade off size and still get embeddings that work across common tasks like retrieval, classification, and clustering.
A concrete example shows how to use the embeddings in Python with the sentence-transformers ecosystem. A query such as “Which planet is known as the red planet?” is encoded into a 768-dimensional vector, the candidate documents are encoded into vectors of the same size, and cosine similarity is computed to rank the documents. The highest-similarity document surfaces the expected answer about Mars.
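A minimal sketch of that ranking step with sentence-transformers; the model id and the candidate documents are assumptions, and `model.similarity` computes cosine similarity for this kind of model:

```python
from sentence_transformers import SentenceTransformer

# Model id is an assumption; use whichever EmbeddingGemma checkpoint you downloaded.
model = SentenceTransformer("google/embeddinggemma-300m")

query = "Which planet is known as the red planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size.",
    "Mars, known for its reddish appearance, is often referred to as the red planet.",
    "Jupiter is the largest planet in our solar system.",
]

# Encode the query and documents into 768-dimensional vectors.
query_emb = model.encode(query)
doc_embs = model.encode(documents)

# Cosine similarity between the query and each document, then pick the best match.
scores = model.similarity(query_emb, doc_embs)[0]
best = int(scores.argmax())
print(scores.tolist())
print("Best match:", documents[best])  # expected: the Mars sentence
```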
For a more end-to-end system, a simple RAG pipeline is assembled using LangGraph/LangChain plus a Chroma vector database. Text from a Gemma 3 270M blog post is chunked, embedded with EmbeddingGemma, and stored. At query time, the system retrieves relevant chunks via embedding similarity and then feeds the retrieved context into a Gemma 3n model to generate an answer. The demo response is not framed as perfect; there is a note that better chat formatting could improve results, but it demonstrates the workflow.
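A minimal sketch of that pipeline with LangChain and Chroma; the model ids, the local blog-post file, and the chunking parameters are assumptions, and generation is shown with a generic Hugging Face pipeline rather than the exact setup from the demo:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from transformers import pipeline

# 1. Chunk the source document (e.g. a local copy of the Gemma 3 270M blog post).
blog_text = open("gemma_3_270m_blog_post.txt").read()  # hypothetical file path
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(blog_text)

# 2. Embed the chunks with EmbeddingGemma and store them in a Chroma vector store.
embeddings = HuggingFaceEmbeddings(model_name="google/embeddinggemma-300m")  # model id assumed
vectorstore = Chroma.from_texts(chunks, embedding=embeddings)

# 3. At question time, retrieve the most similar chunks.
question = "How many tokens was Gemma 3 270M trained on?"
retrieved = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in retrieved)

# 4. Feed the retrieved context plus the question into a small Gemma generator.
generator = pipeline("text-generation", model="google/gemma-3n-E2B-it")  # model id assumed
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
answer = generator(prompt, max_new_tokens=128)[0]["generated_text"]
print(answer)
```

Applying the model's chat template to the prompt (rather than the bare string above) is the kind of formatting improvement the demo notes could sharpen the generated answer.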
Overall, EmbeddingGemma is positioned as a building block for offline or low-latency semantic features on constrained devices: generate embeddings quickly on CPU, keep GPU memory usage small, and combine embeddings with Gemma 3n-class LLMs to deliver retrieval-augmented experiences on the edge.
Cornell Notes
EmbeddingGemma is a set of small, text-only embedding models built for on-device use, aimed at powering micro RAG, semantic search, retrieval, classification, and clustering without needing internet access. It supports inputs up to 2,048 tokens and provides “Matryoshka embeddings,” letting developers choose embedding dimensionality from 768 down to 128. The family includes quantized and QAT variants to better fit mobile and edge hardware constraints. In evaluations using MTEB-style scores, EmbeddingGemma is presented as strong even against larger embedding models, while remaining lightweight enough to run with tiny VRAM footprints and fast CPU inference. A Python example uses sentence-transformers for similarity ranking, and a simple RAG demo stores embedded document chunks in Chroma and retrieves them before generating answers with Gemma 3n.
What makes EmbeddingGemma different from typical embedding models in deployment terms?
How do “Matryoshka embeddings” affect what developers can build?
How does the transcript demonstrate basic embedding-based retrieval in code?
What does the micro RAG pipeline look like in the demo?
Why are quantized and QAT versions highlighted?
Review Questions
- What input length limit and embedding dimensionality range does EmbeddingGemma support in the transcript?
- Describe the steps used to rank documents using query/document embeddings in the Python example.
- In the micro RAG demo, where do embeddings live, and how are they used at question time?
Key Points
1. EmbeddingGemma is built for on-device text embeddings, enabling semantic search and micro RAG without internet access.
2. The models accept text up to 2,048 tokens and output “Matryoshka embeddings” ranging from 768 dimensions down to 128.
3. Multiple variants are offered, including quantized and QAT versions, to fit mobile and edge constraints.
4. MTEB-style evaluation results are presented as strong for EmbeddingGemma, including comparisons against larger embedding models.
5. A basic retrieval workflow encodes a query and documents into vectors, computes similarity, and returns the top-matching document.
6. A simple RAG system uses EmbeddingGemma to embed chunked documents into a Chroma vector store, then retrieves relevant chunks before generating with Gemma 3n.
7. Embeddings are positioned as fast enough to run on CPU with small VRAM needs, making offline semantic features practical on constrained hardware.