EmbeddingGemma - Micro Embeddings for Mobile Devices
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
EmbeddingGemma is built for on-device text embeddings, enabling semantic search and micro RAG without internet access.
Briefing
EmbeddingGemma is a family of tiny, text-only embedding models designed to run on-device, enabling retrieval, semantic search, clustering, and “micro RAG” workflows on phones, Raspberry Pi-class hardware, or in the browser without relying on internet access. The core pitch is practical: small models that still deliver strong embedding quality, letting developers generate useful vector representations with minimal compute and memory.
The model lineup sits within the broader Gemma push toward two tracks: on-device variants and research-oriented variants. On the on-device side, Gemma 3n arrived with tooling for mobile and edge use, including deployment guidance for developers. On the research side, the Gemma ecosystem includes T5Gemma variants and very small, long-trained models such as Gemma 3 270M, which was highlighted as being trained on 6 trillion tokens, an example of what sustained training can do for compact models. EmbeddingGemma builds on T5Gemma initialization and targets embedding generation specifically.
EmbeddingGemma accepts text up to 2,048 tokens and outputs “Matryoshka embeddings,” meaning multiple embedding sizes are available from the same model: up to 768 dimensions at the largest setting down to 128 at the smallest. The release includes multiple variants, including quantized options and QAT (quantization-aware training) versions, aimed at fitting different hardware constraints. The models are very small, around 300M parameters in the referenced setup, making them suitable for edge deployment where VRAM is limited.
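A minimal sketch of picking a smaller Matryoshka dimension with sentence-transformers; the Hugging Face model id and the 256-dimension choice are assumptions, not the exact setup from the video:

```python
from sentence_transformers import SentenceTransformer

# Ask sentence-transformers to truncate the Matryoshka embedding to 256 dimensions
# (768, 512, 256, and 128 are the typical options). Model id below is assumed;
# substitute whichever EmbeddingGemma checkpoint you actually use.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

vec = model.encode("Which planet is known as the red planet?")
print(vec.shape)  # (256,) -- a smaller vector that can still be compared with cosine similarity
```

Smaller dimensions trade a little quality for less storage and faster similarity search, which matters most on phones and other constrained devices.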
Quality comparisons are a major part of the case. Using MTEB-style evaluation, EmbeddingGemma is presented as outperforming a larger Qwen embedding model despite being smaller, and it holds up well against other models in similar or even larger size ranges. The takeaway is that developers can trade off size and still get embeddings that work across common tasks like retrieval, classification, and clustering.
A concrete example shows how to use the embeddings in Python with the sentence-transformers ecosystem. A query such as “Which planet is known as the red planet?” is encoded into a 768-dimensional vector, the candidate documents are encoded into vectors of the same size, and cosine similarity is computed to rank the documents. The highest-similarity document surfaces the expected answer about Mars.
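A minimal sketch of that ranking step with sentence-transformers; the model id and the candidate documents are assumptions, and `model.similarity` computes cosine similarity for this kind of model:

```python
from sentence_transformers import SentenceTransformer

# Model id is an assumption; use whichever EmbeddingGemma checkpoint you downloaded.
model = SentenceTransformer("google/embeddinggemma-300m")

query = "Which planet is known as the red planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size.",
    "Mars, known for its reddish appearance, is often referred to as the red planet.",
    "Jupiter is the largest planet in our solar system.",
]

# Encode the query and documents into 768-dimensional vectors.
query_emb = model.encode(query)
doc_embs = model.encode(documents)

# Cosine similarity between the query and each document, then pick the best match.
scores = model.similarity(query_emb, doc_embs)[0]
best = int(scores.argmax())
print(scores.tolist())
print("Best match:", documents[best])  # expected: the Mars sentence
```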
For a more end-to-end system, a simple RAG pipeline is assembled using LangGraph/LangChain plus a Chroma vector database. Text from a Gemma 3 270M blog post is chunked, embedded with EmbeddingGemma, and stored. At query time, the system retrieves relevant chunks via embedding similarity and then feeds the retrieved context into a Gemma 3n model to generate an answer. The demo response is not framed as perfect; there is a note that better chat formatting could improve results, but it demonstrates the workflow.
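A minimal sketch of that pipeline with LangChain and Chroma; the model ids, the local blog-post file, and the chunking parameters are assumptions, and generation is shown with a generic Hugging Face pipeline rather than the exact setup from the demo:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from transformers import pipeline

# 1. Chunk the source document (e.g. a local copy of the Gemma 3 270M blog post).
blog_text = open("gemma_3_270m_blog_post.txt").read()  # hypothetical file path
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(blog_text)

# 2. Embed the chunks with EmbeddingGemma and store them in a Chroma vector store.
embeddings = HuggingFaceEmbeddings(model_name="google/embeddinggemma-300m")  # model id assumed
vectorstore = Chroma.from_texts(chunks, embedding=embeddings)

# 3. At question time, retrieve the most similar chunks.
question = "How many tokens was Gemma 3 270M trained on?"
retrieved = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in retrieved)

# 4. Feed the retrieved context plus the question into a small Gemma generator.
generator = pipeline("text-generation", model="google/gemma-3n-E2B-it")  # model id assumed
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
answer = generator(prompt, max_new_tokens=128)[0]["generated_text"]
print(answer)
```

Applying the model's chat template to the prompt (rather than the bare string above) is the kind of formatting improvement the demo notes could sharpen the generated answer.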
Overall, EmbeddingGemma is positioned as a building block for offline or low-latency semantic features on constrained devices: generate embeddings quickly on CPU, keep GPU memory usage small, and combine embeddings with Gemma 3n-class LLMs to deliver retrieval-augmented experiences on the edge.
Cornell Notes
EmbeddingGemma is a set of small, text-only embedding models built for on-device use, aimed at powering micro RAG, semantic search, retrieval, classification, and clustering without needing internet access. It supports inputs up to 2,048 tokens and provides “Matryoshka embeddings,” letting developers choose embedding dimensionality from 768 down to 128. The family includes quantized and QAT variants to better fit mobile and edge hardware constraints. In evaluations using MTEB-style scores, EmbeddingGemma is presented as strong even against larger embedding models, while remaining lightweight enough to run with tiny VRAM footprints and fast CPU inference. A Python example uses sentence-transformers for similarity ranking, and a simple RAG demo stores embedded document chunks in Chroma and retrieves them before generating answers with Gemma 3n.
What makes EmbeddingGemma different from typical embedding models in deployment terms?
How do “Matryoshka embeddings” affect what developers can build?
How does the transcript demonstrate basic embedding-based retrieval in code?
What does the micro RAG pipeline look like in the demo?
Why are quantized and QAT versions highlighted?
Review Questions
- What input length limit and embedding dimensionality range does EmbeddingGemma support in the transcript?
- Describe the steps used to rank documents using query/document embeddings in the Python example.
- In the micro RAG demo, where do embeddings live, and how are they used at question time?
Key Points
1. EmbeddingGemma is built for on-device text embeddings, enabling semantic search and micro RAG without internet access.
2. The models accept text up to 2,048 tokens and output “Matryoshka embeddings” ranging from 768 dimensions down to 128.
3. Multiple variants are offered, including quantized and QAT versions, to fit mobile and edge constraints.
4. MTEB-style evaluation results are presented as strong for EmbeddingGemma, including comparisons against larger embedding models.
5. A basic retrieval workflow encodes a query and documents into vectors, computes similarity, and returns the top-matching document.
6. A simple RAG system uses EmbeddingGemma to embed chunked documents into a Chroma vector store, then retrieves relevant chunks before generating with Gemma 3n.
7. Embeddings are positioned as fast enough to run on CPU with small VRAM needs, making offline semantic features practical on constrained hardware.