
FunctionGemma - Function Calling at the Edge

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

FunctionGemma is a specialized, fine-tunable Gemma 270M model designed to perform structured function calling on edge devices, including phones.

Briefing

FunctionGemma brings customizable function calling to a compact Gemma model designed for edge deployment, so apps and games can run locally on phones (and devices like the Jetson Nano) while still letting the model trigger real actions. The core shift is moving beyond "chat-only" behavior: instead of hard-coding tool logic on the client, developers can fine-tune a small model to reliably emit structured function calls for the specific tools their application needs.

At the center is a specialized model built on Gemma 270M (270 million parameters), a base model trained on 6 trillion tokens and positioned as strong for its size in edge/mobile settings. FunctionGemma keeps that small-model footprint but adds training specifically for function calling, including the special tokens and message structure required to represent tool definitions, function-call starts, and tool responses. That training matters because function calling doesn't work well with generic prompting alone; the model must be tuned to produce the correct call format.
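
To make the tool-definition side concrete, here is a minimal sketch of a tool schema using Hugging Face Transformers' tool-use utilities. The `create_calendar_event` function is a hypothetical example, and whether FunctionGemma's chat template consumes schemas exactly this way should be checked against its model card.

```python
from transformers.utils import get_json_schema

def create_calendar_event(title: str, date: str) -> str:
    """
    Create a calendar event on the device.

    Args:
        title: Human-readable event title.
        date: Event date in YYYY-MM-DD format.
    """
    ...

# The chat template renders this JSON schema inside the special-token
# framing the model was trained on.
print(get_json_schema(create_calendar_event))
```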

The practical workflow described is: define a tool schema (the "function" the app can execute), provide a user prompt, and have the model output a function call with arguments. The app then runs the tool locally, feeds the tool's output back into the model as a tool-role message, and the model generates the final response. This mirrors server-side function-calling patterns, but FunctionGemma is optimized to make the same idea feasible on-device.
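
A minimal sketch of that round trip, assuming a hypothetical checkpoint id and Hugging Face's standard tool-call message conventions (the exact call format and how to parse it depend on FunctionGemma's chat template):

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id -- check the model card. The repo is gated, so
# authenticate with `huggingface-cli login` first.
model_id = "google/functiongemma-270m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Schedule 'Team sync' for 2025-03-14."}]

# 1) The model sees the tool schema plus the prompt and emits a function call.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[create_calendar_event],  # function from the sketch above
    add_generation_prompt=True,
    return_tensors="pt",
)
out = model.generate(inputs, max_new_tokens=64)
call_text = tokenizer.decode(out[0, inputs.shape[-1]:])

# 2) The app parses the call, runs the tool locally, and appends the result
#    to the conversation as a tool-role message.
args = {"title": "Team sync", "date": "2025-03-14"}  # parsed from call_text
messages.append({"role": "assistant", "tool_calls": [{
    "type": "function",
    "function": {"name": "create_calendar_event", "arguments": args},
}]})
messages.append({"role": "tool", "name": "create_calendar_event",
                 "content": json.dumps({"status": "created"})})

# 3) The model turns the tool result into the final natural-language reply.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
reply = tokenizer.decode(model.generate(inputs, max_new_tokens=64)[0, inputs.shape[-1]:])
```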

Customization is presented as the main advantage. Out of the box, FunctionGemma can struggle with domain-specific tasks; for example, it may refuse or fail to schedule meetings when it hasn't been fine-tuned for that action. Google's released notebooks demonstrate fine-tuning on a small actions dataset (under 10k rows), where training quickly reduces validation loss and can approach overfitting on so little data. The decisive test is whether the fine-tuned model correctly identifies the intended tool (e.g., "create calendar event"), fills in structured arguments like date and title, and then stops at the function-call boundary.

For deployment, the transcript emphasizes edge readiness: the model is available on Hugging Face as a gated download, works with Hugging Face Transformers out of the box, and can be converted to LiteRT (the mobile/edge runtime successor to TensorFlow Lite) for running inside apps. A mobile app demo and examples using transformers.js are mentioned as ways to try function calling fully locally, in a browser or on a phone.

Overall, FunctionGemma is framed as a concrete path to "tool-using" LLM behavior on constrained hardware: start with a small Gemma model, fine-tune it for your app's exact actions, and export it to LiteRT so the function-calling loop can run locally without a server round-trip. While Gemma 4 isn't released yet, FunctionGemma is positioned as a meaningful step toward practical on-device agents.

Cornell Notes

FunctionGemma adapts the small Gemma 270M model for structured function calling on edge devices, including phones. It relies on special tokens and a tool-call message flow: the model outputs a function call with arguments, the app executes the tool locally, then the tool output is sent back for a final response. Customization is central: out-of-the-box performance can be weak for specific actions (like scheduling meetings) until the model is fine-tuned on an actions dataset. The release includes Hugging Face access (gated), Transformers-based inference notebooks, a fine-tuning notebook using Hugging Face TRL, and a conversion path to LiteRT for mobile deployment.

What makes FunctionGemma different from generic small-LLM prompting for "tools"?

FunctionGemma is trained specifically for function calling, including the special tokens and message structure needed to represent tool definitions, the start of a function call, and tool responses. That training is what enables the model to reliably emit a structured function call (with arguments) rather than attempting to answer directly or failing to follow the required format.

How does the function-calling loop work on-device, step by step?

A developer provides a tool definition (the function schema) and a user prompt. FunctionGemma returns a function call indicating which function to run and with which arguments. The app executes that tool locally, then appends the tool output back to the conversation as a tool-role message. The model then produces the final natural-language answer using the tool result.

Why does fine-tuning matter, and what happens without it?

Without fine-tuning for the specific actions, FunctionGemma may not handle domain tasks correctly. The transcript gives an example where it can't assist with scheduling meetings until it's trained on meeting/calendar-style function calls. After fine-tuning, it recognizes the correct tool (e.g., create calendar event), populates arguments like date and title, and produces the expected function-call output.

What does the fine-tuning setup look like in the provided workflow?

The fine-tuning notebook uses the Hugging Face TRL library and trains on a mobile-actions dataset with fewer than 10k rows. The transcript notes that validation loss drops quickly and that training can approach overfitting on such small data. It recommends an A100 GPU (about 8 minutes of fine-tuning), with batch size and gradient accumulation adjusted down for smaller hardware.
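
As a rough sketch of that setup with TRL's `SFTTrainer` (the dataset id, checkpoint id, and hyperparameters below are placeholders, not the notebook's exact values):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical dataset id -- substitute the actions dataset from the notebook.
dataset = load_dataset("your-org/mobile-actions", split="train")

config = SFTConfig(
    output_dir="functiongemma-actions",
    per_device_train_batch_size=4,   # lower this on smaller GPUs...
    gradient_accumulation_steps=4,   # ...and raise this to keep the effective batch size
    num_train_epochs=3,
    learning_rate=2e-4,              # placeholder, not the notebook's value
)

trainer = SFTTrainer(
    model="google/functiongemma-270m",  # assumed checkpoint id -- check the model card
    train_dataset=dataset,
    args=config,
)
trainer.train()
```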

How is FunctionGemma prepared for mobile/edge deployment after fine-tuning?

After training, the workflow includes pushing the model to Hugging Face and converting checkpoints to LiteRT. LiteRT is described as the modern replacement for TensorFlow Lite, intended for running models on phones and edge devices. This conversion is presented as the key step for embedding the function-calling model into an app.
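
The general PyTorch-to-LiteRT path runs through the `ai-edge-torch` package. The toy module below only illustrates the convert/export API; the official notebook presumably uses ai-edge-torch's LLM-specific converters for Gemma-class models, which involve more setup.

```python
import torch
import ai_edge_torch

# Toy module standing in for the fine-tuned model; Gemma-class LLMs go
# through ai-edge-torch's dedicated generative converters instead.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Trace with a representative input so the converter can fix shapes.
sample_inputs = (torch.randn(1, 8),)
edge_model = ai_edge_torch.convert(Toy().eval(), sample_inputs)
edge_model.export("toy.tflite")  # LiteRT flatbuffer, loadable on-device
```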

Where can developers get and run FunctionGemma weights?

FunctionGemma is hosted on Hugging Face as a gated model, requiring access approval. Once granted, the weights work with Hugging Face Transformers for inference. The transcript also mentions demos via a Google mobile app and the possibility of running in the browser using transformers.js.

Review Questions

  1. What specific training elements (tokens and message structure) enable Function Gemma to produce valid function calls?
  2. Describe the sequence of messages exchanged between the model and the app during a tool call.
  3. Why might Function Gemma fail on a task like scheduling meetings before fine-tuning, and how does fine-tuning change the output?

Key Points

  1. FunctionGemma is a specialized, fine-tunable Gemma 270M model designed to perform structured function calling on edge devices, including phones.

  2. Reliable tool use depends on model training for function-calling special tokens and the tool-call message format, not just prompting.

  3. A complete on-device loop is: model emits a function call → app runs the tool locally → tool output is sent back → model generates the final response.

  4. Out of the box, FunctionGemma can underperform on domain-specific actions; fine-tuning on an actions dataset improves accuracy for those tools.

  5. The release provides Hugging Face Transformers notebooks for inference, TRL-based notebooks for fine-tuning, and a conversion path to LiteRT for mobile deployment.

  6. FunctionGemma weights are available on Hugging Face as a gated download, requiring access approval before use.

Highlights

FunctionGemma turns a small Gemma model into a tool user by training it on the exact function-call token and message protocol.
Customization is the difference between “can’t schedule meetings” and correctly emitting a structured create-calendar-event call with filled arguments.
After fine-tuning, converting to LiteRT is presented as the practical bridge from research notebooks to running on phones.

Topics

Mentioned

  • LLM
  • TRL
  • LiteRT
  • RAG
  • GPU
  • ML