Llama 3 8B: BIG Step for Local AI Agents! - Full Tutorial (Build Your Own Tools)

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Give the local agent a small set of explicit tools (search Google, check context/RAG, send email) and let the model choose among them via structured, parseable outputs.

Briefing

A local Llama 3 8B agent can be made genuinely useful by giving it a small set of “tools” (Google search, RAG-based retrieval, and email sending) and wiring those tools to the model through lightweight, custom function-calling logic, without relying on LangChain. In the demo, the agent searches the web via SerpAPI, scrapes results from Meta AI and The Verge, embeds the scraped text into a local RAG vault, then answers questions by querying that vault. When asked how many tokens Llama 3 was trained on, it retrieves context from the stored pages and returns a figure of up to 15 trillion tokens.

The practical payoff is that the agent doesn’t just generate text; it triggers actions. After retrieving the training-token claim, it uses a dedicated “send email” function to email the information to the user, with the transcript showing “email sent successfully” and the received message containing the retrieved claim. The creator emphasizes that this works on an 8B model running locally via Ollama (rendered as “AMA” in the transcript), and that instruction-following is strong enough to drive tool use reliably, something the tutorial contrasts with earlier attempts using smaller local models.

Under the hood, the system is built around three core functions: a search Google function (using SerpAPI to return URLs), a scrape-and-store step that adds page text into a RAG vault, and a check context function that queries the RAG system. A fourth tool, send email, is added to demonstrate outward actions beyond information retrieval. The “intelligent” part happens in the chat loop: user requests are interpreted so the model can decide when to call a tool, then return a structured response that includes a special wrapper instruction.
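
The transcript doesn’t reproduce the code verbatim, but a minimal sketch of the first two tools might look like the following. The SerpAPI client (the google-search-results package), BeautifulSoup scraping, Ollama’s local embeddings endpoint, the mxbai-embed-large model name, and the in-memory VAULT are all assumptions standing in for the tutorial’s actual choices:

```python
import requests
from bs4 import BeautifulSoup      # pip install beautifulsoup4
from serpapi import GoogleSearch   # pip install google-search-results

SERPAPI_KEY = "your-serpapi-key"
VAULT = []  # in-memory stand-in for the tutorial's RAG vault

def search_google(query: str, num_results: int = 3) -> list[str]:
    """Return the top organic Google result URLs for a query via SerpAPI."""
    results = GoogleSearch(
        {"q": query, "api_key": SERPAPI_KEY, "num": num_results}
    ).get_dict()
    return [r["link"] for r in results.get("organic_results", [])][:num_results]

def embed(text: str) -> list[float]:
    """Embed text with a local Ollama embedding model (model name assumed)."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "mxbai-embed-large", "prompt": text},
    )
    return resp.json()["embedding"]

def scrape_and_store(url: str, chunk_size: int = 1000) -> None:
    """Scrape a page, chunk its visible text, and embed each chunk into the vault."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    text = " ".join(soup.get_text().split())
    for i in range(0, len(text), chunk_size):
        chunk = text[i : i + chunk_size]
        VAULT.append((chunk, embed(chunk)))
```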

That wrapper is parsed by a “parse function call” routine acting like a detector. It scans the model’s output for specific tags (the transcript calls them wrapper tags and a secret instruction note), extracts a JSON-like instruction payload, converts it into a simple Python dictionary, and then executes the requested function with the provided arguments. The tutorial stresses that the model must fill in argument values—like the Google query—by replacing placeholders with the user’s intent.
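
The transcript doesn’t name the exact tags, so this sketch invents a <function_call> wrapper as a stand-in; the detect–extract–dispatch shape is the point, not the tag spelling:

```python
import json
import re

# Placeholder stubs; real implementations appear in the surrounding sketches.
# The dispatch table maps function names to callables.
TOOLS = {
    "search_google": lambda query: f"(would search Google for: {query})",
    "check_context": lambda question: "(would query the RAG vault)",
    "send_email": lambda subject, body: "(would send the email)",
}

def parse_function_call(model_output: str):
    """Detect a wrapper tag in the model's output and run the named tool.

    The <function_call> tag is an assumption; the video uses its own
    wrapper tags and a "secret instruction note", not necessarily these.
    """
    match = re.search(r"<function_call>(.*?)</function_call>",
                      model_output, re.DOTALL)
    if match is None:
        return None  # ordinary chat turn, no tool requested
    payload = json.loads(match.group(1))         # JSON-like text -> Python dict
    func = TOOLS[payload["name"]]                # the model chose the tool
    return func(**payload.get("arguments", {}))  # the model filled the arguments
```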

The tutorial also demonstrates end-to-end workflows: searching for available Ollama models via Google, checking via RAG whether Ollama hosts the Llama 3 model (surfacing the “ollama pull llama3” command to import it), and then sending the resulting guidance by email. Finally, it shows how to extend the agent with a new custom tool: a “Write to notes” function that appends user-provided content to notes.txt. The system message is updated to instruct the model to emit the correct function-call wrapper when users ask to write notes, the function schema is added to the OpenAI-style function list, and a small conditional block in the chat logic triggers the new tool.

Overall, the central insight is that a local agent becomes practical when tool calls are deterministic and parseable: the model decides what to do, but the surrounding code enforces how tools are invoked, how retrieved context is stored, and how actions like emailing and file writes are executed.

Cornell Notes

Llama 3 8B can run as a local agent that performs real tasks by combining tool functions with a simple, custom function-calling protocol. The workflow starts with a Google search tool (via SerpAPI), scrapes top URLs, embeds the text into a local RAG vault, and answers questions by querying that vault. A separate “send email” tool turns retrieved answers into an action, and the transcript shows successful email delivery. Tool calls are triggered through structured wrapper tags in the model’s output; a parse routine extracts a JSON-like instruction, converts it into a Python dictionary, and executes the requested function with model-filled arguments. The tutorial then extends the system with a “Write to notes” tool that appends content to notes.txt.

How does the agent turn a natural-language request into a tool action like web search?

The chat loop feeds user input to the model along with a system message that instructs tool usage. When the user request contains phrases like “search Google,” the model returns a structured response containing wrapper tags plus a secret instruction note. A parse function call routine scans the model output for those wrapper tags, extracts the embedded instruction payload, converts it into a simple Python dictionary, and then executes the named function (e.g., search Google) with arguments such as the query string derived from the user’s intent.
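
As a rough sketch of that loop, assuming the model is served by Ollama’s /api/chat endpoint and reusing parse_function_call from the sketch above (the system-message wording here is illustrative, not the video’s):

```python
import requests

SYSTEM_MESSAGE = (
    "You are a local agent with tools. When the user asks you to search "
    "Google, check your context, or send an email, reply ONLY with "
    "<function_call>{\"name\": ..., \"arguments\": {...}}</function_call>, "
    "replacing the placeholders with values taken from the user's request."
)

history = [{"role": "system", "content": SYSTEM_MESSAGE}]

def chat_turn(user_input: str) -> str:
    """One pass of the chat loop: get a model reply, then dispatch any tool call."""
    history.append({"role": "user", "content": user_input})
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3", "messages": history, "stream": False},
    )
    reply = resp.json()["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    result = parse_function_call(reply)  # from the parser sketch above
    return reply if result is None else str(result)
```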

What role does RAG play in answering factual questions in the demo?

RAG provides grounded answers by searching embedded content stored in a local vault. After the agent uses the search Google function to fetch URLs, it scrapes the pages and adds their text to the vault. When asked a question like “how many tokens was llama 3 trained on,” the agent uses a check context function to query the vault and returns the answer based on the retrieved embedded text, which in the transcript is tied to a claim of up to 15 trillion tokens.
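
A check context sketch in the same spirit, reusing embed() and VAULT from the earlier sketch and ranking stored chunks by cosine similarity (the tutorial’s actual retrieval code may differ):

```python
import numpy as np

def check_context(question: str, top_k: int = 3) -> str:
    """Return the vault chunks most similar to the question."""
    q = np.array(embed(question))  # embed() and VAULT from the earlier sketch

    def cosine(vec: list[float]) -> float:
        v = np.array(vec)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    ranked = sorted(VAULT, key=lambda item: cosine(item[1]), reverse=True)
    return "\n---\n".join(chunk for chunk, _ in ranked[:top_k])
```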

How are external actions handled beyond information retrieval?

The system includes a dedicated send email function. Once the agent has retrieved information via RAG, it can trigger the email tool by emitting the correct function-call wrapper with the email content as an argument. The transcript shows “email sent successfully,” followed by the received message containing the retrieved claim that Llama 3 kept improving log-linearly when trained on up to 15 trillion tokens.
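
The transcript doesn’t show the mail code; a plain smtplib version could look like this, with the SMTP host, addresses, and app password all placeholders:

```python
import smtplib
from email.message import EmailMessage

def send_email(subject: str, body: str,
               to_addr: str = "you@example.com") -> str:
    """Send a plain-text email with the retrieved content as the body."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "agent@example.com"   # placeholder sender
    msg["To"] = to_addr                 # placeholder recipient
    msg.set_content(body)
    # Host and credentials are assumptions; the video's provider may differ.
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login("agent@example.com", "your-app-password")
        server.send_message(msg)
    return "email sent successfully"
```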

What makes the function-calling approach work without LangChain?

Tool invocation is deterministic because it relies on custom parsing of wrapper tags rather than a higher-level framework. The model’s output is monitored for specific tags; the parse function call routine extracts a JSON-like instruction, translates it into a Python dictionary, and then runs the corresponding function from a predefined list. The tutorial also notes an OpenAI-style function conversion step (“convert to OpenAI function”) so the model has clear function descriptions and parameter schemas.
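
The transcript mentions the conversion step but not the resulting schemas; one entry in that OpenAI-style list might plausibly look like this, with the descriptions pasted into the system message so the model knows each tool’s name and parameters:

```python
FUNCTIONS = [
    {
        "name": "search_google",
        "description": "Search Google via SerpAPI and return the top result URLs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search terms, filled in from the user's request.",
                }
            },
            "required": ["query"],
        },
    },
    # check_context and send_email would be described the same way.
]
```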

How can the agent be extended with a new tool like writing to a file?

To add “Write to notes,” the system message is updated with instructions that detect user intent (e.g., “write note,” or the homophone “right note”) and require the model to emit the correct function-call wrapper. The function schema is added to the functions list with a parameter like note_content (string). Finally, the chat logic includes an if-statement that routes the parsed instruction to the right function, which appends the content to notes.txt. The demo confirms the tool by writing an extracted email address to notes.txt.
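
Putting those three changes together, a sketch of the new tool might look like the following; the video routes the parsed call with an explicit if-statement, which the dispatch-table update below stands in for:

```python
def write_to_notes(note_content: str) -> str:
    """Append user-provided content to notes.txt, one note per line."""
    with open("notes.txt", "a", encoding="utf-8") as f:
        f.write(note_content + "\n")
    return "note saved"

# 1) Register the implementation so the parser can dispatch to it
#    (TOOLS and FUNCTIONS come from the earlier sketches).
TOOLS["write_to_notes"] = write_to_notes

# 2) Describe it in the OpenAI-style function list (schema wording assumed).
FUNCTIONS.append({
    "name": "write_to_notes",
    "description": "Append a note to notes.txt.",
    "parameters": {
        "type": "object",
        "properties": {
            "note_content": {"type": "string",
                             "description": "The text to append."}
        },
        "required": ["note_content"],
    },
})

# 3) The system message gains a rule telling the model to emit the
#    write_to_notes wrapper whenever the user asks to write a note.
```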

Review Questions

  1. What specific wrapper-tag mechanism does the parse function call routine use to decide which tool to execute?
  2. How does the agent ensure answers come from scraped web content rather than only from the model’s prior knowledge?
  3. What changes are required in the system message, function schema, and chat logic to add a new tool like Write to notes?

Key Points

  1. Give the local agent a small set of explicit tools (search Google, check context/RAG, send email) and let the model choose among them via structured, parseable outputs.

  2. Use SerpAPI to fetch Google results, scrape the top URLs, and embed the scraped text into a local RAG vault for grounded retrieval.

  3. Implement a parse function call routine that detects wrapper tags in the model output, extracts a JSON-like instruction, converts it into a Python dictionary, and executes the requested function.

  4. Design tool arguments so the model must fill in user-derived values (e.g., the Google query string) while the code enforces the function name and parameter structure.

  5. Maintain a conversation history in the chat loop so follow-up requests can reuse prior context and tool outputs.

  6. Extend functionality by updating the system message intent rules, adding the new function schema (parameters/descriptions), and adding a conditional branch in the chat logic to run the new tool.

Highlights

  • The agent retrieves a factual claim by scraping two URLs (Meta AI and The Verge), embedding them into a local RAG vault, and answering from that stored context.
  • A dedicated send email tool turns retrieved information into an action; the transcript shows successful email delivery containing the RAG-based answer.
  • Tool calls are triggered through wrapper tags in the model’s output, then executed via a custom parse-and-dispatch routine; no LangChain required.
  • Adding a new capability is mostly plumbing: update the system message, register the function schema, and add a small if-statement to execute it.
  • The demo emphasizes that an instruction-following 8B local model can drive reliable tool use when the function-calling protocol is strict and parseable.

Topics

Mentioned

  • RAG
  • Ollama
  • LLM
  • JSON
  • API