Build Anything with Local Agents, Here’s How

David Ondrej · 5 min read

Based on David Ondrej's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Install Ollama locally to run open-source LLMs on the user’s machine and avoid API fees while keeping data private.

Briefing

Running AI agents locally—without paying API fees—hinges on two pieces: a local model runtime (Ollama) and an agent framework (CrewAI). The setup described pairs Ollama with open-source models so chats and outputs stay on the user’s machine, then uses CrewAI to coordinate multiple specialized agents that generate system prompts and task outputs. The practical payoff is automation: instead of hand-writing system prompts for every new agent, a “prompt engineering” agent can draft structured instructions in seconds.

The walkthrough begins by installing Ollama from ollama.com, launching it, and then installing CrewAI in a Python environment. It emphasizes environment hygiene (creating a dedicated virtual environment in VS Code) and then installs dependencies for agent orchestration and model integration. For model access, it uses LangChain’s Ollama integration, noting that the community package is the recommended path. From there, the guide selects a specific open-source model—Mixtral (a Mixture of Experts model)—and explains how model choice depends on hardware and speed. Mixtral’s size and resource demands are framed against typical consumer setups: smaller models like 7B are easier on memory, while larger variants (13B, 47B, 70B) require more capable machines.
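
As a rough sketch of that setup (the package names, shell commands, and model tag below are assumptions based on the tools named here, not commands quoted from the video), the environment and model wiring might look like this:

    # One-time shell setup (assumed commands):
    #   python -m venv .venv && source .venv/bin/activate
    #   pip install crewai langchain-community
    #   ollama pull mixtral

    from langchain_community.llms import Ollama

    # Point LangChain's community Ollama integration at the locally running model.
    mixtral = Ollama(model="mixtral")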

Once Mixtral is running in Ollama, the code links the model into the agent framework so CrewAI can call it automatically. The next step is defining a reusable workflow driven by a variable called topic. That variable lets the same agent team generate prompt instructions for different domains—web development, customer support, programming, financial analysis, or essay writing—without rewriting everything.
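
In code, that reusable piece is just a string; a minimal sketch, assuming the variable name from the walkthrough:

    # Changing this single variable retargets the whole agent team.
    topic = "web development"  # e.g. "customer support", "programming", "financial analysis", "essay writing"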

Two agents are created: a research agent and a prompt engineer. Each agent receives a goal and a backstory formatted as system-style instructions, with verbosity enabled and delegation disabled to keep the workflow focused. The research agent’s job is to gather relevant information about what an expert in the chosen topic would do; the prompt engineer then turns that understanding into a single structured system prompt written in Markdown. Both agents are paired with tasks that specify descriptions and “expected output” examples to improve consistency.
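
A hedged sketch of how those agents and tasks could be declared with CrewAI (the role names, goal wording, and expected-output text are illustrative stand-ins, and passing the LangChain model handle via llm= assumes a CrewAI version that accepts it):

    from crewai import Agent, Task

    # Research agent: gathers what an expert in the chosen topic focuses on.
    researcher = Agent(
        role=f"{topic} researcher",
        goal=f"Collect the key practices an expert in {topic} relies on",
        backstory=f"You study how top professionals approach {topic}.",
        llm=mixtral,               # the local Ollama model rather than an API
        verbose=True,              # print intermediate output
        allow_delegation=False,    # keep the workflow focused
    )

    # Prompt engineer: turns that research into one structured system prompt.
    prompt_engineer = Agent(
        role="Prompt engineer",
        goal=f"Write a single structured system prompt for a {topic} assistant",
        backstory="You convert research notes into clear Markdown system prompts.",
        llm=mixtral,
        verbose=True,
        allow_delegation=False,
    )

    research_task = Task(
        description=f"Summarize how an expert approaches {topic}.",
        expected_output="A concise bullet list of expert practices.",
        agent=researcher,
    )

    prompt_task = Task(
        description="Turn the research into a system prompt written in Markdown.",
        expected_output="A Markdown system prompt with clearly separated sections.",
        agent=prompt_engineer,
    )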

A CrewAI “crew” ties the agents and tasks together in a sequential process: research first, then prompt engineering. The walkthrough includes troubleshooting points—such as using the correct parameter names (llm vs model) and the plural “agents” and “tasks” keys in the crew definition—before showing successful runs. The resulting system prompts are presented as already strong: when the topic is switched from web development to customer support, the generated instructions include clear guidelines on policies, clarity, and minimizing confusion.
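
Assembling and running the crew could then look like this minimal sketch:

    from crewai import Crew, Process

    # Sequential process: the research task always runs before prompt engineering.
    crew = Crew(
        agents=[researcher, prompt_engineer],   # note the plural keys: agents / tasks
        tasks=[research_task, prompt_task],
        process=Process.sequential,
    )

    result = crew.kickoff()
    print(result)   # the generated Markdown system prompt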

Overall, the core insight is that local agent teams can produce usable, structured system prompts quickly—often better than what many teams write manually—while keeping data private and avoiding recurring API costs. The guide positions this as a starting point, with a promise of further optimization in a follow-up.

Cornell Notes

The setup combines Ollama (to run open-source LLMs locally) with CrewAI (to coordinate multiple agents). After installing both and linking CrewAI to an Ollama model like Mixtral, the workflow defines a variable “topic” so the same agent team can generate domain-specific system prompts. Two agents—one for research and one for prompt engineering—run sequential tasks: the research agent gathers key points about an expert approach, then the prompt engineer converts that into a structured Markdown system prompt. Expected-output examples and disabled delegation help keep results consistent. This matters because it automates prompt creation without API fees and keeps chats on the user’s machine.

Why does the guide insist on running models locally with Ollama instead of using an API?

It frames local execution as a way to avoid paying API costs while keeping chats private. The workflow runs open-source models directly on the user’s computer via Ollama, so prompts and outputs don’t need to leave the machine. Privacy and cost control are treated as the main motivation for the entire setup.

How does the choice of Mixtral (and other model sizes) affect what a user can run?

Model choice is tied to hardware and speed. The guide describes smaller models like 7B as the practical ceiling for many midrange machines, while larger options like 13B, Mixtral’s 47B, and 70B suit higher-end systems. It also notes that slower internet can matter for downloads, and that larger models take longer to run and require more memory.
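
For illustration, switching model sizes is just a different tag passed to the same integration (the tag names below are assumptions about Ollama’s model library, not taken from the video):

    from langchain_community.llms import Ollama

    # Lighter option for midrange machines (assumed 7B tag in Ollama's library).
    small_llm = Ollama(model="mistral")

    # Mixtral's mixture-of-experts model needs considerably more RAM and runs slower.
    big_llm = Ollama(model="mixtral")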

What role does the “topic” variable play in making the agent team reusable?

The “topic” variable is inserted into the agents’ goals and backstories using f-strings. That means the same research-and-prompt-engineering pipeline can be reused for web development, customer support, programming, financial analysis, or essay writing simply by changing one variable, rather than rewriting prompts and tasks from scratch.

Why split work into two agents—researcher and prompt engineer—instead of one?

The two-step design improves structure and consistency. The research agent focuses on extracting relevant information about how an expert would approach the topic, while the prompt engineer turns that understanding into a single structured system prompt in Markdown. Expected-output examples further constrain what the prompt engineer should produce.

What configuration details are used to keep agent behavior predictable?

The guide sets verbosity (true for more output), disables delegation (so the research agent doesn’t hand off work), and uses “expected output” fields in tasks to provide a concrete example of the desired format. It also runs the crew with a sequential process so the prompt engineer always follows the research step.

What common coding mistakes does the walkthrough warn about?

It flags parameter name issues and typos: using “model” vs “llm” when wiring the CrewAI run, and using “agent” vs “agents” / “task” vs “tasks” in the crew definition. These small naming errors can prevent execution even when the logic is correct.

Review Questions

  1. If you wanted to generate system prompts for a new domain (e.g., financial analysis), which parts of the code would you change—agents, tasks, or only the topic variable? Why?
  2. How do expected-output examples and sequential processing work together to improve consistency in multi-agent outputs?
  3. What hardware-related tradeoffs influence whether you choose Mixtral versus a smaller model like a 7B option?

Key Points

  1. Install Ollama locally to run open-source LLMs on the user’s machine and avoid API fees while keeping data private.

  2. Use a dedicated Python environment (e.g., in VS Code) before installing CrewAI and related dependencies to prevent dependency conflicts.

  3. Link CrewAI to an Ollama model through LangChain’s Ollama integration so agents can call the local model automatically.

  4. Choose a model size based on available RAM and desired speed; larger Mixtral variants require more memory and can run slower.

  5. Create a reusable “topic” variable and inject it into agent goals and backstories so the same agent team can handle many domains.

  6. Define separate agents for research and prompt engineering, then run tasks sequentially to produce a structured system prompt in Markdown.

  7. Use task “expected output” examples and disable delegation to make outputs more consistent and less chaotic.

Highlights

  • Ollama plus CrewAI enables multi-agent prompt generation entirely on-device, aiming to eliminate recurring API costs and keep chats private.
  • A single “topic” variable drives domain switching—web development to customer support—without rebuilding the agent team.
  • Two-agent sequencing (research first, prompt engineering second) produces structured system prompts quickly, often in seconds.
  • Model choice is a practical constraint: Mixtral variants scale memory and runtime, so hardware determines what’s feasible.

Mentioned

  • LLM
  • API
  • VS Code
  • macOS
  • CPU
  • RAM
  • Mixture of Experts
  • Mixtral
  • Ollama
  • CrewAI
  • LangChain