Build AI Agent from 0 to Production Deployment | LangChain, Ollama, MLflow & Docker (Full Tutorial)

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

An agent loop lets an LM decide between stopping and calling tools, then uses tool outputs to produce the final answer.

Briefing

A unit-conversion AI agent can be built end-to-end—from a single-tool loop to a streaming REST API—then packaged into a Docker container and deployed to production. The practical payoff is a working service where one request like “convert 10 mi to kilometers and 150 lbs to kilograms” triggers multiple tool calls (distance and weight) and returns the final converted values in a single response.

The build starts with a clear definition of an agent: a language model (LM) that has access to an environment plus a set of callable tools. In this setup, the LM receives the user’s question (optionally with chat history) and decides whether it should stop or call one or more tools. Tool outputs are fed back into the LM, which then produces the final answer. The agent’s control flow is implemented as a loop that keeps running until the LM indicates completion.
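
The control flow described above can be sketched in a few lines. This is a minimal illustration assuming LangChain's chat-model interface and a local Ollama model; the model name, message wiring, and helper names are assumptions, not the video's exact code.

```python
# Minimal agent loop sketch (assumptions: LangChain + Ollama, model name illustrative).
from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_ollama import ChatOllama

def run_agent(question: str, tools: list, system_prompt: str) -> str:
    llm = ChatOllama(model="llama3.2").bind_tools(tools)  # model name is an assumption
    tools_by_name = {t.name: t for t in tools}
    messages = [SystemMessage(system_prompt), HumanMessage(question)]
    while True:
        response = llm.invoke(messages)
        messages.append(response)
        if not response.tool_calls:       # LM chose to stop: return the final answer
            return response.content
        for call in response.tool_calls:  # otherwise run each requested tool
            result = tools_by_name[call["name"]].invoke(call["args"])
            messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
```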

The first milestone is a rudimentary “single tool” agent for weight conversion. A conversion tool is defined as a LangChain tool that takes inputs like a numeric value plus source and target units, then performs the conversion using standard factors (for example, 1 kg ≈ 2.2 lb). A system prompt instructs the LM to use the provided tools to complete the task. During execution, the LM emits a tool call, the tool returns the computed result, and the LM formats the final response.
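
A hedged sketch of what such a tool could look like with LangChain's `@tool` decorator; the function name, argument names, and factor table are illustrative, not the video's exact code.

```python
from langchain_core.tools import tool

# Conversion factors expressed relative to kilograms (1 kg ≈ 2.2 lb, per the video).
WEIGHT_FACTORS_TO_KG = {"kg": 1.0, "lb": 1.0 / 2.2, "lbs": 1.0 / 2.2}

@tool
def convert_weight(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a weight between units such as kg and lb."""
    kilograms = value * WEIGHT_FACTORS_TO_KG[from_unit.lower()]
    return kilograms / WEIGHT_FACTORS_TO_KG[to_unit.lower()]

# Example direct invocation: convert_weight.invoke(
#     {"value": 150, "from_unit": "lbs", "to_unit": "kg"})  -> ~68.2
```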

Next comes an upgrade: streaming output and a safer execution loop. The agent is modified to accept user input dynamically, stream the LM’s response chunks to the client, and capture tool-call events (tool name, arguments, and tool result). A max-iteration limit prevents the loop from running indefinitely, and the agent raises an error if it hits that ceiling.
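
A minimal sketch of that safer streaming loop, reusing the LangChain primitives from the earlier sketch; the event dictionaries and the iteration ceiling of 5 are assumptions.

```python
from langchain_core.messages import ToolMessage

MAX_ITERATIONS = 5  # illustrative ceiling; the video may use a different value

def stream_agent(messages, llm_with_tools, tools_by_name):
    for _ in range(MAX_ITERATIONS):
        response = None
        for chunk in llm_with_tools.stream(messages):
            response = chunk if response is None else response + chunk  # merge chunks
            if chunk.content:
                yield {"type": "content", "text": chunk.content}  # partial text
        messages.append(response)
        if not response.tool_calls:  # no tool calls left: agent is done
            return
        for call in response.tool_calls:  # surface tool activity to the client
            result = tools_by_name[call["name"]].invoke(call["args"])
            yield {"type": "tool_call", "name": call["name"],
                   "args": call["args"], "result": result}
            messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
    raise RuntimeError("agent exceeded the maximum number of iterations")
```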

The agent then gains more capability by adding additional tools—specifically distance and temperature conversion—so it can handle multi-part queries. With multiple tools registered, the LM can call them sequentially within the same request, producing combined results such as converting kilograms to pounds and kilometers to miles.
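
Registering the extra tools only changes the binding step; the loop itself is unchanged. A sketch, assuming tool functions named after the conversions described:

```python
# Assumed tool identifiers mirroring the description above; only the binding
# and the lookup table change when new tools are added.
tools = [convert_weight, convert_distance, convert_temperature]
llm_with_tools = ChatOllama(model="llama3.2").bind_tools(tools)
tools_by_name = {t.name: t for t in tools}
```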

To make the system production-ready, observability is added using MLflow. The setup configures an MLflow tracking URI, creates an experiment (e.g., “unit conversion agent”), and records traces that show the tool list, system prompt, tool calls, tool outputs, and the final agent response. This makes it possible to inspect what happened “under the hood” during development and after deployment.
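
A plausible version of that setup, assuming a local MLflow tracking server; `mlflow.langchain.autolog()` is MLflow's tracing hook for LangChain calls, and the URI here is an assumption.

```python
import mlflow

# Assumed local tracking server; production would point at a hosted URI.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("unit conversion agent")

# Automatically records traces for LangChain invocations: the system prompt,
# each tool call with its arguments, tool outputs, and the final agent response.
mlflow.langchain.autolog()
```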

Finally, the agent is wrapped in a FastAPI service. An asynchronous “ask” endpoint streams results back to clients using Server-Sent Events (text/event-stream). A response generator converts the agent’s content chunks and tool execution records into JSON payloads so clients can render both the evolving text and the underlying tool activity. Environment variables control the model name and provider so local development can use Ollama while production swaps to Gemini 2.5 Flash.
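
A sketch of such an endpoint, assuming a hypothetical `stream_agent_events(question)` generator that yields the content and tool-call dictionaries from the streaming sketch above; the request schema and event framing are illustrative.

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
async def ask(request: AskRequest):
    def sse_events():
        # stream_agent_events is a hypothetical wrapper around the agent loop
        # yielding {"type": "content", ...} and {"type": "tool_call", ...} dicts.
        for event in stream_agent_events(request.question):
            yield f"data: {json.dumps(event)}\n\n"  # SSE framing: data: <json>\n\n
    return StreamingResponse(sse_events(), media_type="text/event-stream")
```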

Packaging and deployment complete the pipeline. A multi-stage Docker build reduces the shipped image size by separating dependency compilation from runtime. Docker Compose runs the container locally with host access settings, health checks hit the root endpoint every 30 seconds, and the same container is deployed to Render as a Docker web service. In production, the API behaves the same but uses Gemini 2.5 Flash, delivering faster responses while preserving the streaming interface and tool-call transparency.
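
A hedged sketch of what the multi-stage build could look like; the base images, file names, and uvicorn entry point are assumptions rather than the video's exact Dockerfile.

```dockerfile
# Stage 1: compile and install dependencies in a throwaway layer.
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: copy only the installed packages and app code into the runtime image.
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```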

Cornell Notes

The project builds a tool-using AI agent that converts units, then turns it into a streaming REST API and deploys it in Docker. The agent works as a loop: an LM reads the user request, decides whether to stop or call one of several conversion tools, and then uses tool outputs to generate the final answer. It starts with a single weight-conversion tool, then adds streaming responses, a max-iteration guard, and additional tools for distance and temperature so one query can trigger multiple tool calls. MLflow tracing records prompts, tool calls, and results for debugging and monitoring. FastAPI exposes an async SSE endpoint that streams both text chunks and tool execution details, with environment variables switching models from local Ollama to Gemini 2.5 Flash in production.

How does the agent decide when to call a tool versus when to finish the response?

The agent runs an iterative loop where each cycle sends the current chat history to the LM. The LM either returns a “stop” decision or emits a tool call. When a tool call is present, the code invokes the named tool with the provided arguments, appends the tool result back into the history, and continues. When no tool calls remain, the loop ends and the LM’s final response is returned to the user.

What does a single conversion tool look like in this setup, and how is it used by the LM?

A conversion tool is implemented as a LangChain tool function that accepts inputs such as a numeric value plus source and target units. For weight conversion, the tool performs the calculation using standard factors (the transcript cites 1 kg ≈ 2.2 lb). The LM is bound to the tool via tool binding, then the LM emits a tool call (e.g., convert weight from kilograms to pounds), receives the tool’s computed output, and formats the final answer.

Why add streaming and a max-iteration limit to the agent loop?

Streaming improves user experience by sending partial text chunks as they are generated, rather than waiting for the full completion. The max-iteration limit prevents runaway behavior in the tool-calling loop; if the agent keeps requesting tool calls beyond the allowed number of iterations, execution stops with an error. The implementation also captures tool-call events during streaming so clients can see tool names, arguments, and results.

How does the agent handle multi-part queries like converting both distance and weight in one request?

Additional tools are registered (e.g., convert distance and convert temperature alongside convert weight). With multiple tools available, the LM can emit multiple tool calls in sequence for a single user query. The transcript’s example shows two tool calls occurring back-to-back—one for weight and one for distance—followed by a combined final response that includes both conversions.

What does MLflow tracing add, and what specific details can be inspected?

MLflow integration records an experiment (named “unit conversion agent”) and produces traces that include the available tools and their descriptions/arguments, the agent’s system prompt, each tool call the LM made, the tool outputs, and the final agent response. This allows developers to verify tool selection and see the full execution path in both development and production contexts.

How does the FastAPI REST layer stream agent output to clients?

FastAPI exposes an async “ask” endpoint that returns a streaming response using Server-Sent Events (media type text/event-stream). A response generator converts the agent’s content chunks and tool execution records into JSON events. The client receives incremental text chunks plus structured tool-call information (tool name, arguments, result formatted for display), enabling real-time updates while the agent runs.

Review Questions

  1. What changes are required to go from a single-tool agent to a multi-tool agent that can handle combined unit conversions in one request?
  2. How do streaming responses and max-iteration limits work together to make an agent safer and more user-friendly?
  3. Where in the stack do MLflow traces capture tool calls, and how would you use those traces to debug incorrect conversions?

Key Points

  1. An agent loop lets an LM decide between stopping and calling tools, then uses tool outputs to produce the final answer.

  2. Conversion tools are implemented as LangChain tools that take numeric values plus source/target units and apply standard conversion factors.

  3. Streaming is added by emitting LM content chunks incrementally and capturing tool-call events (tool name, arguments, result) during execution.

  4. A max-iteration guard prevents infinite tool-calling loops and forces a controlled failure when limits are reached.

  5. MLflow tracing provides end-to-end visibility into prompts, tool selection, tool arguments, tool outputs, and final responses via recorded traces.

  6. FastAPI wraps the agent in an async SSE endpoint so clients receive incremental JSON events rather than a single blocking response.

  7. Multi-stage Docker builds and environment-variable model switching enable the same API to run locally with Ollama and in production with Gemini 2.5 Flash.

Highlights

The agent’s core mechanism is a stop-or-tool-call loop: the LM requests tool executions, tool results are fed back, and the LM then finalizes the response.
Streaming isn’t just text—tool-call details are also emitted so clients can observe what the agent did and why.
MLflow traces capture the full execution path, including the system prompt, tool arguments, tool outputs, and the final response.
FastAPI uses Server-Sent Events (text/event-stream) to stream both content chunks and structured tool execution JSON to the client.
A multi-stage Docker build plus Render deployment turns the local agent into a production-ready service with model/provider controlled by environment variables.

Topics

Mentioned

  • LM
  • REST
  • API
  • MLflow
  • SSE
  • FastAPI
  • Docker
  • JSON
  • UI
  • URI