Build AI Agent from 0 to Production Deployment | LangChain, Ollama, MLflow & Docker (Full Tutorial)
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A unit-conversion AI agent can be built end-to-end—from a single-tool loop to a streaming REST API—then packaged into a Docker container and deployed to production. The practical payoff is a working service where one request like “convert 10 mi to kilometers and 150 lbs to kilograms” triggers multiple tool calls (distance and weight) and returns the final converted values in a single response.
The build starts with a clear definition of an agent: a language model (LM) that has access to an environment plus a set of callable tools. In this setup, the LM receives the user’s question (optionally with chat history) and decides whether it should stop or call one or more tools. Tool outputs are fed back into the LM, which then produces the final answer. The agent’s control flow is implemented as a loop that keeps running until the LM indicates completion.
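The control flow described above can be sketched framework-agnostically. In the video this is built with LangChain components; here `llm_step` and the `tools` registry are hypothetical stand-ins so the loop logic is visible on its own:

```python
# Minimal agent loop: the LM either returns a final answer or requests tool
# calls. `llm_step` and `tools` are illustrative stand-ins for LangChain's
# chat model and tool functions, not the video's exact code.

def run_agent(question, llm_step, tools, max_iterations=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iterations):
        reply = llm_step(messages)           # LM decides: stop or call tools
        if not reply.get("tool_calls"):      # no tool calls -> final answer
            return reply["content"]
        for call in reply["tool_calls"]:     # run each requested tool
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})  # feed output back to LM
    raise RuntimeError("Agent exceeded max iterations")
```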
The first milestone is a rudimentary “single tool” agent for weight conversion. A conversion tool is defined as a LangChain tool that takes inputs like a numeric value plus source and target units, then performs the conversion using standard factors (for example, 1 kg ≈ 2.2 lb). A system prompt instructs the LM to use the provided tools to complete the task. During execution, the LM emits a tool call, the tool returns the computed result, and the LM formats the final response.
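The conversion logic itself is ordinary Python; in the video the function is registered as a LangChain tool so the LM can call it. A plain sketch of the weight converter (unit names and rounding are illustrative choices, not the video's exact code):

```python
# Weight conversion via a common base unit (kilograms). In the LangChain
# version this function would be wrapped with the @tool decorator so the
# LM can emit a tool call targeting it.

WEIGHT_IN_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237, "lbs": 0.45359237}

def convert_weight(value: float, from_unit: str, to_unit: str) -> float:
    """Convert `value` from `from_unit` to `to_unit` using standard factors."""
    kilograms = value * WEIGHT_IN_KG[from_unit.lower()]
    return round(kilograms / WEIGHT_IN_KG[to_unit.lower()], 4)
```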
Next comes an upgrade: streaming output and a safer execution loop. The agent is modified to accept user input dynamically, stream the LM’s response chunks to the client, and capture tool-call events (tool name, arguments, and tool result). A max-iteration limit prevents the loop from running indefinitely, and the agent raises an error if it hits that ceiling.
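The streaming variant can be sketched as a generator that yields both text chunks and tool-call records, and raises once the iteration ceiling is hit. Event shapes and names here are assumptions for illustration:

```python
# Streaming agent loop: yields ("content", text) for response chunks and
# ("tool", record) for tool executions. `llm_stream` is a hypothetical
# stand-in for the LM's streaming interface.

def stream_agent(question, llm_stream, tools, max_iterations=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iterations):
        tool_calls = []
        for event in llm_stream(messages):       # LM emits chunks + tool calls
            if event["type"] == "content":
                yield ("content", event["text"])  # stream text immediately
            else:
                tool_calls.append(event)
        if not tool_calls:                        # no tool calls -> done
            return
        for call in tool_calls:                   # run tools, record results
            result = tools[call["name"]](**call["args"])
            yield ("tool", {"name": call["name"], "args": call["args"],
                            "result": result})
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})
    raise RuntimeError("Agent stopped: max iterations reached")
```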
The agent then gains capability through additional tools—distance and temperature conversion—so it can handle multi-part queries. With multiple tools registered, the LM can call them sequentially within the same request, producing combined results such as converting kilograms to pounds and kilometers to miles.
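The extra tools follow the same pattern as the weight converter, with one wrinkle: temperature conversion needs an offset, not just a factor, so it gets its own function. A hedged sketch with illustrative names:

```python
# Distance uses a linear factor via kilometers; temperature converts through
# Celsius because Fahrenheit/Kelvin involve offsets. The registry is what the
# agent loop dispatches on when the LM names a tool.

KM_PER_UNIT = {"km": 1.0, "mi": 1.609344, "m": 0.001}

def convert_distance(value, from_unit, to_unit):
    return round(value * KM_PER_UNIT[from_unit] / KM_PER_UNIT[to_unit], 4)

def convert_temperature(value, from_unit, to_unit):
    celsius = {"c": value, "f": (value - 32) * 5 / 9, "k": value - 273.15}[from_unit]
    return round({"c": celsius, "f": celsius * 9 / 5 + 32, "k": celsius + 273.15}[to_unit], 4)

TOOLS = {"convert_distance": convert_distance,
         "convert_temperature": convert_temperature}
```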
To make the system production-ready, observability is added using MLflow. The setup configures an MLflow tracking URI, creates an experiment (e.g., “unit conversion agent”), and records traces that show the tool list, system prompt, tool calls, tool outputs, and the final agent response. This makes it possible to inspect what happened “under the hood” during development and after deployment.
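The MLflow side amounts to a few lines of configuration. A sketch under assumptions (the tracking URI and experiment name are placeholders; `mlflow.langchain.autolog()` is MLflow's automatic tracing hook for LangChain):

```python
import mlflow

# Point MLflow at a tracking server and name the experiment (values illustrative).
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("unit-conversion-agent")

# Automatically record traces of LangChain calls: system prompt, tool list,
# tool calls with arguments, tool outputs, and the final agent response.
mlflow.langchain.autolog()
```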
Finally, the agent is wrapped in a FastAPI service. An asynchronous “ask” endpoint streams results back to clients using Server-Sent Events (text/event-stream). A response generator converts the agent’s content chunks and tool execution records into JSON payloads so clients can render both the evolving text and the underlying tool activity. Environment variables control the model name and provider so local development can use Ollama while production swaps to Gemini 2.5 Flash.
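The response generator's job is mechanical: wrap each agent event as an SSE `data:` frame containing JSON. A stdlib-only sketch (the field names are assumptions, not the video's exact schema); in FastAPI this generator would feed a `StreamingResponse` with `media_type="text/event-stream"`:

```python
import json

def sse_events(agent_events):
    """Turn (kind, payload) agent events into Server-Sent Events frames."""
    for kind, payload in agent_events:
        if kind == "content":
            body = {"type": "content", "text": payload}
        else:  # tool execution record: name, arguments, result
            body = {"type": "tool", **payload}
        yield f"data: {json.dumps(body)}\n\n"  # SSE frame: data line + blank line
```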
Packaging and deployment complete the pipeline. A multi-stage Docker build reduces the shipped image size by separating dependency compilation from runtime. Docker Compose runs the container locally with host access settings, health checks hit the root endpoint every 30 seconds, and the same container is deployed to Render as a Docker web service. In production, the API behaves the same but uses Gemini 2.5 Flash, delivering faster responses while preserving the streaming interface and tool-call transparency.
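A multi-stage Dockerfile along these lines keeps build tooling out of the runtime image. This is a hedged sketch: the base image, paths, and the `main:app` module are assumptions, not the video's exact file:

```dockerfile
# Stage 1: install dependencies into a virtual environment
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir -r requirements.txt

# Stage 2: copy only the venv and app code into a clean runtime image
FROM python:3.12-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Because the final stage starts from a fresh base image, compilers and pip caches from the builder stage never reach the shipped image.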
Cornell Notes
The project builds a tool-using AI agent that converts units, then turns it into a streaming REST API and deploys it in Docker. The agent works as a loop: an LM reads the user request, decides whether to stop or call one of several conversion tools, and then uses tool outputs to generate the final answer. It starts with a single weight-conversion tool, then adds streaming responses, a max-iteration guard, and additional tools for distance and temperature so one query can trigger multiple tool calls. MLflow tracing records prompts, tool calls, and results for debugging and monitoring. FastAPI exposes an async SSE endpoint that streams both text chunks and tool execution details, with environment variables switching models from local Ollama to Gemini 2.5 Flash in production.
- How does the agent decide when to call a tool versus when to finish the response?
- What does a single conversion tool look like in this setup, and how is it used by the LM?
- Why add streaming and a max-iteration limit to the agent loop?
- How does the agent handle multi-part queries like converting both distance and weight in one request?
- What does MLflow tracing add, and what specific details can be inspected?
- How does the FastAPI REST layer stream agent output to clients?
Review Questions
- What changes are required to go from a single-tool agent to a multi-tool agent that can handle combined unit conversions in one request?
- How do streaming responses and max-iteration limits work together to make an agent safer and more user-friendly?
- Where in the stack do MLflow traces capture tool calls, and how would you use those traces to debug incorrect conversions?
Key Points
1. An agent loop lets an LM decide between stopping and calling tools, then uses tool outputs to produce the final answer.
2. Conversion tools are implemented as LangChain tools that take numeric values plus source/target units and apply standard conversion factors.
3. Streaming is added by emitting LM content chunks incrementally and capturing tool-call events (tool name, arguments, result) during execution.
4. A max-iteration guard prevents infinite tool-calling loops and forces a controlled failure when limits are reached.
5. MLflow tracing provides end-to-end visibility into prompts, tool selection, tool arguments, tool outputs, and final responses via recorded traces.
6. FastAPI wraps the agent in an async SSE endpoint so clients receive incremental JSON events rather than a single blocking response.
7. Multi-stage Docker builds and environment-variable model switching enable the same API to run locally with Ollama and in production with Gemini 2.5 Flash.