LangChain Series Part 3: Production-Grade Deployment of an LLM as an API with LangChain and FastAPI
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Build an API-first layer for LLMs so web/mobile apps can integrate via stable HTTP routes instead of embedding model logic.
Briefing
Production-grade LLM deployment starts with turning model calls into stable HTTP APIs. This walkthrough builds a LangChain + LangServe backend that exposes multiple routes—one wired to OpenAI’s chat model and another wired to a local Llama 2 via Ollama—then pairs it with a Streamlit client that hits those endpoints. The practical payoff is straightforward: once the API layer exists, any web, mobile, or edge app can integrate with different LLMs through consistent URLs and request/response schemas.
The core architecture is an API “router” sitting between applications (web/mobile) and one or more LLM backends. Routes are defined so each endpoint can use a specific model and a specific prompt template. That matters because LLMs differ in strengths and performance across tasks, so the same product can route different user intents to different models. In this example, the backend defines two prompt templates: one instructs the OpenAI chat model to “write me an essay” about a provided topic (constrained to about 100 words), while the other instructs the Llama 2 model to “write me a poem” about a provided topic (also about 100 words). Each template is bound to its corresponding model when registering routes.
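As a rough sketch (the wording is paraphrased from the description above rather than copied from the video), the two task-specific templates could be defined like this:

```python
from langchain_core.prompts import ChatPromptTemplate

# Each template takes a single "topic" variable and constrains the length.
essay_prompt = ChatPromptTemplate.from_template(
    "Write me an essay about {topic} in about 100 words."
)
poem_prompt = ChatPromptTemplate.from_template(
    "Write me a poem about {topic} in about 100 words."
)
```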
On the backend side, the setup installs dependencies including LangServe, FastAPI, and Uvicorn. The FastAPI app is created with metadata (title, version, description), then LangServe’s add_routes is used to register endpoints. The code loads environment variables (notably the OpenAI API key) and initializes two model objects: ChatOpenAI for OpenAI and Ollama (via langchain_community.llms) for Llama 2. After routes are added, the service is run on localhost:8000. A key operational feature appears immediately: visiting /docs provides Swagger UI with input/output schemas for each route, making the API self-documenting and easier to test.
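A minimal sketch of such a backend is shown below. The file name, the route paths (/essay and /poem), and the Ollama model tag are assumptions for illustration, and import paths can differ across LangChain versions:

```python
# app.py - LangServe backend exposing two model-specific routes
import uvicorn
from dotenv import load_dotenv
from fastapi import FastAPI
from langserve import add_routes
from langchain_openai import ChatOpenAI
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate

load_dotenv()  # expects OPENAI_API_KEY in a local .env file

app = FastAPI(
    title="Langchain Server",
    version="1.0",
    description="A simple API server",
)

openai_model = ChatOpenAI()           # OpenAI chat model
llama_model = Ollama(model="llama2")  # local Llama 2 served by Ollama

essay_prompt = ChatPromptTemplate.from_template(
    "Write me an essay about {topic} in about 100 words."
)
poem_prompt = ChatPromptTemplate.from_template(
    "Write me a poem about {topic} in about 100 words."
)

# Each route binds one prompt template to one model.
add_routes(app, essay_prompt | openai_model, path="/essay")
add_routes(app, poem_prompt | llama_model, path="/poem")

if __name__ == "__main__":
    uvicorn.run(app, host="localhost", port=8000)
```

With the server running, http://localhost:8000/docs lists both routes with their request and response schemas.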
After the API layer is live, the client side demonstrates integration. A Streamlit app (client.py) uses requests to POST JSON payloads to the LangServe invoke URLs documented in Swagger UI. One function targets the essay endpoint (OpenAI route) and extracts the returned text from the JSON response (reading the output and content fields). A second function targets the poem endpoint (Llama 2 route). Streamlit then presents two text boxes, one for essay topics and one for poem topics, so each user input is sent to the correct backend route and the generated text is displayed.
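A client sketch in that spirit follows; the route paths match the backend sketch above, and the JSON shapes assume LangServe's default invoke contract:

```python
# client.py - Streamlit front end that calls the LangServe routes
import requests
import streamlit as st


def get_openai_response(topic: str) -> str:
    # The chat-model route returns a message object, so the text is under output.content.
    response = requests.post(
        "http://localhost:8000/essay/invoke",
        json={"input": {"topic": topic}},
    )
    return response.json()["output"]["content"]


def get_ollama_response(topic: str) -> str:
    # The Ollama LLM route returns plain text directly under output.
    response = requests.post(
        "http://localhost:8000/poem/invoke",
        json={"input": {"topic": topic}},
    )
    return response.json()["output"]


st.title("LangChain Demo with LangServe API")
essay_topic = st.text_input("Write an essay on")
poem_topic = st.text_input("Write a poem on")

if essay_topic:
    st.write(get_openai_response(essay_topic))
if poem_topic:
    st.write(get_ollama_response(poem_topic))
```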
The result is a clear first phase of deployment: model functionality becomes callable services with predictable endpoints, documentation, and schema validation. From there, the same API can be deployed to any server or cloud environment, while front ends (web, mobile, or edge) simply consume the API rather than embedding model logic directly.
Cornell Notes
The workflow turns LLM functionality into production-ready HTTP APIs using LangChain’s LangServe with FastAPI. Two separate routes are created: one route uses ChatOpenAI with a prompt template for generating short essays, and another route uses Llama 2 served through Ollama with a prompt template for generating short poems. Each route is registered with LangServe so the service exposes consistent invoke URLs and automatically generates Swagger UI documentation at /docs. A Streamlit client then calls those endpoints via HTTP POST, sending user-provided topics in JSON and rendering the returned text. This API-first approach makes it easier to integrate multiple LLMs into any web or mobile app without rewriting model logic.
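One practical way to confirm that a client payload matches a route's expected input is to read the same OpenAPI document that backs Swagger UI. This is a small sketch, assuming the /essay route from the backend sketch and FastAPI's standard /openapi.json endpoint:

```python
import requests

# FastAPI serves the schema that powers Swagger UI at /openapi.json.
spec = requests.get("http://localhost:8000/openapi.json").json()

# Inspect the request-body schema for the essay route's invoke endpoint
# (path name assumed; adjust to whatever add_routes registered).
invoke_op = spec["paths"]["/essay/invoke"]["post"]
print(invoke_op["requestBody"]["content"]["application/json"]["schema"])
```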
Why create an API layer before deploying an LLM-powered application?
How do multiple models get used without changing the client logic?
What role do prompt templates play in the route design?
How does Swagger UI help during deployment and integration?
How does the Streamlit client call the LangServe endpoints?
Review Questions
- What changes in the backend if you want to add a third route that uses a different LLM for a new task (e.g., summarization)?
- How would you verify that the JSON payload you send from Streamlit matches the input schema shown in Swagger UI?
- Why might you choose different prompt templates for the same model versus using different models for the same prompt?
Key Points
1. Build an API-first layer for LLMs so web/mobile apps can integrate via stable HTTP routes instead of embedding model logic.
2. Use LangServe with FastAPI to register model-specific routes and automatically generate Swagger UI documentation at /docs.
3. Bind each route to both a specific model (ChatOpenAI or Ollama/Llama 2) and a task-specific prompt template (essay vs. poem).
4. Store secrets like the OpenAI API key in environment variables and load them in the backend before starting the server.
5. Expose consistent invoke endpoints and have clients call them with JSON payloads containing user inputs (e.g., topic).
6. Use a lightweight front end like Streamlit to test end-to-end integration by sending requests to the correct route and rendering the returned text.