
3-Langchain Series-Production Grade Deployment LLM As API With Langchain And FastAPI

Krish Naik · 4 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Build an API-first layer for LLMs so web/mobile apps can integrate via stable HTTP routes instead of embedding model logic.

Briefing

Production-grade LLM deployment starts with turning model calls into stable HTTP APIs. This walkthrough builds a LangChain + LangServe backend that exposes multiple routes—one wired to OpenAI’s chat model and another wired to a local Llama 2 via Ollama—then pairs it with a Streamlit client that hits those endpoints. The practical payoff is straightforward: once the API layer exists, any web, mobile, or edge app can integrate with different LLMs through consistent URLs and request/response schemas.

The core architecture is an API “router” sitting between applications (web/mobile) and one or more LLM backends. Routes are defined so each endpoint can use a specific model and a specific prompt template. That matters because LLMs differ in strengths and performance across tasks, so the same product can route different user intents to different models. In this example, the backend defines two prompt templates: one instructs the OpenAI chat model to “write me an essay” about a provided topic (constrained to about 100 words), while the other instructs the Llama 2 model to “write me a poem” about a provided topic (also about 100 words). Each template is bound to its corresponding model when registering routes.

On the backend side, the setup installs dependencies including LangServe, FastAPI, and Uvicorn. The FastAPI app is created with metadata (title, version, description), then LangServe’s add_routes is used to register endpoints. The code loads environment variables (notably the OpenAI API key) and initializes two model objects: ChatOpenAI for OpenAI and Ollama (via langchain_community.llms) for Llama 2. After routes are added, the service is run on localhost:8000. A key operational feature appears immediately: visiting /docs provides Swagger UI with input/output schemas for each route, making the API self-documenting and easier to test.

After the API layer is live, the client side demonstrates integration. A Streamlit app (client.py) uses requests to POST JSON payloads to the LangServe invoke URLs exposed by Swagger UI. One function targets the essay endpoint (OpenAI route) and extracts the returned text from the JSON response (using the output/content fields). A second function targets the poem endpoint (Llama 2 route). Streamlit then presents two text boxes—one for essay topics and one for poem topics—so user input is sent to the correct backend route and the generated text is displayed.
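A minimal sketch of that client follows. The URLs, helper names, and payload shape are assumptions based on LangServe's standard `/invoke` convention (inputs are wrapped under an `"input"` key); it presumes `requests` and `streamlit` are installed and the backend above is running:

```python
# client.py — sketch of the Streamlit client for the two LangServe routes.
import requests

ESSAY_URL = "http://localhost:8000/essay/invoke"
POEM_URL = "http://localhost:8000/poem/invoke"


def get_openai_response(topic: str) -> str:
    """POST the topic to the essay route and extract the generated text."""
    resp = requests.post(ESSAY_URL, json={"input": {"topic": topic}})
    # Chat models return a message object, so the text lives under output.content.
    return resp.json()["output"]["content"]


def get_ollama_response(topic: str) -> str:
    """POST the topic to the poem route and extract the generated text."""
    resp = requests.post(POEM_URL, json={"input": {"topic": topic}})
    # Plain LLMs return the text directly as the output field.
    return resp.json()["output"]


if __name__ == "__main__":
    import streamlit as st

    st.title("LangChain demo with OpenAI and Llama 2")
    essay_topic = st.text_input("Write an essay on")
    poem_topic = st.text_input("Write a poem on")

    if essay_topic:
        st.write(get_openai_response(essay_topic))
    if poem_topic:
        st.write(get_ollama_response(poem_topic))
```

The client never touches model logic: swapping the poem route to a different LLM on the backend requires no change here beyond, at most, the URL.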

The result is a clear first phase of deployment: model functionality becomes callable services with predictable endpoints, documentation, and schema validation. From there, the same API can be deployed to any server or cloud environment, while front ends (web, mobile, or edge) simply consume the API rather than embedding model logic directly.

Cornell Notes

The workflow turns LLM functionality into production-ready HTTP APIs using LangChain’s LangServe with FastAPI. Two separate routes are created: one route uses ChatOpenAI with a prompt template for generating short essays, and another route uses Llama 2 served through Ollama with a prompt template for generating short poems. Each route is registered with LangServe so the service exposes consistent invoke URLs and automatically generates Swagger UI documentation at /docs. A Streamlit client then calls those endpoints via HTTP POST, sending user-provided topics in JSON and rendering the returned text. This API-first approach makes it easier to integrate multiple LLMs into any web or mobile app without rewriting model logic.

Why create an API layer before deploying an LLM-powered application?

Because applications (web, mobile, desktop) need a stable interface for model capabilities. By exposing LLM actions as HTTP routes, the front end can call predictable URLs with JSON inputs. In this setup, LangServe registers routes in a FastAPI app, and Swagger UI at /docs shows the input/output schemas, which simplifies integration and testing. It also enables swapping or adding models later by changing route bindings rather than rewriting the client.

How do multiple models get used without changing the client logic?

Each model is bound to a different route. The backend initializes ChatOpenAI for OpenAI-based essay generation and Ollama (Llama 2) for poem generation. LangServe’s add_routes connects each route path to the correct model and prompt template. The Streamlit client calls the corresponding invoke URL for each route, so the client stays simple while the backend handles model selection.

What role do prompt templates play in the route design?

Prompt templates define the instruction format and constraints for each task. One template is used with ChatOpenAI to request an essay about a given topic (around 100 words). A second template is used with Llama 2 to request a poem about a given topic (also around 100 words). Binding templates to routes ensures that the same endpoint always performs the intended task style.

How does Swagger UI help during deployment and integration?

Swagger UI at /docs provides the API’s interactive documentation, including input schema and output schema for each route (e.g., essay and poem). That makes it easier to find the correct invoke URLs and understand what JSON payload fields are required. It also reduces guesswork when building the client that sends requests to the API.

How does the Streamlit client call the LangServe endpoints?

The client uses requests to POST JSON to the invoke URLs shown in Swagger UI. For example, it posts a payload like {"input": {"topic": "machine learning"}} to the essay route's invoke URL, then reads the generated text out of the response JSON using the output/content fields.
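As a rough illustration of the response handling (the two response shapes below follow LangServe's invoke convention for chat models versus plain LLMs; the sample texts are placeholders):

```python
import json

# Example response shapes returned by LangServe /invoke endpoints.
# Chat models (the OpenAI essay route) wrap the text in a message object:
essay_response = json.loads('{"output": {"content": "An essay about AI..."}}')
# Plain LLMs (the Llama 2 poem route) return the text directly:
poem_response = json.loads('{"output": "A poem about AI..."}')


def extract_text(payload: dict) -> str:
    """Return the generated text from either response shape."""
    output = payload["output"]
    return output["content"] if isinstance(output, dict) else output


print(extract_text(essay_response))  # An essay about AI...
print(extract_text(poem_response))   # A poem about AI...
```

Handling both shapes in one helper keeps the client agnostic to which kind of model backs a given route.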

Review Questions

  1. What changes in the backend if you want to add a third route that uses a different LLM for a new task (e.g., summarization)?
  2. How would you verify that the JSON payload you send from Streamlit matches the input schema shown in Swagger UI?
  3. Why might you choose different prompt templates for the same model versus using different models for the same prompt?

Key Points

  1. Build an API-first layer for LLMs so web/mobile apps can integrate via stable HTTP routes instead of embedding model logic.
  2. Use LangServe with FastAPI to register model-specific routes and automatically generate Swagger UI documentation at /docs.
  3. Bind each route to both a specific model (ChatOpenAI or Ollama/Llama 2) and a task-specific prompt template (essay vs poem).
  4. Store secrets like the OpenAI API key in environment variables and load them in the backend before starting the server.
  5. Expose consistent invoke endpoints and have clients call them with JSON payloads containing user inputs (e.g., topic).
  6. Use a lightweight front end like Streamlit to test end-to-end integration by sending requests to the correct route and rendering the returned text.

Highlights

LangServe + FastAPI turns LLM calls into documented HTTP endpoints, with Swagger UI showing input/output schemas for each route.
Two different LLM backends—OpenAI (ChatOpenAI) and local Llama 2 via Ollama—are selected purely by route wiring, not by changing the client’s overall approach.
Prompt templates are attached to routes so each endpoint reliably produces the intended output style (essay vs poem).
A Streamlit app can act as a practical client by POSTing JSON to the LangServe invoke URLs and extracting the generated text from the response JSON.

Topics

Mentioned

  • Krish Naik
  • LLM
  • API
  • Llama
  • UI
  • SSE
  • JSON
  • HTTP
  • /docs
  • POST