
How to Deploy LLMs | LLMOps Stack with vLLM, Docker, Grafana & MLflow

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

vLLM is used to serve the LLM with higher throughput and paged attention for more efficient GPU memory utilization.

Briefing

Running an LLM locally is only half the job; production needs concurrency, security, monitoring, and a way to detect failures. A practical LLMOps stack built around vLLM delivers those pieces by serving a quantized Hugging Face model behind an Nginx reverse proxy, instrumenting it with Prometheus + Grafana dashboards, and tracking live inference runs in MLflow—then deploying the whole setup on Vast AI via Docker Compose for low hourly cost.

At the core sits vLLM, chosen for higher throughput and efficient GPU serving. The deployment uses a Qwen model from Hugging Face (the transcript's garbled "quentry" / "quint3 4 billion parameter instruct" evidently refers to the Qwen3 4-billion-parameter instruct version). A key performance feature is paged attention, a memory-efficiency technique likened to Tetris-style space packing that lets the system fit more of the model into limited GPU memory. The stack also constrains runtime risk: it limits GPU memory utilization to 90% and caps the context window at 4,096 tokens, reducing the chance of out-of-memory crashes. For gated models, a Hugging Face token is passed so vLLM can download weights automatically.
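
As a concrete illustration, here is a minimal sketch of those settings using vLLM's Python API (the model ID is an assumption based on the transcript; the same options exist as flags on vLLM's OpenAI-compatible server):

```python
from vllm import LLM, SamplingParams

# Assumed Hugging Face ID for the Qwen3 4B instruct model named in the video.
# For gated models, set HF_TOKEN in the environment so weights download automatically.
llm = LLM(
    model="Qwen/Qwen3-4B-Instruct-2507",
    gpu_memory_utilization=0.9,  # leave ~10% of GPU memory as headroom against OOM
    max_model_len=4096,          # cap the context window, as described in the video
)

outputs = llm.generate(["What is paged attention?"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```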

Security and endpoint control come next. Nginx acts as a reverse proxy that sits in front of the vLLM API, enforcing authorization so only requests with the expected token reach the inference service. The transcript notes that the example uses a simple token check (with a warning to replace it with stronger production-grade auth such as OAuth or similar).
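
From the client's perspective, the token travels in the Authorization header. A hedged sketch, with host, path, and token source as placeholders rather than details from the video:

```python
import os

import requests

PROXY_URL = "http://<instance-ip>/v1/chat/completions"  # Nginx proxies this to vLLM
API_TOKEN = os.environ["API_TOKEN"]  # hypothetical env var holding the expected token

resp = requests.post(
    PROXY_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "model": "Qwen/Qwen3-4B-Instruct-2507",  # assumed model ID
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()  # the proxy rejects the request here if the token is wrong
print(resp.json()["choices"][0]["message"]["content"])
```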

Observability is handled with Prometheus and Grafana. Prometheus scrapes the vLLM metrics endpoint every five seconds (the transcript references vLLM’s default metrics endpoint on port 8000) and stores those time-series metrics. Grafana then visualizes them through dashboards imported from a provided configuration, focusing on latency and throughput indicators such as time to first token, end-to-end request latency, token generation/completion counts, and token throughput. The monitoring window is shown as the last 15 minutes, with the option to extend.
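
A quick way to see what Prometheus is scraping is to fetch the raw metrics endpoint directly; a small sketch, assuming the default port 8000 and vLLM's standard Prometheus metric names:

```python
import requests

# vLLM exposes Prometheus-format metrics at /metrics on its API port (8000 by default).
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text

# The Grafana dashboard plots series like these: time to first token,
# end-to-end request latency, and generated-token counts/throughput.
watched = ("vllm:time_to_first_token", "vllm:e2e_request_latency", "vllm:generation_tokens")
for line in metrics.splitlines():
    if line.startswith(watched):
        print(line)
```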

Reliability is addressed with a health check loop. The vLLM container runs a periodic check every 30 seconds by calling the vLLM health endpoint with a 50-second timeout, delayed by a start period of 300 seconds to allow model download and initialization.
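
In Docker Compose this is expressed as a healthcheck; the same behavior can be sketched in Python to show what the loop does (assuming vLLM's /health endpoint on port 8000):

```python
import time

import requests

HEALTH_URL = "http://localhost:8000/health"  # vLLM returns HTTP 200 once it is ready

time.sleep(300)  # start period: allow time for model download and initialization
while True:
    try:
        healthy = requests.get(HEALTH_URL, timeout=50).status_code == 200
    except requests.RequestException:
        healthy = False
    print("healthy" if healthy else "unhealthy")
    time.sleep(30)  # check interval from the transcript
```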

For experiment tracking, MLflow is added to observe inference quality and performance at the conversation level. The workflow uses LangChain to call the vLLM service through an OpenAI-compatible interface (the transcript says vLLM can simulate an OpenAI server). In the demo, each chat request is logged to MLflow with request/response details and token counts, and a new MLflow experiment appears (named “conversation” in the transcript) with a timeline of inputs and outputs.
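
A hedged sketch of that client loop; the endpoint, token, model ID, and MLflow server address are placeholders, and the logged fields follow the transcript's description rather than an exact script from the video:

```python
import mlflow
from langchain_openai import ChatOpenAI

# Point the OpenAI-compatible client at the Nginx-fronted vLLM endpoint (placeholders).
llm = ChatOpenAI(
    base_url="http://<instance-ip>/v1",
    api_key="<api-token>",                # the token the Nginx proxy expects
    model="Qwen/Qwen3-4B-Instruct-2507",  # assumed model ID
)

mlflow.set_tracking_uri("http://localhost:5000")  # assumed MLflow server address
mlflow.set_experiment("conversation")             # experiment name from the transcript

prompt = "Summarize paged attention in one sentence."
with mlflow.start_run():
    response = llm.invoke(prompt)
    mlflow.log_param("request", prompt)
    mlflow.log_text(response.content, "response.txt")
    usage = response.usage_metadata or {}  # token counts reported by the server
    mlflow.log_metric("input_tokens", usage.get("input_tokens", 0))
    mlflow.log_metric("output_tokens", usage.get("output_tokens", 0))
```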

Deployment is orchestrated with a single Docker Compose file and executed on Vast AI. The process includes selecting a GPU instance (the transcript cites an RTX 3090 at about 28 cents per hour), SSH-ing into the provisioned machine, cloning the repository, inserting the Hugging Face token, and running docker compose up to pull images, download model weights, and start the vLLM server. Once running, Prometheus and Grafana begin collecting metrics, and the LangChain + MLflow client can generate requests against the secured public API. The result is an end-to-end LLMOps stack aimed at production readiness: concurrency-friendly serving, basic access control, continuous monitoring, and structured inference logging.

Cornell Notes

The stack turns a local LLM into a production-style service by combining vLLM, Nginx, Prometheus, Grafana, and MLflow. vLLM provides high-throughput inference and efficient GPU memory use via paged attention, while runtime safeguards limit GPU memory utilization to 90% and cap the context window at 4,096 tokens. Nginx sits in front to enforce authorization before requests reach the vLLM API, and a health check endpoint helps detect startup or runtime issues. Prometheus scrapes vLLM metrics every five seconds and Grafana renders dashboards for latency and token throughput. LangChain sends chat requests through an OpenAI-compatible base URL, and MLflow logs each conversation's request/response and token counts for later inspection.

Why is vLLM positioned as the inference engine in this LLMOps stack?

vLLM is used for higher throughput and a production-friendly serving model. The transcript highlights paged attention as the main GPU-memory optimization, likened to Tetris-style packing that improves space efficiency so more of the model can fit on smaller GPUs. It also supports serving models from Hugging Face and can be configured to manage GPU memory usage and context length to reduce crash risk.
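
A toy illustration of the idea, not vLLM's actual implementation (only the block size of 16 matches vLLM's default; everything else is simplified):

```python
BLOCK_SIZE = 16                    # tokens per KV-cache block (vLLM's default)

free_blocks = list(range(64))      # pool of physical cache blocks on the GPU
block_tables = {}                  # sequence id -> list of physical block ids
lengths = {}                       # sequence id -> tokens cached so far

def append_token(seq_id: str) -> None:
    """Grab a new block only when the current one fills up, so short requests
    never reserve a worst-case contiguous slab of KV-cache memory."""
    n = lengths.get(seq_id, 0)
    if n % BLOCK_SIZE == 0:        # first token, or current block is full
        block_tables.setdefault(seq_id, []).append(free_blocks.pop())
    lengths[seq_id] = n + 1

for _ in range(20):                # a 20-token request...
    append_token("request-A")
print(block_tables["request-A"])   # ...occupies just 2 blocks: [63, 62]
```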

How does the stack reduce the chance of GPU-related crashes during inference?

It applies multiple constraints: GPU memory utilization is capped at 90%, and the context window is limited to 4,096 tokens. The transcript also notes that if the GPU still lacks enough memory for the model, no configuration can fully prevent failure. A health check waits 300 seconds before starting periodic checks, giving time for model download and initialization.

What role does Nginx play, and what security caveat is mentioned?

Nginx acts as a reverse proxy and gatekeeper. Requests arriving on the public endpoint must include the expected token (the transcript describes a simple token-based authorization example). The caveat is that this simplistic auth should be replaced in real production deployments with stronger mechanisms such as OAuth or similar approaches.

How do Prometheus and Grafana work together to monitor the running model?

Prometheus scrapes vLLM’s metrics endpoint every five seconds (targeting port 8000 in the configuration described). Grafana uses Prometheus as its data source and imports a dashboard that visualizes key performance metrics—especially time to first token, end-to-end request latency, and token throughput/completion counts. The dashboard is shown with a default lookback of the last 15 minutes.

How does MLflow fit into the inference workflow?

MLflow tracks inference runs at the conversation level. The transcript demonstrates using LangChain to call the model via an OpenAI-compatible interface exposed by the vLLM service, then logging each request/response and token counts into MLflow. Each new chat interaction creates or updates an MLflow experiment (shown as “conversation”), with a timeline of inputs and outputs.

What does the deployment process on Vast AI look like at a high level?

A Vast AI GPU instance is provisioned (the transcript cites a 3090 example). The user SSHs into the machine, clones the repository, inserts the Hugging Face token for gated models, then runs docker compose up to download container images and model weights. After containers start, vLLM begins serving, Prometheus starts scraping metrics, Grafana dashboards become available, and the secured public API can be tested with LangChain + MLflow logging.

Review Questions

  1. Which specific vLLM configuration choices in the transcript are meant to prevent out-of-memory failures, and how do they relate to context length and GPU memory utilization?
  2. How does the stack ensure that only authorized requests reach the inference service, and what production-hardening step is recommended?
  3. What metrics does Grafana visualize in the described dashboard, and how does Prometheus’s scraping interval affect the monitoring granularity?

Key Points

  1. vLLM is used to serve the LLM with higher throughput and paged attention for more efficient GPU memory utilization.
  2. Nginx provides a reverse-proxy layer that enforces token-based authorization before requests reach the vLLM inference API.
  3. Prometheus scrapes vLLM metrics every five seconds, and Grafana turns those metrics into dashboards focused on latency and token throughput.
  4. Health checks call the vLLM health endpoint every 30 seconds after a 300-second startup delay to catch failures early.
  5. GPU stability is improved by capping GPU memory utilization at 90% and limiting the context window to 4,096 tokens.
  6. MLflow logs each conversation's request/response and token counts, enabling experiment tracking alongside live inference.
  7. Docker Compose packages the entire stack for deployment on Vast AI, including model download, container startup, and metrics/observability services.

Highlights

Paged attention is presented as the mechanism that makes vLLM memory usage more efficient—described as Tetris-like packing to fit more model into limited GPU space.
The monitoring loop is concrete: Prometheus scrapes vLLM metrics every five seconds, and Grafana dashboards visualize time-to-first-token and end-to-end latency.
Security is implemented at the edge with Nginx token authorization, with an explicit warning to replace the simple token check with stronger production auth.
The stack combines reliability and observability: a delayed health check plus Prometheus/Grafana dashboards help detect both startup issues and ongoing performance drift.
LangChain calls the vLLM service through an OpenAI-compatible base URL, while MLflow captures each conversation as a tracked experiment.

Topics

  • LLMOps Stack
  • vLLM Serving
  • Docker Compose Deployment
  • Observability
  • MLflow Tracking

Mentioned

  • LLMOps
  • vLLM
  • GPU
  • API
  • MLflow
  • SSH
  • OAuth
  • Tetris
  • Grafana
  • Prometheus
  • Nginx
  • CPU