
Build Generative AI Apps with Docker and Hugging Face's Docker Spaces

Krish Naik · 4 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Build a FastAPI service with a health-check homepage route and a /generate endpoint that calls a Transformers text-generation pipeline.

Briefing

The video lays out an end-to-end path to ship a text-generation app: build a FastAPI service that wraps a Hugging Face Transformers text-generation pipeline, dockerize it with a reproducible Python environment, then deploy it to Hugging Face Spaces so it rebuilds and runs automatically on every commit. The payoff is immediate: once the Space finishes building, users get both a working API endpoint and FastAPI’s Swagger UI to test generation requests in the browser.

The workflow starts with creating a local Python 3.9 virtual environment and installing the core dependencies for an LLM-backed web service. The project uses FastAPI to expose HTTP routes, Transformers for model inference, and PyTorch (torch) as the underlying compute library. A requirements.txt file lists the needed packages, including FastAPI, requests, sentencepiece, torch, and Transformers. After installing the dependencies into the virtual environment, the next step is writing the application code.
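
Reconstructed from the packages named above, a minimal requirements.txt might read as follows; versions are left unpinned since the summary gives none, and uvicorn is added on the assumption that the Dockerfile described below needs it installed to launch the app:

```
fastapi
uvicorn  # assumed: the Dockerfile launches the app with uvicorn
requests
sentencepiece
torch
transformers
```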

The FastAPI app defines a simple homepage route that returns a “hello world” message to confirm the service is alive. It also defines a /generate endpoint that accepts input text and feeds it into a Transformers pipeline configured for text-to-text generation. For the model, the walkthrough uses flan-t5-small, a smaller variant chosen to fit free-tier resource limits on Hugging Face Spaces. The pipeline output is returned as JSON, with the generated text extracted from the generated_text field of the pipeline’s response. FastAPI’s built-in Swagger documentation becomes the interactive control panel once deployed.
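
A minimal app.py consistent with that description might look like the sketch below. The route names follow the summary, but the exact request shape (a text query parameter here) is an assumption:

```python
# app.py - sketch of the service described above.
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# flan-t5-small is a seq2seq model, so the matching pipeline
# task in Transformers is "text2text-generation".
generator = pipeline("text2text-generation", model="google/flan-t5-small")

@app.get("/")
def home():
    # Health-check route: confirms the container is alive.
    return {"message": "hello world"}

@app.get("/generate")
def generate(text: str):
    # Run the prompt through the pipeline; the result is a list
    # of dicts, each carrying a "generated_text" key.
    output = generator(text)
    return {"output": output[0]["generated_text"]}
```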

Dockerization is treated as the central engineering step. A Dockerfile is created from the official Python 3.9 image, sets a working directory, copies requirements.txt into the container, and installs dependencies using pip install with caching disabled. The Dockerfile also creates and switches to a non-root user for safer execution, then copies the application code into the container. Finally, it starts the FastAPI app via uvicorn, binding to 0.0.0.0 on a specified port (the walkthrough uses 7860).
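
A Dockerfile matching that description might look like this sketch; the non-root user, --no-cache-dir install, and port 7860 come from the summary, while the directory layout and user name are illustrative assumptions:

```dockerfile
FROM python:3.9

WORKDIR /code

# Install dependencies first so this layer is cached across
# code-only changes; --no-cache-dir keeps the image smaller.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Create and switch to a non-root user for safer execution.
RUN useradd -m appuser
USER appuser

# Copy the application code into the container.
COPY --chown=appuser app.py .

# Bind to 0.0.0.0 on port 7860, the default port Hugging Face
# Docker Spaces route traffic to.
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```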

Deployment happens through Hugging Face Spaces. A new Space is created using a Docker template, then the app.py, Dockerfile, and requirements.txt are uploaded to the Space repository. Each commit triggers an automated Docker build and container run. If the build succeeds, the homepage endpoint returns “hello world,” confirming the container is executing correctly. From there, users can open a direct URL or navigate to /docs to access Swagger UI and test /generate.
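
One detail worth knowing: a Docker-template Space reads its configuration from YAML front matter in the repository's README.md, which the template generates automatically when the Space is created. A minimal version looks roughly like this (title and emoji are illustrative):

```yaml
---
title: FastAPI Text Generation   # illustrative name
emoji: 🚀
sdk: docker      # tells Spaces to build and run the Dockerfile
app_port: 7860   # must match the port uvicorn binds to
---
```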

In the Swagger interface, sample prompts like “tell me about machine learning” or “how to be happy” produce completed sentences, demonstrating text-to-text generation powered by Transformers and the flan-t5-small model. The overall message is less about model novelty and more about operational reliability: containerizing the FastAPI + Transformers stack makes the generative AI app portable and deployable with minimal friction.
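
Assuming the /generate route takes a text query parameter as in the sketch above, the same request Swagger UI issues can also be sent programmatically with requests (the URL is a placeholder; a real Space is served at https://<user>-<space-name>.hf.space):

```python
import requests

# Placeholder URL; substitute the deployed Space's address.
resp = requests.get(
    "https://your-space.hf.space/generate",
    params={"text": "tell me about machine learning"},
)
print(resp.json())  # e.g. {"output": "..."}
```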

Cornell Notes

The project builds a text-generation API using FastAPI and Hugging Face Transformers, then packages it into a Docker container for deployment on Hugging Face Spaces. A Transformers text-to-text pipeline is initialized with flan-t5-small to fit the limited free-tier resources. FastAPI exposes a homepage route for health checking and a /generate route that accepts input text and returns generated text as JSON. The Dockerfile pins the runtime (Python 3.9), installs dependencies from requirements.txt, runs under a non-root user, and starts the service with uvicorn. After uploading and committing to a Docker-based Space, the Space auto-builds and the app becomes testable via Swagger UI at /docs.

How does the app turn user text into generated output?

A Transformers pipeline is created for text-to-text generation using the flan-t5-small model. The FastAPI /generate endpoint takes an input string, passes it into the pipeline, and returns the generated result as JSON. The pipeline output includes a field named generated_text, which is used as the response content.
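
The pipeline's return shape is easy to confirm locally before wiring it into the route; this sketch assumes the Hugging Face model id google/flan-t5-small:

```python
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")
result = generator("tell me about machine learning")
# The pipeline returns one dict per input, e.g.:
# [{'generated_text': '...'}]
print(result[0]["generated_text"])
```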

Why is requirements.txt central to both local development and deployment?

requirements.txt lists the exact libraries the service needs: FastAPI for the API layer, requests for HTTP interactions, sentencepiece for tokenization, torch for model execution, and Transformers for the pipeline. During Docker builds, the Dockerfile copies requirements.txt into the image and runs pip install -r requirements.txt, ensuring the container has the same dependencies every time.

What does the Dockerfile accomplish for a Hugging Face Space deployment?

It creates a reproducible runtime environment: it starts from the official Python 3.9 image, sets a working directory, installs dependencies from requirements.txt with caching disabled, creates a non-root user, copies the app code into the container, and launches the FastAPI app using uvicorn bound to 0.0.0.0 on port 7860. This makes the Space build-and-run process deterministic.

How do users verify the service is working after deployment?

After the Space finishes building, the homepage route returns a “hello world” message, confirming the container started successfully. For deeper verification, users open Swagger UI at /docs and use the “Try it out” button on the /generate endpoint to submit prompts and observe generated text in the response.

Why choose flan-t5-small in this setup?

The walkthrough targets Hugging Face Spaces free resources, which are limited in CPU and memory. flan-t5-small is positioned as a smaller model that is more likely to run within those constraints while still demonstrating text completion behavior.

Review Questions

  1. What routes does the FastAPI app expose, and what does each one return?
  2. Which dependencies must be included in requirements.txt for the Transformers pipeline to run, and why?
  3. How does the Dockerfile ensure the container starts the FastAPI service correctly in a cloud environment?

Key Points

  1. Build a FastAPI service with a health-check homepage route and a /generate endpoint that calls a Transformers text-generation pipeline.
  2. Use requirements.txt to list FastAPI, Transformers, torch, and tokenization dependencies so Docker builds install everything automatically.
  3. Initialize a Transformers pipeline with flan-t5-small to fit Hugging Face Spaces free-tier resource limits.
  4. Dockerize the app with a Dockerfile that installs dependencies, runs as a non-root user, and starts uvicorn on the expected port.
  5. Deploy to a Hugging Face Space configured for Docker so commits trigger automatic image builds and container runs.
  6. Validate deployment by checking the homepage response and testing /generate through Swagger UI at /docs.

Highlights

FastAPI + Transformers can be wrapped into a simple /generate endpoint that returns JSON with generated text (the pipeline’s generated_text field).
A Dockerfile built from Python 3.9 plus requirements.txt makes the deployment repeatable on Hugging Face Spaces.
Swagger UI (/docs) turns the deployed Space into an interactive text-generation console without extra frontend work.
Using flan-t5-small is a deliberate choice to keep inference feasible on free-tier Space resources.
