Deploy LayoutLMv3 for Document Classification using Streamlit, Transformers and HuggingFace Spaces
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
The Streamlit app classifies uploaded document images by running OCR to extract words and bounding boxes, then feeding them into a LayoutLMv3 processor/model pipeline.
Briefing
A Streamlit web app is built to classify document images using a fine-tuned LayoutLMv3 model, then deployed to Hugging Face Spaces so anyone can upload a document and get both the predicted document type and a probability breakdown. The workflow turns a trained document classifier into an interactive demo: users upload a JPEG/PNG, the app runs OCR to extract words and bounding boxes, feeds those into the LayoutLMv3 processor/model, and returns the top label plus a confidence distribution visualized as a bar chart.
The implementation starts with a clean project setup: a Python 3.10.9 virtual environment and installation of the core ML and app dependencies, including PyTorch, Transformers, and Streamlit. Because Hugging Face Spaces constrains which Streamlit versions it supports, the app pins Streamlit to the maximum supported version (noted as 1.15.2 at the time of setup). Development tooling is configured for formatting and linting (Black, Flake8, isort), but the functional heart of the project is the inference pipeline.
Model loading is structured around Streamlit caching to avoid re-initializing heavy components on every interaction. The app defines cached constructors for the OCR reader (configured for English), the LayoutLMv3 processor (built from a pre-trained tokenizer/feature extractor), and the fine-tuned model loaded from the Hugging Face Hub. A key performance detail is Streamlit’s experimental singleton-style caching, which creates the OCR reader, processor, and model once and reuses them across Streamlit reruns.
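A minimal sketch of those cached constructors, assuming EasyOCR as the OCR library and a hypothetical Hub id for the fine-tuned model (`st.experimental_singleton` matches the Streamlit 1.15.x API pinned above):

```python
import streamlit as st
import easyocr  # assumed OCR library; the reader is configured for English
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor


# Each constructor runs once per process; Streamlit reuses the returned
# object across reruns instead of rebuilding it on every interaction.
@st.experimental_singleton
def create_ocr_reader():
    return easyocr.Reader(["en"])


@st.experimental_singleton
def create_processor():
    # apply_ocr=False because the app supplies its own OCR words and boxes
    return LayoutLMv3Processor.from_pretrained(
        "microsoft/layoutlmv3-base", apply_ocr=False
    )


@st.experimental_singleton
def create_model():
    # Hypothetical repository id; the real app loads its own fine-tuned checkpoint
    return LayoutLMv3ForSequenceClassification.from_pretrained(
        "your-username/layoutlmv3-document-classifier"
    )
```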
For inference, the app accepts an uploaded image, opens it from in-memory bytes, and runs OCR to extract words and their bounding boxes. Those boxes are scaled to match LayoutLM’s expected coordinate system, which normalizes coordinates to a 0–1000 range (the transcript references a width scale of 1000). A helper function turns the OCR outputs into the final bounding box tensors. The processor then encodes the image together with the OCR words and bounding boxes, applying padding/truncation to a maximum length of 512 and returning PyTorch tensors.
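A sketch of that preprocessing path, assuming EasyOCR's `readtext` output format (corner points, text, confidence per detection); the helper names here are illustrative, not necessarily the original's:

```python
import numpy as np
from PIL import Image


def scale_bounding_box(box, width_scale, height_scale):
    # box is [x_min, y_min, x_max, y_max] in pixels; LayoutLM expects
    # integer coordinates normalized to a 0-1000 grid
    return [
        int(box[0] * width_scale),
        int(box[1] * height_scale),
        int(box[2] * width_scale),
        int(box[3] * height_scale),
    ]


def encode_document(image: Image.Image, reader, processor):
    width, height = image.size
    width_scale = 1000 / width
    height_scale = 1000 / height

    words, boxes = [], []
    # EasyOCR yields (corner_points, text, confidence) per detected region
    for corners, word, _confidence in reader.readtext(np.array(image)):
        xs = [point[0] for point in corners]
        ys = [point[1] for point in corners]
        words.append(word)
        boxes.append(
            scale_bounding_box(
                [min(xs), min(ys), max(xs), max(ys)], width_scale, height_scale
            )
        )

    # Encode image + words + boxes, padded/truncated to 512 tokens
    return processor(
        image,
        words,
        boxes=boxes,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
```

In the app, `image` would come from the uploaded file, e.g. `Image.open(uploaded_file).convert("RGB")`, since Streamlit's file uploader returns a file-like object.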
Prediction runs the model and applies softmax to convert the output logits into probabilities. The app selects the class with the highest probability (argmax) and maps its index back to a human-readable label via id2label. It then builds a Pandas DataFrame of class names and confidence scores and uses Plotly Express to render a bar chart inside Streamlit, showing the probability distribution across document types.
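A hedged sketch of that prediction-and-plotting step (variable and column names are illustrative):

```python
import pandas as pd
import plotly.express as px
import streamlit as st
import torch


def predict_and_plot(model, encoding):
    with torch.inference_mode():
        outputs = model(**encoding)

    # Softmax turns the logits into a probability distribution over classes
    probabilities = torch.softmax(outputs.logits, dim=-1).squeeze().tolist()
    predicted_id = int(outputs.logits.argmax(dim=-1))
    predicted_label = model.config.id2label[predicted_id]

    st.markdown(f"Predicted document type: **{predicted_label}**")

    # Bar chart of confidence across all document types
    df = pd.DataFrame(
        {
            "label": list(model.config.id2label.values()),
            "probability": probabilities,
        }
    )
    st.plotly_chart(px.bar(df, x="label", y="probability"))
```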
Finally, the project is deployed to Hugging Face Spaces as a Streamlit Space. A requirements.txt file lists the runtime dependencies (pandas, Plotly, torch, Transformers, and others). After the Space finishes building, the demo is tested with multiple document images (e.g., balance sheets and income statements). Predictions run faster locally (especially with CUDA) and take longer on Spaces, but the app still produces correct labels and confidence plots. The result is a complete end-to-end document classification demo: upload → OCR + LayoutLMv3 inference → label + probability visualization → public deployment.
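A plausible requirements.txt for such a Space; apart from the pinned Streamlit version noted above, the exact package list (including the choice of easyocr) is an assumption, and the original file may differ:

```
streamlit==1.15.2
torch
transformers
easyocr
pandas
plotly
```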
Cornell Notes
The app turns a fine-tuned LayoutLMv3 document classifier into an interactive Streamlit demo. Users upload a JPEG/PNG; the system runs OCR to extract words and bounding boxes, scales coordinates to LayoutLM’s expected format, encodes everything with the LayoutLMv3 processor, and performs inference with the model. Softmax converts model outputs into a probability distribution, and the top class becomes the predicted document type. A Plotly bar chart visualizes confidence across all classes, and the whole app is deployed to Hugging Face Spaces with a pinned Streamlit version and a requirements.txt file. This matters because it makes a trained document model usable by others without local setup.
How does the app convert a raw document image into inputs LayoutLMv3 can use?
Why does caching matter in a Streamlit document-classification app?
What produces the probability distribution shown in the UI?
How is the predicted label selected and displayed?
What deployment steps make the demo available on Hugging Face Spaces?
Review Questions
- What exact preprocessing steps are required between OCR output (words + bounding boxes) and the LayoutLMv3 processor call?
- How does softmax output get transformed into both a top-1 prediction and a full confidence bar chart?
- Why would the app feel slow without caching, and what components are cached to prevent repeated initialization?
Key Points
1. The Streamlit app classifies uploaded document images by running OCR to extract words and bounding boxes, then feeding them into a LayoutLMv3 processor/model pipeline.
2. Bounding boxes must be scaled to LayoutLM’s expected coordinate system before encoding; the transcript uses width/height scaling with a width scale of 1000.
3. Streamlit caching (singleton-style) is used so the OCR reader, processor, and model load once instead of on every UI interaction.
4. Inference returns logits that are converted to probabilities with softmax; argmax selects the predicted document type.
5. The UI shows both the predicted label and a probability distribution bar chart built from a Pandas DataFrame and Plotly Express.
6. Deployment to Hugging Face Spaces requires a Streamlit Space plus a requirements.txt file listing runtime dependencies and a pinned Streamlit version for compatibility.