Deploy LayoutLMv3 for Document Classification using Streamlit, Transformers and HuggingFace Spaces
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
The Streamlit app classifies uploaded document images by running OCR to extract words and bounding boxes, then feeding them into a LayoutLMv3 processor/model pipeline.
Briefing
A Streamlit web app is built to classify document images using a fine-tuned LayoutLMv3 model, then deployed to Hugging Face Spaces so anyone can upload a document and get both the predicted document type and a probability breakdown. The workflow turns a trained document classifier into an interactive demo: users upload a JPEG/PNG, the app runs OCR to extract words and bounding boxes, feeds those into the LayoutLMv3 processor/model, and returns the top label plus a confidence distribution visualized as a bar chart.
The implementation starts with a clean project setup: a Python 3.10.9 virtual environment and installation of the core ML and app dependencies, including PyTorch, Transformers, and Streamlit. Because Hugging Face Spaces constrains which Streamlit versions it supports, the app pins Streamlit to the maximum supported version (noted as 1.15.2 at the time of setup). Development tooling is configured for formatting and linting (Black, Flake8, isort), but the functional heart of the project is the inference pipeline.
Model loading is structured around Streamlit caching to avoid re-initializing heavy components on every interaction. The app defines cached constructors for the OCR reader (configured for English), the LayoutLMv3 processor (built from a pre-trained tokenizer/feature extractor), and the fine-tuned model loaded from the Hugging Face Hub. A key performance detail is Streamlit’s experimental singleton-style caching, which creates the OCR reader, processor, and model once and reuses them across Streamlit reruns.
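A minimal sketch of those cached constructors, assuming EasyOCR as the OCR library and a hypothetical Hub id for the fine-tuned model (`st.experimental_singleton` matches the Streamlit 1.15.x API pinned above):

```python
import streamlit as st
import easyocr  # assumed OCR library; the reader is configured for English
from transformers import LayoutLMv3ForSequenceClassification, LayoutLMv3Processor


# Each constructor runs once per process; Streamlit reuses the returned
# object across reruns instead of rebuilding it on every interaction.
@st.experimental_singleton
def create_ocr_reader():
    return easyocr.Reader(["en"])


@st.experimental_singleton
def create_processor():
    # apply_ocr=False because the app supplies its own OCR words and boxes
    return LayoutLMv3Processor.from_pretrained(
        "microsoft/layoutlmv3-base", apply_ocr=False
    )


@st.experimental_singleton
def create_model():
    # Hypothetical repository id; the real app loads its own fine-tuned checkpoint
    return LayoutLMv3ForSequenceClassification.from_pretrained(
        "your-username/layoutlmv3-document-classifier"
    )
```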
For inference, the app accepts an uploaded image, opens it from in-memory bytes, and runs OCR to extract words and their bounding boxes. Those boxes are scaled to match LayoutLM’s expected coordinate system, which normalizes coordinates to a 0–1000 range (the transcript references a width scale of 1000). A helper function turns the OCR outputs into the final bounding box tensors. The processor then encodes the image together with the OCR words and bounding boxes, applying padding/truncation to a maximum length of 512 and returning PyTorch tensors.
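A sketch of that preprocessing path, assuming EasyOCR's `readtext` output format (corner points, text, confidence per detection); the helper names here are illustrative, not necessarily the original's:

```python
import numpy as np
from PIL import Image


def scale_bounding_box(box, width_scale, height_scale):
    # box is [x_min, y_min, x_max, y_max] in pixels; LayoutLM expects
    # integer coordinates normalized to a 0-1000 grid
    return [
        int(box[0] * width_scale),
        int(box[1] * height_scale),
        int(box[2] * width_scale),
        int(box[3] * height_scale),
    ]


def encode_document(image: Image.Image, reader, processor):
    width, height = image.size
    width_scale = 1000 / width
    height_scale = 1000 / height

    words, boxes = [], []
    # EasyOCR yields (corner_points, text, confidence) per detected region
    for corners, word, _confidence in reader.readtext(np.array(image)):
        xs = [point[0] for point in corners]
        ys = [point[1] for point in corners]
        words.append(word)
        boxes.append(
            scale_bounding_box(
                [min(xs), min(ys), max(xs), max(ys)], width_scale, height_scale
            )
        )

    # Encode image + words + boxes, padded/truncated to 512 tokens
    return processor(
        image,
        words,
        boxes=boxes,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
```

In the app, `image` would come from the uploaded file, e.g. `Image.open(uploaded_file).convert("RGB")`, since Streamlit's file uploader returns a file-like object.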
Prediction runs the model and applies softmax to convert the output logits into probabilities. The app selects the class with the highest probability (argmax) and maps its index back to a human-readable label via id2label. It then builds a Pandas DataFrame of class names and confidence scores and uses Plotly Express to render a bar chart inside Streamlit, showing the probability distribution across document types.
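A hedged sketch of that prediction-and-plotting step (variable and column names are illustrative):

```python
import pandas as pd
import plotly.express as px
import streamlit as st
import torch


def predict_and_plot(model, encoding):
    with torch.inference_mode():
        outputs = model(**encoding)

    # Softmax turns the logits into a probability distribution over classes
    probabilities = torch.softmax(outputs.logits, dim=-1).squeeze().tolist()
    predicted_id = int(outputs.logits.argmax(dim=-1))
    predicted_label = model.config.id2label[predicted_id]

    st.markdown(f"Predicted document type: **{predicted_label}**")

    # Bar chart of confidence across all document types
    df = pd.DataFrame(
        {
            "label": list(model.config.id2label.values()),
            "probability": probabilities,
        }
    )
    st.plotly_chart(px.bar(df, x="label", y="probability"))
```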
Finally, the project is deployed to Hugging Face Spaces as a Streamlit Space. A requirements.txt file lists the runtime dependencies (pandas, Plotly, torch, Transformers, and others). After the Space finishes building, the demo is tested with multiple document images (e.g., balance sheets and income statements). Predictions run faster locally (especially with CUDA) and take longer on Spaces, but the app still produces correct labels and confidence plots. The result is a complete end-to-end document classification demo: upload → OCR + LayoutLMv3 inference → label + probability visualization → public deployment.
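A plausible requirements.txt for such a Space; apart from the pinned Streamlit version noted above, the exact package list (including the choice of easyocr) is an assumption, and the original file may differ:

```
streamlit==1.15.2
torch
transformers
easyocr
pandas
plotly
```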
Cornell Notes
The app turns a fine-tuned LayoutLMv3 document classifier into an interactive Streamlit demo. Users upload a JPEG/PNG; the system runs OCR to extract words and bounding boxes, scales coordinates to LayoutLM’s expected format, encodes everything with the LayoutLMv3 processor, and performs inference with the model. Softmax converts model outputs into a probability distribution, and the top class becomes the predicted document type. A Plotly bar chart visualizes confidence across all classes, and the whole app is deployed to Hugging Face Spaces with a pinned Streamlit version and a requirements.txt file. This matters because it makes a trained document model usable by others without local setup.
How does the app convert a raw document image into inputs LayoutLMv3 can use?
Why does caching matter in a Streamlit document-classification app?
What produces the probability distribution shown in the UI?
How is the predicted label selected and displayed?
What deployment steps make the demo available on Hugging Face Spaces?
Review Questions
- What exact preprocessing steps are required between OCR output (words + bounding boxes) and the LayoutLMv3 processor call?
- How does softmax output get transformed into both a top-1 prediction and a full confidence bar chart?
- Why would the app feel slow without caching, and what components are cached to prevent repeated initialization?
Key Points
1. The Streamlit app classifies uploaded document images by running OCR to extract words and bounding boxes, then feeding them into a LayoutLMv3 processor/model pipeline.
2. Bounding boxes must be scaled to LayoutLM’s expected coordinate system before encoding; the transcript uses width/height scaling with a width scale of 1000.
3. Streamlit caching (singleton-style) is used so the OCR reader, processor, and model load once instead of on every UI interaction.
4. Inference returns logits that are converted to probabilities with softmax; argmax selects the predicted document type.
5. The UI shows both the predicted label and a probability distribution bar chart built from a Pandas DataFrame and Plotly Express.
6. Deployment to Hugging Face Spaces requires a Streamlit Space plus a requirements.txt file listing runtime dependencies and a pinned Streamlit version for compatibility.