Image Annotation with LLaVA & Ollama
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Use a local vision-language model (LLaVA 1.6 via Ollama) to generate per-screenshot captions so a large screenshot folder becomes searchable.
Briefing
A practical way to turn a cluttered screenshot folder into a searchable archive is to run a local vision-language model over each image and save the resulting captions to a CSV. The workflow targets a common pain point: manually scrolling through years of screenshots to find a specific moment, UI, receipt, or diagram. By generating per-image descriptions automatically, later keyword search becomes feasible—and the groundwork is laid for a more powerful “ask questions about my screenshots” setup using RAG.
The approach starts with a simple folder-to-table pipeline. A script lists screenshot files from a designated directory (the example uses PNGs), sorts them, and then processes each image that hasn’t already been annotated. To avoid rework, it checks for an existing CSV (named image_descriptions.csv). If the CSV exists, it loads prior results and skips images already present; if not, it creates a new dataframe with two columns: image file and description. Each image is loaded, converted to bytes, and sent along with a prompt to Ollama’s LLaVA 1.6 model for caption generation.
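A minimal sketch of that loop is shown below, assuming the Ollama Python client (`pip install ollama`), a running Ollama server with a LLaVA 1.6 tag pulled locally, and illustrative names for the folder and CSV columns (`screenshots/`, `image_file`, `description`); the prompt is the baseline one described later in this briefing:

```python
import os
import pandas as pd
import ollama  # assumes the Ollama Python client and a local Ollama server

IMAGE_DIR = "screenshots"            # hypothetical folder of PNG screenshots
CSV_PATH = "image_descriptions.csv"
MODEL = "llava:13b"                  # LLaVA 1.6 tag; swap for 7b/34b as needed
PROMPT = ("Describe this image and make sure to include anything notable "
          "about it, include text in the image")

# Load prior results so already-annotated images are skipped on re-runs.
if os.path.exists(CSV_PATH):
    df = pd.read_csv(CSV_PATH)
else:
    df = pd.DataFrame(columns=["image_file", "description"])

done = set(df["image_file"])
files = sorted(f for f in os.listdir(IMAGE_DIR) if f.endswith(".png"))

for name in files:
    if name in done:
        continue
    # Read the PNG as raw bytes and pass it to the vision model.
    with open(os.path.join(IMAGE_DIR, name), "rb") as fh:
        image_bytes = fh.read()
    result = ollama.generate(model=MODEL, prompt=PROMPT, images=[image_bytes])
    df.loc[len(df)] = [name, result["response"]]
    df.to_csv(CSV_PATH, index=False)  # save after each image so progress isn't lost
```

Writing the CSV inside the loop trades a little I/O for resilience: if the run is interrupted, the images annotated so far are not lost and will be skipped next time.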
Model choice matters for both quality and speed. The example walks through LLaVA 1.6 variants at different parameter sizes: 7B, 13B, and 34B. In testing, the 7B model often misses obvious details, while the 13B model improves overall understanding of the image. The 34B model is described as the best performer, but it may be too slow or too demanding to run depending on hardware. The script is designed for local use; the author notes that on a machine with 32GB of RAM, the 34B model runs noticeably slower than the 13B option.
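One way to gauge that trade-off on your own hardware is to run a single test screenshot through each tag and time it. The snippet below is a rough sketch; `llava:7b`, `llava:13b`, and `llava:34b` are the commonly published Ollama tags, and `screenshots/sample.png` is a hypothetical test image:

```python
import time
import ollama

PROMPT = "Describe this image and make sure to include anything notable about it"

with open("screenshots/sample.png", "rb") as fh:   # hypothetical test image
    image_bytes = fh.read()

for tag in ["llava:7b", "llava:13b", "llava:34b"]:  # drop 34b if RAM is tight
    start = time.perf_counter()
    result = ollama.generate(model=tag, prompt=PROMPT, images=[image_bytes])
    elapsed = time.perf_counter() - start
    print(f"--- {tag} ({elapsed:.1f}s) ---")
    print(result["response"][:300])  # first few hundred characters for a quick read
```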
The prompt is customized to the annotation goal. The baseline instruction is to “Describe this image and make sure to include anything notable about it,” with an emphasis on text in the image (“include text in the image”). That matters because screenshots frequently contain the information people want to retrieve later—like product names, logos, or UI labels. Still, OCR-like accuracy isn’t guaranteed, especially with smaller models, so the guidance is to iterate on prompts for specific use cases. If the goal is something more specialized—such as checking content categories—prompt tuning becomes part of the process.
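To make that iteration concrete, the variants below are hypothetical examples: the first mirrors the baseline instruction from the walkthrough, while the other two illustrate tuning toward narrower retrieval goals and are not from the video:

```python
# Baseline prompt from the walkthrough, plus two illustrative specializations.
PROMPTS = {
    "general": ("Describe this image and make sure to include anything notable "
                "about it, include text in the image"),
    "receipts": ("If this screenshot shows a receipt or invoice, list the vendor, "
                 "date, and total amount; otherwise say 'not a receipt'."),
    "ui_review": ("Describe the application and screen shown, naming any visible "
                  "buttons, menus, or error messages verbatim."),
}
```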
After generation, the script streams the model output, assembles the full response, and writes it back into the dataframe as a new row keyed by the image filename. The final CSV can be opened in Excel or Google Sheets, or fed into downstream systems. The transcript also points to natural extensions: storing additional metadata such as file creation/modification dates and user information, then using that metadata to narrow searches when many images are similar. A next step mentioned is integrating the captions and metadata into a custom RAG system so users can query their screenshot archive with natural language using fully open-source local models.
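A sketch of that last step, again assuming the Ollama Python client and the same illustrative column names, streams the tokens as they arrive and then persists the assembled caption:

```python
import ollama
import pandas as pd

def annotate(df: pd.DataFrame, image_path: str, image_name: str,
             model: str = "llava:13b") -> pd.DataFrame:
    """Stream a caption for one image and append it to the dataframe."""
    with open(image_path, "rb") as fh:
        image_bytes = fh.read()

    parts = []
    # stream=True yields chunks; each chunk carries a piece of the response text.
    for chunk in ollama.generate(
        model=model,
        prompt="Describe this image and make sure to include anything notable "
               "about it, include text in the image",
        images=[image_bytes],
        stream=True,
    ):
        parts.append(chunk["response"])
        print(chunk["response"], end="", flush=True)  # show progress live

    df.loc[len(df)] = [image_name, "".join(parts)]
    df.to_csv("image_descriptions.csv", index=False)
    return df
```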
Cornell Notes
The workflow turns a local screenshot folder into a searchable dataset by generating captions with Ollama’s LLaVA 1.6 vision-language models and saving results to a CSV. A script scans a directory for images, converts each image to bytes, and calls the model with a prompt designed to capture notable details—especially text visible in screenshots. It avoids duplicate work by loading an existing image_descriptions.csv and skipping files already annotated. Model size trades off accuracy and speed: 7B can miss obvious details, 13B improves understanding, and 34B is best but may be slow or hardware-heavy. The resulting CSV can support keyword search now and can later feed into RAG for question-answering over screenshots.
- How does the script prevent re-annotating screenshots every time it runs?
- Why convert PNG images to bytes before sending them to LLaVA?
- What trade-offs come with using different LLaVA 1.6 model sizes (7B, 13B, 34B)?
- How does prompt design affect the usefulness of the annotations for later search?
- What does the output format enable immediately, and what does it enable later?
- What metadata enhancements were suggested for a more advanced version?
Review Questions
- What steps in the pipeline ensure that only unprocessed images are sent to the model?
- How do model size and prompt wording jointly influence caption quality, especially for screenshots containing text?
- What additional metadata would most improve retrieval when you have many similar screenshots, and how would it be used in search or RAG?
Key Points
1. Use a local vision-language model (LLaVA 1.6 via Ollama) to generate per-screenshot captions so a large screenshot folder becomes searchable.
2. Scan a screenshot directory, sort filenames, and process only images not already present in image_descriptions.csv to avoid duplicate work.
3. Convert images to bytes before sending them to the model, which helps with PNG handling in the described setup.
4. Choose model size based on hardware and accuracy needs: 7B is faster but misses details, 13B improves understanding, and 34B is best but slower.
5. Customize prompts to your retrieval goal, including explicit instructions to capture text visible in screenshots.
6. Save results to a CSV for immediate use in spreadsheets and keyword search, then extend to databases/vector stores for RAG-based Q&A.
7. Add metadata like creation/modification dates and owner information to enable more precise filtering and better retrieval when images are similar (a small sketch follows this list).
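A standard-library sketch of the metadata idea from point 7; which timestamps and owner fields you actually get depends on the operating system, and the merged-row example at the bottom uses the same illustrative column names as above:

```python
import os
from datetime import datetime
from pathlib import Path

def file_metadata(path: str) -> dict:
    """Collect basic filesystem metadata to store alongside each caption."""
    stat = os.stat(path)
    return {
        # st_ctime is creation time on Windows/macOS but metadata-change time on Linux.
        "created": datetime.fromtimestamp(stat.st_ctime).isoformat(),
        "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
        "size_bytes": stat.st_size,
        "owner": _owner(path),
    }

def _owner(path: str):
    try:
        return Path(path).owner()  # available on Unix; may raise elsewhere
    except (KeyError, NotImplementedError):
        return None

# Example: merge into the annotation row before writing the CSV.
# row = {"image_file": name, "description": caption, **file_metadata(path)}
```

Stored alongside the captions, these fields let a later search or RAG layer filter by date range or owner before ranking otherwise similar screenshots.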