Image Annotation with LLaVA & Ollama
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Use a local vision-language model (LLaVA 1.6 via Ollama) to generate per-screenshot captions so a large screenshot folder becomes searchable.
Briefing
A practical way to turn a cluttered screenshot folder into a searchable archive is to run a local vision-language model over each image and save the resulting captions to a CSV. The workflow targets a common pain point: manually scrolling through years of screenshots to find a specific moment, UI, receipt, or diagram. By generating per-image descriptions automatically, later keyword search becomes feasible—and the groundwork is laid for a more powerful “ask questions about my screenshots” setup using RAG.
The approach starts with a simple folder-to-table pipeline. A script lists screenshot files from a designated directory (the example uses PNGs), sorts them, and then processes each image that hasn’t already been annotated. To avoid rework, it checks for an existing CSV (named image_descriptions.csv). If the CSV exists, it loads prior results and skips images already present; if not, it creates a new dataframe with two columns: image file and description. Each image is loaded, converted to bytes, and sent along with a prompt to Ollama’s LLaVA 1.6 model for caption generation.
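A minimal sketch of that loop is shown below, assuming the Ollama Python client (`pip install ollama`), a running Ollama server with a LLaVA 1.6 tag pulled locally, and illustrative names for the folder and CSV columns (`screenshots/`, `image_file`, `description`); the prompt is the baseline one described later in this briefing:

```python
import os
import pandas as pd
import ollama  # assumes the Ollama Python client and a local Ollama server

IMAGE_DIR = "screenshots"            # hypothetical folder of PNG screenshots
CSV_PATH = "image_descriptions.csv"
MODEL = "llava:13b"                  # LLaVA 1.6 tag; swap for 7b/34b as needed
PROMPT = ("Describe this image and make sure to include anything notable "
          "about it, include text in the image")

# Load prior results so already-annotated images are skipped on re-runs.
if os.path.exists(CSV_PATH):
    df = pd.read_csv(CSV_PATH)
else:
    df = pd.DataFrame(columns=["image_file", "description"])

done = set(df["image_file"])
files = sorted(f for f in os.listdir(IMAGE_DIR) if f.endswith(".png"))

for name in files:
    if name in done:
        continue
    # Read the PNG as raw bytes and pass it to the vision model.
    with open(os.path.join(IMAGE_DIR, name), "rb") as fh:
        image_bytes = fh.read()
    result = ollama.generate(model=MODEL, prompt=PROMPT, images=[image_bytes])
    df.loc[len(df)] = [name, result["response"]]
    df.to_csv(CSV_PATH, index=False)  # save after each image so progress isn't lost
```

Writing the CSV inside the loop trades a little I/O for resilience: if the run is interrupted, the images annotated so far are not lost and will be skipped next time.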
Model choice matters for both quality and speed. The example walks through LLaVA 1.6 variants at different parameter sizes: 7B, 13B, and 34B. In testing, the 7B model often misses obvious details, while the 13B model improves overall understanding of the image. The 34B model is described as the best performer, but it may be too slow or too demanding to run depending on hardware. The script is designed for local use; the author notes that on a machine with 32GB of RAM, the 34B model runs noticeably slower than the 13B option.
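One way to gauge that trade-off on your own hardware is to run a single test screenshot through each tag and time it. The snippet below is a rough sketch; `llava:7b`, `llava:13b`, and `llava:34b` are the commonly published Ollama tags, and `screenshots/sample.png` is a hypothetical test image:

```python
import time
import ollama

PROMPT = "Describe this image and make sure to include anything notable about it"

with open("screenshots/sample.png", "rb") as fh:   # hypothetical test image
    image_bytes = fh.read()

for tag in ["llava:7b", "llava:13b", "llava:34b"]:  # drop 34b if RAM is tight
    start = time.perf_counter()
    result = ollama.generate(model=tag, prompt=PROMPT, images=[image_bytes])
    elapsed = time.perf_counter() - start
    print(f"--- {tag} ({elapsed:.1f}s) ---")
    print(result["response"][:300])  # first few hundred characters for a quick read
```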
The prompt is customized to the annotation goal. The baseline instruction is to “Describe this image and make sure to include anything notable about it,” with an emphasis on text in the image (“include text in the image”). That matters because screenshots frequently contain the information people want to retrieve later—like product names, logos, or UI labels. Still, OCR-like accuracy isn’t guaranteed, especially with smaller models, so the guidance is to iterate on prompts for specific use cases. If the goal is something more specialized—such as checking content categories—prompt tuning becomes part of the process.
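To make that iteration concrete, the variants below are hypothetical examples: the first mirrors the baseline instruction from the walkthrough, while the other two illustrate tuning toward narrower retrieval goals and are not from the video:

```python
# Baseline prompt from the walkthrough, plus two illustrative specializations.
PROMPTS = {
    "general": ("Describe this image and make sure to include anything notable "
                "about it, include text in the image"),
    "receipts": ("If this screenshot shows a receipt or invoice, list the vendor, "
                 "date, and total amount; otherwise say 'not a receipt'."),
    "ui_review": ("Describe the application and screen shown, naming any visible "
                  "buttons, menus, or error messages verbatim."),
}
```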
After generation, the script streams the model output, assembles the full response, and writes it back into the dataframe as a new row keyed by the image filename. The final CSV can be opened in Excel or Google Sheets, or fed into downstream systems. The transcript also points to natural extensions: storing additional metadata such as file creation/modification dates and user information, then using that metadata to narrow searches when many images are similar. A next step mentioned is integrating the captions and metadata into a custom RAG system so users can query their screenshot archive with natural language using fully open-source local models.
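A sketch of that last step, again assuming the Ollama Python client and the same illustrative column names, streams the tokens as they arrive and then persists the assembled caption:

```python
import ollama
import pandas as pd

def annotate(df: pd.DataFrame, image_path: str, image_name: str,
             model: str = "llava:13b") -> pd.DataFrame:
    """Stream a caption for one image and append it to the dataframe."""
    with open(image_path, "rb") as fh:
        image_bytes = fh.read()

    parts = []
    # stream=True yields chunks; each chunk carries a piece of the response text.
    for chunk in ollama.generate(
        model=model,
        prompt="Describe this image and make sure to include anything notable "
               "about it, include text in the image",
        images=[image_bytes],
        stream=True,
    ):
        parts.append(chunk["response"])
        print(chunk["response"], end="", flush=True)  # show progress live

    df.loc[len(df)] = [image_name, "".join(parts)]
    df.to_csv("image_descriptions.csv", index=False)
    return df
```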
Cornell Notes
The workflow turns a local screenshot folder into a searchable dataset by generating captions with Ollama’s LLaVA 1.6 vision-language models and saving results to a CSV. A script scans a directory for images, converts each image to bytes, and calls the model with a prompt designed to capture notable details—especially text visible in screenshots. It avoids duplicate work by loading an existing image_descriptions.csv and skipping files already annotated. Model size trades off accuracy and speed: 7B can miss obvious details, 13B improves understanding, and 34B is best but may be slow or hardware-heavy. The resulting CSV can support keyword search now and can later feed into RAG for question-answering over screenshots.
- How does the script prevent re-annotating screenshots every time it runs?
- Why convert PNG images to bytes before sending them to LLaVA?
- What trade-offs come with using different LLaVA 1.6 model sizes (7B, 13B, 34B)?
- How does prompt design affect the usefulness of the annotations for later search?
- What does the output format enable immediately, and what does it enable later?
- What metadata enhancements were suggested for a more advanced version?
Review Questions
- What steps in the pipeline ensure that only unprocessed images are sent to the model?
- How do model size and prompt wording jointly influence caption quality, especially for screenshots containing text?
- What additional metadata would most improve retrieval when you have many similar screenshots, and how would it be used in search or RAG?
Key Points
1. Use a local vision-language model (LLaVA 1.6 via Ollama) to generate per-screenshot captions so a large screenshot folder becomes searchable.
2. Scan a screenshot directory, sort filenames, and process only images not already present in image_descriptions.csv to avoid duplicate work.
3. Convert images to bytes before sending them to the model, which helps with PNG handling in the described setup.
4. Choose model size based on hardware and accuracy needs: 7B is faster but misses details, 13B improves understanding, and 34B is best but slower.
5. Customize prompts to your retrieval goal, including explicit instructions to capture text visible in screenshots.
6. Save results to a CSV for immediate use in spreadsheets and keyword search, then extend to databases/vector stores for RAG-based Q&A.
7. Add metadata like creation/modification dates and owner information to enable more precise filtering and better retrieval when images are similar (a small sketch follows this list).
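A standard-library sketch of the metadata idea from point 7; which timestamps and owner fields you actually get depends on the operating system, and the merged-row example at the bottom uses the same illustrative column names as above:

```python
import os
from datetime import datetime
from pathlib import Path

def file_metadata(path: str) -> dict:
    """Collect basic filesystem metadata to store alongside each caption."""
    stat = os.stat(path)
    return {
        # st_ctime is creation time on Windows/macOS but metadata-change time on Linux.
        "created": datetime.fromtimestamp(stat.st_ctime).isoformat(),
        "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
        "size_bytes": stat.st_size,
        "owner": _owner(path),
    }

def _owner(path: str):
    try:
        return Path(path).owner()  # available on Unix; may raise elsewhere
    except (KeyError, NotImplementedError):
        return None

# Example: merge into the annotation row before writing the CSV.
# row = {"image_file": name, "description": caption, **file_metadata(path)}
```

Stored alongside the captions, these fields let a later search or RAG layer filter by date range or owner before ranking otherwise similar screenshots.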