
Create Your Own Microsoft Recall AI Feature with RAG?

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.

TL;DR

The system captures screenshots only when screen content changes by at least 5% in pixel difference, using OpenCV to reduce redundant saves.

Briefing

A practical “Recall”-style system can be built by combining automated screen capture, AI-based interpretation of what’s on-screen, and a retrieval layer that lets users search past activity by natural language. The core idea is to avoid saving every frame: the setup only records a new screenshot when the screen changes enough, then uses a vision-capable GPT model to extract actions and URLs from that image, storing both the screenshot and a text “history” entry. Later, a RAG (retrieval-augmented generation) pipeline embeds that history locally and retrieves the most relevant past events—along with the exact screenshot—when someone asks questions like what they did on a specific site.

The workflow is split into three phases. In the record phase, a script continuously monitors pixel differences between the current screen and the previous screenshot. If at least 5% of pixels change, it captures a new screenshot and saves it with a custom, AI-generated filename meant to function like a searchable label. That filename is produced by GPT-4o and is derived from the analyzed on-screen description, so later queries can match keywords rather than random IDs. The analyze phase then sends each saved screenshot to GPT-4o to extract the most important information: what interactions occurred, any visible URLs, and the associated screenshot name. The extracted text is appended to an archive (“history text”), and the screenshot itself is stored in a dedicated folder so the user can “rewind” to a frozen moment.
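
The video does not show the exact prompts, but the analyze step can be sketched roughly as follows using the OpenAI Python SDK. The prompt wording and the suggest_filename helper are illustrative assumptions, not the original script.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_screenshot(image_path: str) -> str:
    """Ask GPT-4o to describe on-screen actions and any visible URLs."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the most important user actions in this "
                         "screenshot and list any URLs that are visible."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def suggest_filename(description: str) -> str:
    """Ask GPT-4o for a short, keyword-rich label to use as the screenshot name."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Suggest a short snake_case filename (no extension) "
                       "summarizing this screen activity: " + description,
        }],
    )
    raw = response.choices[0].message.content.strip()
    # Keep only filesystem-safe characters in case the model adds punctuation.
    return "".join(c for c in raw if c.isalnum() or c in "_-")[:60] or "screenshot"
```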

The retrieval phase turns the archive into a searchable knowledge base. The system creates embeddings from the history text using local embedding models, then uses Llama 3 to search within that RAG space. A user can ask a question such as “Did I visit Discord yesterday?” and the system returns the relevant history entry plus the screenshot filename, enabling the user to open the exact captured image. The transcript emphasizes that this approach mirrors the conceptual mechanism behind Microsoft Recall: store time-stamped evidence, summarize it into retrievable text, and connect queries to the underlying artifacts.
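
A minimal sketch of that retrieval layer, assuming Ollama serves both the local embedding model and Llama 3 (nomic-embed-text is used here as a placeholder; the video does not name the exact embedding model):

```python
import math
import ollama  # assumes an Ollama server is running locally

def embed(text: str) -> list[float]:
    """Embed one history chunk with a local embedding model."""
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def ask_history(question: str, chunks: list[str], top_k: int = 3) -> str:
    """Retrieve the most relevant history chunks and let Llama 3 answer from them."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    reply = ollama.chat(
        model="llama3",
        messages=[{
            "role": "user",
            "content": "Using only this activity history:\n" + context +
                       "\n\nAnswer the question and mention the screenshot "
                       "filename if one is listed: " + question,
        }],
    )
    return reply["message"]["content"]
```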

A key implementation detail is chunking: each extracted action is broken into segments capped at about 1,000 characters and appended to the history text so the embedding step can index the full sequence of user activity. The system also includes a “compare screenshots” step using OpenCV to compute pixel diffs, and it runs in a loop with a short startup delay. In a live demonstration, the script captures activity while the user browses sites and opens content; later, the RAG layer successfully answers questions about reading posts and identifies related screenshots by matching query terms to the AI-generated filenames.
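
A self-contained sketch of that record loop is shown below. The 5% threshold comes from the video; the polling interval, startup delay, grayscale comparison, and placeholder filenames are illustrative choices.

```python
import os
import time
import cv2
import numpy as np
from PIL import ImageGrab

CHANGE_THRESHOLD = 5.0   # percent of pixels that must differ (from the video)
STARTUP_DELAY_S = 5      # short startup delay before monitoring begins (assumed)
POLL_INTERVAL_S = 2      # how often to re-check the screen (assumed)

def percent_changed(prev: np.ndarray, curr: np.ndarray) -> float:
    """Percentage of grayscale pixels that differ between two frames."""
    diff = cv2.absdiff(prev, curr)
    return 100.0 * np.count_nonzero(diff) / diff.size

def record_loop() -> None:
    os.makedirs("screenshots", exist_ok=True)
    time.sleep(STARTUP_DELAY_S)
    previous, shot_index = None, 0
    while True:
        frame = cv2.cvtColor(np.array(ImageGrab.grab()), cv2.COLOR_RGB2GRAY)
        if previous is None or percent_changed(previous, frame) >= CHANGE_THRESHOLD:
            shot_index += 1
            # Placeholder name; the original swaps in a GPT-4o-suggested label.
            path = f"screenshots/shot_{shot_index:04d}.png"
            cv2.imwrite(path, frame)
            # The analyze step and the history append sketched elsewhere in
            # this article would run here before the next comparison.
            previous = frame
        time.sleep(POLL_INTERVAL_S)
```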

Although the prototype works, the builder warns against active use because the vision analysis relies on GPT-4o via an API, which means proprietary or private on-screen data can be transmitted externally. The project was initially intended to run fully locally, but performance and stability issues with local vision models forced a hybrid approach. Still, the result shows a feasible path toward a more privacy-preserving, local Recall-like system, provided a sufficiently strong open or local vision model becomes available.

Cornell Notes

The project builds a Recall-like workflow using three stages: screen capture, AI-based screenshot interpretation, and RAG-based retrieval. A script monitors pixel changes and only saves a new screenshot when the screen differs by at least 5%, reducing redundant captures. Each screenshot is sent to GPT-4o to extract key details—user actions and any visible URLs—and the system stores both the extracted text in a “history text” archive and the screenshot itself under an AI-generated, searchable filename. Later, local embeddings index the history text and Llama 3 retrieves the most relevant past events, returning the associated screenshot. This matters because it enables natural-language “search your past activity” with evidence, while also highlighting privacy tradeoffs when vision analysis uses an external API.

How does the system decide when to capture a new screenshot instead of saving every frame?

It compares the current screen to the previous screenshot using OpenCV and computes a pixel-difference percentage. Only when the difference reaches a threshold of 5% does it capture and save a new screenshot. If the user leaves the screen unchanged, pixel diffs stay low and no new screenshot is taken, preventing storage and processing spam.

What information gets extracted from each screenshot, and why is it useful for search?

GPT-4o analyzes each saved image to extract the most important information about what happened on-screen, including user interactions and any URLs visible on websites. That extracted text is appended to a history archive, turning raw pixels into searchable language. The system also uses GPT-4o to generate a concise, relevant filename from the description, so later queries can match keywords tied to the screenshot.
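
The video does not show the exact layout of the history archive; one plausible entry format that keeps the timestamp, screenshot name, and description together might look like this:

```python
from datetime import datetime

def append_to_history(description: str, screenshot_name: str,
                      history_path: str = "history.txt") -> None:
    """Append one time-stamped entry linking the extracted description
    to the screenshot it came from."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with open(history_path, "a", encoding="utf-8") as f:
        f.write(f"[{stamp}] screenshot: {screenshot_name}\n{description}\n\n")
```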

How does the RAG phase connect a natural-language question to the correct screenshot?

The RAG phase embeds the history text (which contains chunked action summaries) using local embedding models, then uses Llama 3 to search the embedding space. The retrieved result includes the screenshot name/label, which the user can use to open the corresponding PNG from the archive. In the demo, questions about GPT-2 on X led to retrieval of the correct screenshot file labeled for that activity.

Why chunk the extracted history text into ~1,000-character pieces?

Chunking makes the archive easier to embed and retrieve effectively. Each extracted action is divided into maximum ~1,000-character segments and appended to the history text so the embedding step can index the sequence of events in manageable units. This improves retrieval granularity when searching across many captured moments.
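
A simple whitespace-aware chunker along those lines (the exact splitting logic is not shown in the video, so this is an assumption):

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split an extracted action description into segments of at most
    max_chars characters, breaking on whitespace so words stay intact."""
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = (current + " " + word).strip()
    if current:
        chunks.append(current)
    return chunks
```

Each extracted action would be run through a chunker like this before its segments are appended to the history text and embedded.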

What privacy concern limits practical use of this prototype?

The screenshot interpretation step relies on GPT-4o through an API, meaning proprietary or private on-screen content may be transmitted externally. The builder explicitly recommends against active use for that reason and treats the setup as a prototype until a strong local vision model can replace the API call.

Review Questions

  1. What threshold and method does the system use to trigger new screenshots, and what problem does that solve?
  2. How do the AI-generated screenshot filenames influence retrieval accuracy in the RAG search?
  3. Trace the data flow from a captured PNG to a user query returning the correct screenshot—what gets embedded and what gets retrieved?

Key Points

  1. The system captures screenshots only when screen content changes by at least 5% in pixel difference, using OpenCV to reduce redundant saves.

  2. GPT-4o extracts on-screen actions and visible URLs from each screenshot and appends the results to a searchable “history text” archive.

  3. Each screenshot is saved with an AI-generated, concise filename derived from the extracted description to support keyword-based retrieval later.

  4. RAG retrieval is built by embedding the history text with local embedding models and searching with Llama 3 to map questions to past events.

  5. Chunking extracted history into ~1,000-character segments helps embedding and retrieval across long activity timelines.

  6. The prototype works as a Recall-like evidence-and-search system, but active use is discouraged because vision analysis uses an external GPT-4o API.

  7. A fully local version is the intended future direction, but local vision performance and stability issues prevented it in this build.

Highlights

A 5% pixel-diff threshold prevents the system from spamming screenshots when nothing changes on-screen.
GPT-4o generates searchable screenshot filenames from the analyzed content, turning images into retrievable artifacts.
Natural-language queries can return both a relevant history entry and the exact PNG screenshot tied to that moment.
The RAG index is built from embedded “history text,” not from raw images alone.
The prototype demonstrates feasibility but flags privacy risk because screenshot analysis uses an API.

Topics

Mentioned

  • RAG
  • GPT
  • GPT-4o
  • OpenCV
  • LLM
  • PNG
  • API