Create Your Own Microsoft Recall AI Feature with RAG?
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical “Recall”-style system can be built by combining automated screen capture, AI-based interpretation of what’s on-screen, and a retrieval layer that lets users search past activity by natural language. The core idea is to avoid saving every frame: the setup only records a new screenshot when the screen changes enough, then uses a vision-capable GPT model to extract actions and URLs from that image, storing both the screenshot and a text “history” entry. Later, a RAG (retrieval-augmented generation) pipeline embeds that history locally and retrieves the most relevant past events—along with the exact screenshot—when someone asks questions like what they did on a specific site.
The workflow is split into three phases. In the record phase, a script continuously monitors pixel differences between the current screen and the previous screenshot. If at least 5% of pixels change, it captures a new screenshot and saves it with a custom, AI-generated filename meant to function like a searchable label. That filename is produced by GPT-4o and is derived from the analyzed on-screen description, so later queries can match keywords rather than random IDs. The analyze phase then sends each saved screenshot to GPT-4o to extract the most important information: what interactions occurred, any visible URLs, and the associated screenshot name. The extracted text is appended to an archive (“history text”), and the screenshot itself is stored in a dedicated folder so the user can “rewind” to a frozen moment.
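The record phase's trigger can be sketched as a pure change-detection check. This is a minimal sketch, not the creator's actual script: it uses plain NumPy for the pixel diff (the original uses OpenCV, whose `cv2.absdiff` computes the same thing), and the `should_capture` helper name is invented for illustration.

```python
import numpy as np

# Per the transcript: capture only when at least 5% of pixels changed.
CHANGE_THRESHOLD = 0.05

def changed_fraction(prev: np.ndarray, curr: np.ndarray, tol: int = 0) -> float:
    """Fraction of pixels that differ between two same-sized grayscale frames.

    Cast to a signed type first so the subtraction cannot wrap around
    (uint8 underflow would silently corrupt the diff).
    """
    diff = np.abs(prev.astype(np.int16) - curr.astype(np.int16))
    return float(np.count_nonzero(diff > tol)) / diff.size

def should_capture(prev: np.ndarray, curr: np.ndarray) -> bool:
    """True when the screen changed enough to justify a new screenshot."""
    return changed_fraction(prev, curr) >= CHANGE_THRESHOLD

# Example: changing the top 10 rows of a 100x100 frame alters 10% of pixels,
# which clears the 5% threshold and would trigger a capture.
prev = np.zeros((100, 100), dtype=np.uint8)
curr = prev.copy()
curr[:10, :] = 255
print(should_capture(prev, curr))  # True
```

In the full loop, the script would grab the screen on a timer, run this check against the previous saved frame, and only then call GPT-4o to produce the descriptive filename and history entry.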
The retrieval phase turns the archive into a searchable knowledge base. The system creates embeddings from the history text using local embedding models, then uses Llama 3 to search within that RAG space. A user can ask a question such as “Did I visit Discord yesterday?” and the system returns the relevant history entry plus the screenshot filename, enabling the user to open the exact captured image. The transcript emphasizes that this approach mirrors the conceptual mechanism behind Microsoft Recall: store time-stamped evidence, summarize it into retrievable text, and connect queries to the underlying artifacts.
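The retrieval step can be illustrated with a small similarity search over history entries. This sketch substitutes a bag-of-words vector and cosine similarity for the real local embedding model, and it omits the Llama 3 answering step entirely; the history strings and screenshot filenames are hypothetical examples, not from the video.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words token count.

    The actual build uses a local embedding model; any vectorizer with a
    similarity measure fills the same role in this sketch.
    """
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, history: list[str], k: int = 1) -> list[str]:
    """Return the k history entries most similar to the query."""
    q = embed(query)
    return sorted(history, key=lambda h: cosine(q, embed(h)), reverse=True)[:k]

# Hypothetical history entries in the "action | screenshot name" shape
# the analyze phase produces.
history = [
    "Opened discord.com and read channel messages | screenshot: discord_chat_view.png",
    "Browsed a Python tutorial article | screenshot: python_tutorial_page.png",
]
print(retrieve("Did I visit Discord yesterday?", history))
```

Because the stored entry carries the screenshot filename, the top match gives the user both the text answer and a pointer to the exact captured image, which is the "evidence" link the transcript emphasizes.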
A key implementation detail is chunking: each extracted action is broken into segments capped at about 1,000 characters and appended to the history text so the embedding step can index the full sequence of user activity. The system also includes a “compare screenshots” step using OpenCV to compute pixel diffs, and it runs in a loop with a short startup delay. In a live demonstration, the script captures activity while the user browses sites and opens content; later, the RAG layer successfully answers questions about reading posts and identifies related screenshots by matching query terms to the AI-generated filenames.
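A simple way to implement that cap is to split on word boundaries while accumulating up to the character limit. This is a generic sketch of the chunking idea under the ~1,000-character cap the transcript describes, not the project's exact code.

```python
def chunk_text(text: str, max_len: int = 1000) -> list[str]:
    """Split extracted history text into segments of at most max_len
    characters, breaking between words so no word is cut in half."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # +1 accounts for the joining space when current is non-empty.
        if current and len(current) + 1 + len(word) > max_len:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then appended to the history text and embedded individually, so a query can land on the right slice of a long activity timeline instead of one oversized document.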
Despite working as a prototype, the builder warns against active use because the vision analysis relies on GPT-4o via an API, which can transmit proprietary or private data. The project was initially intended to run fully locally, but performance and stability issues with local vision models forced a hybrid approach. Still, the result shows a feasible path toward a more privacy-preserving, local Recall-like system—provided a sufficiently strong open or local vision model becomes available.
Cornell Notes
The project builds a Recall-like workflow using three stages: screen capture, AI-based screenshot interpretation, and RAG-based retrieval. A script monitors pixel changes and only saves a new screenshot when the screen differs by at least 5%, reducing redundant captures. Each screenshot is sent to GPT-4o to extract key details—user actions and any visible URLs—and the system stores both the extracted text in a “history text” archive and the screenshot itself under an AI-generated, searchable filename. Later, local embeddings index the history text and Llama 3 retrieves the most relevant past events, returning the associated screenshot. This matters because it enables natural-language “search your past activity” with evidence, while also highlighting privacy tradeoffs when vision analysis uses an external API.
How does the system decide when to capture a new screenshot instead of saving every frame?
What information gets extracted from each screenshot, and why is it useful for search?
How does the RAG phase connect a natural-language question to the correct screenshot?
Why chunk the extracted history text into ~1,000-character pieces?
What privacy concern limits practical use of this prototype?
Review Questions
- What threshold and method does the system use to trigger new screenshots, and what problem does that solve?
- How do the AI-generated screenshot filenames influence retrieval accuracy in the RAG search?
- Trace the data flow from a captured PNG to a user query returning the correct screenshot—what gets embedded and what gets retrieved?
Key Points
1. The system captures screenshots only when screen content changes by at least 5% in pixel difference, using OpenCV to reduce redundant saves.
2. GPT-4o extracts on-screen actions and visible URLs from each screenshot and appends the results to a searchable “history text” archive.
3. Each screenshot is saved with an AI-generated, concise filename derived from the extracted description to support keyword-based retrieval later.
4. RAG retrieval is built by embedding the history text with local embedding models and searching with Llama 3 to map questions to past events.
5. Chunking extracted history into ~1,000-character segments helps embedding and retrieval across long activity timelines.
6. The prototype works as a Recall-like evidence-and-search system, but active use is discouraged because vision analysis uses an external GPT-4o API.
7. A fully local version is the intended future direction, but local vision performance and stability issues prevented it in this build.