
I Built My Second Brain with AI (GPT-3)

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Convert text into vector embeddings and store them alongside the original strings in a JSON structure to enable semantic search.

Briefing

A “second brain” built from personal notes can be made searchable by converting text into vector embeddings and then using semantic search to answer questions, even when the user can’t recall exact dates. The core mechanism is straightforward: notes are stored as text, each chunk is transformed into a high-dimensional numeric vector, and queries are matched to the closest vectors so the system can retrieve relevant memories and summarize them.
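
To make the matching step concrete: comparing “closeness” between embeddings is commonly done with cosine similarity. Below is a minimal sketch, assuming cosine similarity and illustrative record fields (the video describes a “Vector” field; the “text” field name is an assumption):

```python
# Minimal matching sketch: treat each stored record as {"text": ..., "Vector": [...]}
# and score it against the query embedding with cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_vector, records):
    # Return the stored record whose embedding is closest in meaning to the query.
    return max(records, key=lambda r: cosine_similarity(query_vector, r["Vector"]))
```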

The walkthrough starts by defining semantic search through a library analogy: instead of matching exact keywords, the system compares the meaning of a query to the characteristics of stored items. Personal journal entries—such as daily activities, work, workouts, and viewing habits—are collected over 12 days, converted into a JSON structure, and embedded into vectors (described as thousands of numbers per entry). Those vectors become the searchable index.
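
The video's exact code isn't reproduced here, but the index build it describes can be sketched with OpenAI's embeddings API; the model name below is an assumed stand-in for the GPT-3-era model used in the video:

```python
# Hypothetical index build: embed each journal entry and store text +
# embedding side by side in a JSON file. Field names and the model are
# assumptions, not the video's exact code.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_index(entries, path="brain.json"):
    records = []
    for text in entries:
        embedding = client.embeddings.create(
            model="text-embedding-3-small",  # assumed stand-in for the GPT-3-era model
            input=text,
        ).data[0].embedding
        records.append({"text": text, "Vector": embedding})
    with open(path, "w") as f:
        json.dump(records, f)

build_index([
    "Friday the 13th: drove my mother to the airport at 4:40. Watched 'Her', very good.",
])
```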

In practice, the “ask your second brain” interface demonstrates how natural-language questions map to retrieved entries. When asked, “When did I drive my mother to the airport?” the system returns a specific date (“Friday the 13th”) and a time (“4:40”), then adds a summarization step that can be removed if the user wants only the raw facts. Similar queries locate a client meeting (“Thursday the 19th of January at 11”) and confirm preferences through evidence in the notes, such as liking the movie “Her,” based on an entry that it was watched and rated “very good.”
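
A query step consistent with that behavior might look like the following sketch; the model names and prompt are assumptions, and the `summarize` flag mirrors the optional summarization described above:

```python
# Hypothetical query step: embed the question, pick the closest stored
# record by cosine similarity, then optionally summarize it with a chat
# completion. Models, prompt, and field names are assumptions.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def ask(question, path="brain.json", summarize=True):
    with open(path) as f:
        records = json.load(f)
    q = np.array(
        client.embeddings.create(
            model="text-embedding-3-small", input=question
        ).data[0].embedding
    )
    def score(r):
        v = np.array(r["Vector"])
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    best = max(records, key=score)
    if not summarize:
        return best["text"]  # raw facts, matching the "remove summarization" option
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed stand-in for the GPT-3 completion model
        messages=[{
            "role": "user",
            "content": f"Answer using only this note.\nNote: {best['text']}\nQuestion: {question}",
        }],
    )
    return reply.choices[0].message.content

print(ask("When did I drive my mother to the airport?"))
```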

The system also supports “memory editing” behavior. After deleting the airport-driving memory, the same question yields no matching date, effectively simulating the removal of that fact from the vector-backed store.
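
Under the hood, this kind of deletion is just removing records from the JSON store before the next search. A minimal sketch, where keyword matching is an illustrative stand-in for however the video selects the entry:

```python
# Hypothetical "memory deletion": drop matching records and rewrite the
# JSON file; subsequent searches can no longer retrieve that fact.
import json

def forget(keyword, path="brain.json"):
    with open(path) as f:
        records = json.load(f)
    kept = [r for r in records if keyword.lower() not in r["text"].lower()]
    with open(path, "w") as f:
        json.dump(kept, f)

forget("airport")  # removes the airport-driving entry
```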

The second half shifts from personal journaling to building a knowledge base from external sources. Four articles about OpenAI (from 2023) are copied into a single text file, converted into vectors, and saved as an “openbrain.json” file containing thousands of lines of embeddings. Testing the knowledge base shows it can answer questions like “What happened to OpenAI in 2023?” with a summary that highlights Microsoft’s multi-year, multi-billion-dollar investment and ongoing cloud partnership details (including Azure’s role as OpenAI’s exclusive cloud provider). It can also extract entities into a list (including names such as Microsoft, OpenAI, ChatGPT, Sam Altman, and GitHub), though the output is described as needing prompt adjustments for better structure.
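
The article-based brain follows the same embed-and-save loop as the journal example; the only new step is splitting one large text file into chunks first. A naive sketch, with the filename assumed:

```python
# Hypothetical build step for the article brain: split one big text file
# into fixed-size chunks, then feed them through the same embed-and-save
# loop as the journal example, writing the result to openbrain.json.
def chunk_file(path, chunk_size=1000):
    with open(path) as f:
        text = f.read()
    # Naive fixed-size chunking; paragraph-aware splitting usually retrieves better.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_file("openai_articles.txt")  # assumed filename
```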

Some answers come back as uncertain when sources conflict—such as whether Microsoft will include ChatGPT in Bing—illustrating a key limitation: semantic search retrieves and summarizes what’s present in the indexed text, not a guaranteed single “truth.” The same pattern appears with pricing questions for “ChatGPT Pro,” where one source provides a price while another says it’s undisclosed, leading to a conclusion that the price isn’t definitively established.

Overall, the build demonstrates a practical alternative to fine-tuning for question answering over large text corpora: convert text to vectors, store them in JSON, and retrieve via semantic similarity. The result is a flexible, modifiable personal or domain-specific knowledge system that can be expanded by feeding in more text and continuing to iterate on prompts and summarization behavior.

Cornell Notes

The system turns notes or articles into a searchable “second brain” by converting text into vector embeddings stored in JSON. Queries are answered by semantic search: the question is embedded and matched against the closest stored vectors, then the relevant entries are summarized or returned as facts. A personal journal example shows date- and event-level retrieval (e.g., when someone drove a parent to the airport) and even “deletion” of a memory by removing an entry from the index. A second example builds an OpenAI-focused brain from 2023 articles, enabling Q&A about Microsoft’s investment and cloud partnership details, while also surfacing uncertainty when sources conflict. The approach emphasizes retrieval over fine-tuning for large-text question answering.

How does semantic search using vectors find the right memory when exact keywords aren’t known?

Instead of matching literal words, the system embeds both stored text and the user’s query into numeric vectors that represent meaning. When asked something like “When did I drive my mother to the airport?” the query is converted into an embedding and compared against embeddings for the journal entries. The closest matches—based on vector similarity—are returned, which is why the system can retrieve “Friday the 13th” and “4:40” even though the user didn’t provide the date or time.

What does the JSON + embeddings structure enable in the “second brain” workflow?

Notes are converted into a JSON object containing the original strings plus a “Vector” field holding embeddings. Those vectors act as the searchable index. With this structure, the system can run a build step to generate embeddings for a large text file (thousands of lines), then later load the JSON and answer questions by comparing query embeddings to the stored vectors.
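
One plausible shape for such a record, with the embedding truncated for display (the video describes thousands of numbers per entry; the “text” field name is an assumption):

```json
[
  {
    "text": "Friday the 13th: drove my mother to the airport at 4:40.",
    "Vector": [0.0123, -0.0456, 0.0789]
  }
]
```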

What evidence-based behavior appears when asking about preferences or activities?

The system answers by retrieving the relevant entries that mention the topic. For example, when asked whether the user liked the movie “Her,” the system points to a journal entry dated “Friday 13th” stating the movie was watched and described as “very good.” Similarly, asking about Premier League viewing in the last 10 days returns a list of watched games with dates and scores, drawn from the stored notes.

How does “memory deletion” work in this setup?

A specific memory can be removed from the underlying stored dataset (the indexed entries). After deleting the entry about driving the mother to the airport, the same question returns no matching date from the remaining indexed items. In effect, the vector search no longer has that fact to retrieve.

Why do some answers come back as “unclear” or conflicting?

Because the knowledge base is built from multiple articles that may disagree. When asked whether Microsoft will include ChatGPT in Bing, the system returns conflicting signals and concludes there’s no clear indication. The same happens for “ChatGPT Pro” pricing: one source cites “42 dollars per month,” while another says the price isn’t disclosed, leading to an overall conclusion that the price isn’t definitively established.

What’s the practical takeaway about fine-tuning versus retrieval for Q&A?

For answering questions over a large body of text, the workflow emphasizes retrieval via embeddings rather than fine-tuning. The approach is framed as: convert big text files into vectors/JSON, then search and answer from that indexed content. Fine-tuning is treated as unnecessary when the goal is primarily question answering over existing documents.

Review Questions

  1. When a user asks a question without providing exact dates or keywords, what two embedding steps must happen for semantic search to work?
  2. How would you expect the system’s answer to change if you add a new article that contradicts an existing one in the indexed corpus?
  3. What are two different ways the system demonstrates retrieval quality using the personal journal example?

Key Points

  1. Convert text into vector embeddings and store them alongside the original strings in a JSON structure to enable semantic search.

  2. Use natural-language queries; the system embeds the query and retrieves the closest matching stored entries based on vector similarity.

  3. Journal-style notes can support factual retrieval (dates, times, events) and preference confirmation when the notes contain explicit evidence.

  4. Removing an entry from the indexed dataset effectively deletes that memory from future answers.

  5. Building a domain brain from external articles enables Q&A, entity extraction, and summarization over the indexed sources.

  6. Conflicting sources produce uncertain or mixed answers because retrieval reflects what’s present in the corpus, not a single authoritative truth.

  7. For question answering over large text collections, retrieval with embeddings is positioned as a practical alternative to fine-tuning.

Highlights

Semantic search retrieves memories by meaning, not exact keyword matches, using vector embeddings stored in JSON.
The system can “delete” a memory: after removing the airport-driving entry, the same question returns no matching date.
An OpenAI-focused knowledge base built from 2023 articles can answer questions about Microsoft’s multi-year investment and Azure partnership details.
When sources disagree—such as ChatGPT in Bing or ChatGPT Pro pricing—the system returns uncertainty rather than forcing a single answer.
