
What Your Vault Knows — Talks & Discussion

5 min read

Based on the Obsidian Community Talks video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.

TL;DR

TF-IDF converts notes into weighted term vectors by combining term frequency with inverse document frequency to emphasize distinctive words.

Briefing

A three-part discussion on extracting more value from Obsidian vaults centered on a practical question: how can computational methods turn existing notes into better discovery, understanding, and assistance without forcing users to manually connect everything themselves? The strongest throughline was that "relatedness" can be computed (not just guessed), that vaults can be represented as graphs and embeddings, and that language models can be wired into Obsidian workflows via reusable "skills." Together, the ideas sketch a path from search and linking to tutoring, summarization, and even deterministic lookups.

Ben’s segment began with a classic document-similarity approach: represent each note as a weighted vector of terms using TF-IDF (term frequency–inverse document frequency). Words that appear frequently in one note but rarely across the vault get higher weights, producing an “encoding” of the note. Similarity then becomes a geometric problem: compare two vectors using cosine similarity, which measures the angle between them rather than raw distance—helpful when notes differ in length. The method is intentionally simple and intuitive, and it’s framed as a way to power “given a note, show me similar notes” inside Obsidian. Limitations surfaced quickly: it’s unlikely to work well for very short texts, and it’s a baseline that more advanced models could improve.

Emil shifted from similarity to knowledge-graph representation and visualization. The goal is a compact, navigable view of a vault's structure—less scrolling through walls of text, more understanding of which concepts connect and why. He described "yellow" as a graph-view system where users can define visual rules (node shapes, arrow types) and even annotate why links exist, improving intuition about the underlying graph. From there, Emil argued that language technology could automate the heavy lifting: named entity extraction, entity linking, and relation extraction. Instead of manually annotating every sentence, systems could infer entities and connections automatically, then optionally connect them to external sources like Wikipedia or Wikidata. He also referenced tools such as Codex (a web-based "operating system" for annotated writing), InfraNodus (automatic knowledge-graph construction from text), and Neo4j-based network analysis—positioning them as building blocks for an Obsidian-friendly "browse your vault as a graph" experience.

Paul’s contribution brought the discussion into hands-on automation with Duo, a virtual assistant for knowledge work integrated into Obsidian. Duo’s core mechanism is a chat interface powered by a skill system: users create markdown “skill files” that define patterns for what the assistant should do, often using placeholders that get filled from the user’s prompt or from context pulled from the vault. Examples included generating research questions, finding related concepts, creating quizzes/flash-card-style prompts from notes, and producing summaries or key points. Duo also supports deterministic actions via code blocks and external data sources—such as querying Wikidata through SPARQL-like requests—so not every task depends on free-form generation. A live demo showed skills computing expressions with JavaScript, retrieving related notes, and generating paragraphs based on vault context. The discussion also addressed risks: language models can confabulate, and fine-tuning on personal notes can bias style and content, so safety filters and careful deployment matter.

Across all three segments, the practical message was consistent: vault intelligence emerges when similarity metrics, graph representations, and language-model-driven skills are combined—turning scattered notes into navigable knowledge, and eventually into an assistant that can reason over what’s already been written.

Cornell Notes

The discussion focused on turning an Obsidian vault into something more “computable”: similar-note discovery, graph-based navigation, and an assistant that can act on vault content. Ben described a baseline similarity method using TF-IDF vectors and cosine similarity to rank notes by topic overlap, with the caveat that short texts often fail. Emil argued that knowledge-graph views can make vaults easier to browse, and that NLP can automate entity extraction, entity linking, and relation extraction so users don’t have to annotate everything manually. Paul demonstrated Duo, an Obsidian-integrated virtual assistant where markdown “skills” define reusable behaviors, including context-building from related notes and deterministic lookups via code and Wikidata queries. Together, the approaches show a pipeline from representation → retrieval → assistance.

How does TF-IDF plus cosine similarity turn two notes into a similarity score?

Each note is converted into a vector where each dimension corresponds to a term in the vault's dictionary. Term frequency (TF) counts how often a term appears in that note. Document frequency (DF) measures how many notes across the vault contain the term, and inverse document frequency (IDF) scales inversely with DF—commonly computed as log(n/DF(t)), with n the total number of notes—so common terms get down-weighted and rare terms get up-weighted. The resulting TF-IDF weights form the note's vector. Similarity is then computed using cosine similarity, which measures the angle between the two vectors—capturing whether they point in the same "direction" (similar term distributions) even if one note is longer than the other.
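The weighting scheme above can be sketched in a few lines of plain Python. This is a minimal illustration, not the talk's actual implementation: tokenization is a naive lowercase whitespace split, and the IDF uses the common log(n/DF) form.

```python
import math

def tfidf_vectors(notes):
    """Compute a sparse TF-IDF vector (term -> weight) for each note.

    Toy sketch: tokenization is a naive lowercase whitespace split,
    and IDF is the common log(n / DF) variant.
    """
    docs = [note.lower().split() for note in notes]
    n = len(docs)

    # Document frequency: how many notes contain each term at least once.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1

    vectors = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # Weight = TF * IDF; terms common across the vault approach weight 0.
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors
```

A term that appears in every note gets IDF = log(1) = 0, so it contributes nothing to similarity, while a term unique to one note gets the largest boost.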

Why is cosine similarity preferred over raw distance for note similarity?

Raw distance can be misleading because it mixes topic overlap with note length and overall magnitude. Two notes might share the same term distribution but one could contain more total words, changing vector length without changing topical focus. Cosine similarity focuses on the direction of the vectors (the relative weighting pattern), so it better reflects whether two notes are about the same subject rather than whether one is simply longer.
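The length-invariance argument is easy to demonstrate: scaling every weight in a vector (a "longer" note with the same topic mix) leaves the cosine score unchanged. A minimal sketch over sparse term-weight dicts:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term -> weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Same term distribution, doubled magnitude: cosine stays 1.0,
# while Euclidean distance between the two would be nonzero.
short_note = {"obsidian": 2.0, "vault": 1.0}
longer_note = {"obsidian": 4.0, "vault": 2.0}
```

By contrast, two notes with no shared terms have a dot product of 0 and a similarity of 0, regardless of their lengths.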

What’s the difference between manually annotating entities in a note and automating it with NLP?

Manual annotation (as in Codex-style writing) links entities and relations as the user types, producing a graph with explicit edges and often “why” explanations. Automating it replaces that keyboard work with NLP tasks: named entity extraction identifies entity mentions, entity linking maps mentions to canonical entities (e.g., Wikipedia/Wikidata IDs), and relation extraction infers connections between entities from surrounding text. The payoff is scalability—users can write normally in markdown while the system infers the graph structure.
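To make the pipeline concrete, here is a deliberately crude stand-in: "entities" are capitalized tokens, and "relations" are same-sentence co-occurrence edges. Real systems would use trained NER, entity-linking, and relation-extraction models; this toy only shows the shape of the text-to-graph step.

```python
import re
from itertools import combinations

def toy_entity_graph(text):
    """Toy text-to-graph sketch, standing in for a real NLP pipeline.

    "Entity extraction" = capitalized words (skipping sentence-initial
    position); "relation extraction" = co-occurrence within a sentence.
    Returns a set of undirected edges between entity mentions.
    """
    edges = set()
    for sentence in re.split(r"[.!?]", text):
        words = [w.strip(".,;:") for w in sentence.split()]
        # Skip the first word: English capitalizes it regardless of entity-hood.
        mentions = {w for w in words[1:] if len(w) > 1 and w[0].isupper()}
        for a, b in combinations(sorted(mentions), 2):
            edges.add((a, b))
    return edges
```

Entity linking—mapping each mention to a canonical Wikipedia/Wikidata ID—would be a separate lookup step on top of these mentions; it is omitted here.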

How does Duo’s “skill file” approach make the assistant’s behavior controllable?

Duo uses markdown skill definitions with a structured pattern and placeholders. When a user sends a message, Duo fills placeholders (like a subject/topic) and then completes the pattern to generate the output. Skills can also compose other skills (e.g., “find related notes” feeding context into “write a paragraph based on…”). Some skills include deterministic code blocks (e.g., JavaScript) for tasks like evaluating expressions, and others can call external services such as Wikidata via structured queries.
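The placeholder mechanic can be sketched with simple string templating. The skill text and the `{placeholder}` syntax below are hypothetical, illustrating the idea rather than Duo's actual file format:

```python
import re

# Hypothetical skill definition in the spirit of Duo's markdown skill files;
# the {placeholder} syntax is illustrative, not Duo's real format.
SKILL = """\
## Skill: research-questions
Generate three research questions about {topic},
using this context from related notes:
{context}
"""

def fill_skill(skill, **values):
    """Fill every {placeholder} in a skill pattern from keyword arguments."""
    return re.sub(r"\{(\w+)\}", lambda m: values[m.group(1)], skill)

# Composition: the output of a "find related notes" skill could supply
# the `context` value fed into this skill's pattern.
prompt = fill_skill(SKILL, topic="TF-IDF", context="Notes on cosine similarity.")
```

The filled pattern would then be handed to the language model (or, for deterministic skills, to a code block) to produce the final output.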

What role does deterministic querying (e.g., Wikidata) play alongside generative text?

Generative models can produce plausible but incorrect answers. Duo’s deterministic path uses structured requests to machine-friendly data sources (Wikidata) so factual lookups like “director of the Godfather” can be retrieved via a rule-based query mechanism. The system still uses placeholders (entity/property) but relies on external structured data rather than free-form generation for those fact-heavy tasks.
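A deterministic lookup of this kind can be sketched by filling entity/property placeholders into a SPARQL query string for the Wikidata endpoint. P57 is Wikidata's "director" property; the template below is illustrative, and a real skill would POST the result to https://query.wikidata.org/sparql:

```python
# Hypothetical deterministic-lookup skill: placeholders are filled into a
# SPARQL template instead of a free-form generation prompt.
QUERY_TEMPLATE = """\
SELECT ?valueLabel WHERE {{
  ?item rdfs:label "{entity}"@en ;
        wdt:{prop} ?value .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""

def build_lookup(entity, prop):
    """Build a rule-based Wikidata SPARQL query from placeholder values."""
    return QUERY_TEMPLATE.format(entity=entity, prop=prop)

query = build_lookup("The Godfather", "P57")  # P57 = director
```

Because the answer comes back as structured query results rather than generated text, this path cannot confabulate—it either returns matching data or nothing.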

Review Questions

  1. If you had two notes with the same key terms but one is much longer, which similarity measure from the discussion would likely handle that better and why?
  2. Explain how entity extraction, entity linking, and relation extraction differ, and which one is most directly responsible for turning text into graph edges.
  3. In Duo’s skill system, what are placeholders used for, and how can skills be composed to build context before generating an answer?

Key Points

  1. TF-IDF converts notes into weighted term vectors by combining term frequency with inverse document frequency to emphasize distinctive words.

  2. Cosine similarity compares note vectors by angle, reducing sensitivity to note length and focusing on term distribution overlap.

  3. Graph-based vault views aim to make relationships navigable by showing nodes, edges, and optionally “why” explanations for connections.

  4. NLP automation can replace manual entity annotation by performing entity extraction, entity linking, and relation extraction to infer graph structure from text.

  5. Duo’s behavior is controlled through markdown “skill files” that define patterns with placeholders filled from user prompts and/or vault context.

  6. Duo can mix generative responses with deterministic actions, including code execution and structured queries to sources like Wikidata.

  7. Safety and reliability remain central concerns because language models can confabulate and personal fine-tuning can introduce bias in style and content.

Highlights

Ben’s baseline method treats each note as a TF-IDF vector and ranks similarity using cosine similarity, turning “related notes” into a geometric computation.
Emil’s vision is a vault browser that shows a small, summarized graph of concepts—made feasible by automating entity and relation extraction from normal writing.
Paul’s Duo integrates a skill system into Obsidian, letting users define reusable behaviors that can both generate text and run deterministic lookups (e.g., Wikidata).

Topics

Mentioned

  • Ben
  • Emil
  • Paul Brickman
  • TF-IDF
  • IDF
  • DF
  • NLP
  • AI
  • GPT-3
  • GPT-2
  • SPARQL
  • HTTP
  • CPU
  • GPU