KGC23 Keynote: The Future of Knowledge Graphs in a World of LLMs — Denny Vrandečić, Wikimedia

6 min read

Based on The Knowledge Graph Conference's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

Knowledge graphs provide factual answers through explicit entity-relation storage, making many lookup-style questions cheaper and faster than token-by-token LLM inference.

Briefing

Large language models can answer questions, but knowledge graphs deliver the same kind of factual reliability far more cheaply—especially when answers require precise lookups rather than generative computation. Denny Vrandečić, co-founder of Wikidata and a longtime knowledge-graph architect, argues that the most practical future isn’t choosing between LLMs and knowledge graphs. Instead, it’s combining them: use LLMs as an interface and orchestration layer, while knowledge graphs provide ground truth, auditability, and efficient retrieval.

Vrandečić frames the moment as a rapid adoption cycle similar to past technology inflections, pointing to how quickly Stable Diffusion reached massive user numbers. That speed, he suggests, has created a kind of shock—teams are scrambling to understand what LLMs change in how knowledge is processed and where existing systems fit. He narrows the scope to the technical relationship between knowledge graphs and LLMs, explicitly avoiding broader debates about ethics, copyright, or existential risks.

The core case for knowledge graphs comes from first principles: a knowledge graph stores entities and relationships so answering many factual queries becomes a graph lookup. A large language model, by contrast, must run inference across many layers and parameters to generate tokens—even when the question is essentially a retrieval task. Vrandečić illustrates this with a concrete example about who created Raphael’s “School of Athens.” He reports that LLMs produce fluent, contextual answers but take seconds, while Wikidata-style querying returns quickly. He then scales the comparison using cost estimates: thousands of Wikidata queries in cloud settings cost cents, while running GPT-4-class inference at similar scale can cost dollars—on the order of tens of times more.
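The lookup side of that comparison can be sketched in a few lines. This is a minimal sketch, assuming the public Wikidata SPARQL endpoint and the `wdt:P170` ("creator") property; the helper functions are invented for illustration, not taken from the talk:

```python
import urllib.parse

# Public Wikidata query endpoint; the answer to a creator lookup comes back
# as structured JSON from an index lookup, not token-by-token generation.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_creator_query(work_label: str, lang: str = "en") -> str:
    """SPARQL that finds the creator (wdt:P170) of a work, matched by label."""
    return (
        f'SELECT ?creatorLabel WHERE {{\n'
        f'  ?work rdfs:label "{work_label}"@{lang} ;\n'
        f'        wdt:P170 ?creator .\n'
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}\n'
        f'}}'
    )

def query_url(work_label: str) -> str:
    """Build the GET URL; send it with any HTTP client to get JSON bindings."""
    params = urllib.parse.urlencode(
        {"query": build_creator_query(work_label), "format": "json"})
    return f"{WIKIDATA_ENDPOINT}?{params}"

print(query_url("The School of Athens"))
```

The point is structural: the expensive part of answering is done once, when the fact is curated into the graph, not repeated at every query.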

He also argues that LLMs struggle with consistency and provenance in ways that matter for real-world knowledge. A Wikipedia “rabbit hole” about the birthplace of actress Anna Begović yields conflicting answers across Google, Wikipedia-derived sources, and Wikidata, with contamination from earlier claims complicating verification. In another example, he describes how asking an LLM-based system for the birthplace of a person can produce different answers depending on language context, even when the underlying facts should be stable. He further notes that LLMs can miss edge cases—like “mayors of cities born after 1998”—and may respond confidently while being inconsistent with earlier statements.

From there, the proposed direction is architectural. Vrandečić advocates “augmented language models,” where LLMs don’t just generate text but call external tools: knowledge-base queries, math engines, and function libraries. He points to “toolformer” as an example of using LLMs to decide which services to invoke. In this setup, knowledge graphs become both a knowledge store and a set of functions—something LLMs can query for ground truth rather than relearn facts repeatedly inside model parameters.
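The orchestration pattern above can be sketched with a toy router. Everything here is invented for illustration: the prefix heuristic stands in for the LLM's tool-selection step (a Toolformer-style system would have the model emit structured tool calls), and the dictionary stands in for a real knowledge graph:

```python
from __future__ import annotations

# Toy "augmented language model" loop: a router (standing in for the LLM's
# tool-selection step) dispatches questions to external tools instead of
# generating facts from model memory. All names and data are illustrative.

TRIPLES = {("The School of Athens", "creator"): "Raphael"}  # stand-in knowledge graph

def kg_lookup(entity: str, relation: str) -> str | None:
    """Ground-truth retrieval: a dict lookup in place of a SPARQL query."""
    return TRIPLES.get((entity, relation))

def math_tool(expression: str) -> int:
    """Deterministic arithmetic tool (handles only 'a + b' / 'a * b' forms)."""
    a, op, b = expression.split()
    return {"+": int(a) + int(b), "*": int(a) * int(b)}[op]

def route(question: str) -> str:
    """Mock of the model's tool choice; facts come from the graph, math from
    the math tool, and only the remainder falls back to generation."""
    if question.startswith("Who created "):
        entity = question.removeprefix("Who created ").rstrip("?")
        found = kg_lookup(entity, "creator")
        return found if found is not None else "unknown"
    if question.startswith("Compute "):
        return str(math_tool(question.removeprefix("Compute ").rstrip("?")))
    return "(fall back to free-form generation)"

print(route("Who created The School of Athens?"))  # -> Raphael
```

The design choice is the division of labor: factual content is fetched from an explicit, queryable source, so it can be audited and edited, while the model handles language understanding and phrasing.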

He adds a second, structural argument about why facts shouldn’t be internalized in model weights. With models growing large to memorize trivia-like knowledge (including multilingual entity facts), he questions whether it’s efficient to embed millions of statements that could instead live in curated, editable symbolic systems. His closing vision is a world where LLMs generate infinite content, but knowledge graphs preserve the “true story”: auditable, editable, and designed to handle uncertainty explicitly. He even suggests adding a new special value for “it’s complicated” to represent cases where knowledge lacks a single clean ground truth, enabling systems to spend more effort verifying those edges. The takeaway: LLMs are powerful language interfaces, but knowledge graphs are the infrastructure for reliable, cost-effective knowledge.

Cornell Notes

Denny Vrandečić argues that knowledge graphs remain essential in an LLM world because they provide efficient, auditable factual retrieval that generation models can’t match on cost or consistency. He contrasts graph lookup—where answers come from stored entities and relationships—with LLM inference—where producing an answer requires running through many layers to generate tokens even for simple facts. He cites examples of conflicting or inconsistent birthplace answers across languages and systems, plus edge cases where LLMs fail to find facts that a knowledge graph query can return. The proposed path forward is “augmented language models”: use LLMs as an orchestration and UX layer that calls knowledge-graph queries and other tools for ground truth. This combination aims to reduce hallucinations, improve explainability, and keep costs manageable at scale.

Why does Vrandečić claim knowledge graphs can be cheaper than LLMs for factual question answering?

He grounds the argument in first principles: knowledge graphs store entities and relations so answering many questions is a lookup in a graph database (e.g., Wikidata). LLMs must run inference across many layers/parameters to generate tokens for every answer, which becomes expensive at scale. He gives a cost comparison using cloud-style workloads: thousands of Wikidata queries are estimated in the cents range, while GPT-4-class inference for similar usage is estimated in the dollars range—roughly a 50x gap in his example. Even if LLM costs fall over time, the structural difference remains: lookup operations scale differently than repeated token generation through deep networks.
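The scaling argument is simple arithmetic. The per-query figures below are placeholders chosen only to match the shape of the talk's claim (cents-range lookups, dollars-range generation, roughly a 50x gap); they are not quoted prices:

```python
# Back-of-envelope version of the cost comparison. Both per-call figures are
# assumed for illustration, not measured or quoted.

N_QUERIES = 10_000
KG_COST_PER_QUERY = 0.00001   # assumed: a fraction of a cent per graph lookup
LLM_COST_PER_QUERY = 0.0005   # assumed: token-billed inference per answer

kg_total = N_QUERIES * KG_COST_PER_QUERY    # cents-range total
llm_total = N_QUERIES * LLM_COST_PER_QUERY  # dollars-range total

print(f"KG: ${kg_total:.2f}, LLM: ${llm_total:.2f}, "
      f"ratio: {llm_total / kg_total:.0f}x")
```

Whatever the exact prices, the ratio tracks the structural difference: a lookup touches an index once, while generation runs the full network once per token.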

What kinds of failures does Vrandečić describe that motivate grounding answers in knowledge graphs?

He highlights inconsistency and provenance problems. In one case, birthplace claims about Anna Begović differ across Google/Wikipedia-derived sources and Wikidata, and he describes how web sources can be contaminated by earlier Wikipedia claims, making verification harder. In another, an LLM-based system returns different birthplace answers depending on language context (e.g., an answer tied to Zagreb in one language), even when the underlying fact should be stable. He also notes that LLMs can miss edge cases—like very recent “mayors born after 1998”—and may respond without detecting contradictions with earlier outputs.

How does the “augmented language model” idea change the division of labor between LLMs and knowledge graphs?

Instead of letting an LLM generate everything, the model acts as an orchestrator that decides when to call external capabilities. Vrandečić describes architectures where LLMs use tool-like services: knowledge-base queries (e.g., SPARQL-style lookups), math functions, and other “Wiki functions” or function repositories. The knowledge graph supplies ground truth and structured answers; the LLM supplies language understanding and user-facing interaction. This reduces hallucinations because the factual content comes from explicit, queryable sources rather than internalized model memory.

What does Vrandečić mean by “don’t internalize every fact in model parameters”?

He argues that large models spend many parameters on memorizing trivia-like knowledge—such as multilingual entity facts and membership lists—that could live more efficiently in curated symbolic stores. He compares model sizes (e.g., LLMs with billions of weights) to the scale of knowledge graph content (millions of entities/relations) and questions whether it’s necessary to embed those statements inside weights. His alternative: use knowledge graphs as extraction and storage infrastructure—extract facts into symbolic form, curate/edit them, and then query them reliably rather than retraining or relearning repeatedly.
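The "facts outside the weights" point can be made concrete with a toy symbolic store: correcting a statement is a single write, not a retraining run. The API and the example facts are invented for illustration:

```python
from __future__ import annotations

# Toy symbolic fact store: statements are explicit (subject, predicate) -> value
# entries that can be curated, edited, and audited, instead of being spread
# across billions of model weights.

store: dict[tuple[str, str], str] = {}

def assert_fact(subject: str, predicate: str, value: str) -> None:
    """Add or correct a statement; editing is an O(1) overwrite."""
    store[(subject, predicate)] = value

def lookup(subject: str, predicate: str) -> str | None:
    """Retrieval is a lookup; provenance could be attached per entry."""
    return store.get((subject, predicate))

assert_fact("Douglas Adams", "birthplace", "Cambridge")
assert_fact("Douglas Adams", "birthplace", "Cambridge, England")  # curate: overwrite
print(lookup("Douglas Adams", "birthplace"))  # -> Cambridge, England
```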

Why propose a special value like “it’s complicated” in knowledge graphs?

He suggests knowledge graphs need an explicit representation for cases where there isn’t a single clean ground truth. Existing special values can represent “no value” (e.g., someone has no child) or “unknown value” (e.g., father is unknown). “It’s complicated” would signal that the situation requires extra reasoning or verification. In an LLM-connected system, that marker could trigger the model to spend more effort checking that edge case, and it would help systems communicate uncertainty more accurately.
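Wikidata already supports "no value" and "unknown value" markers; the talk proposes a third. A sketch of how such a sentinel could steer downstream effort (the enum and routing function are invented for illustration):

```python
from enum import Enum

class Special(Enum):
    """Wikidata-style special values, plus the proposed third one."""
    NO_VALUE = "no value"                  # e.g. a person with no children
    UNKNOWN_VALUE = "unknown value"        # e.g. father not known
    ITS_COMPLICATED = "it's complicated"   # proposed: no single clean ground truth

def answer(claim):
    """Route a stored claim: plain values pass through; the 'complicated'
    marker tells a downstream LLM to spend extra verification effort."""
    if claim is Special.ITS_COMPLICATED:
        return "needs-verification"
    if claim in (Special.NO_VALUE, Special.UNKNOWN_VALUE):
        return claim.value
    return claim

print(answer(Special.ITS_COMPLICATED))  # -> needs-verification
```

The marker matters because it is machine-readable: a pipeline can budget more checking for exactly the edges where truth is contested, instead of treating every claim as equally settled.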

How does language affect LLM-based answers in Vrandečić’s examples?

He reports that the same birthplace question can yield different answers depending on the language used in the prompt. For example, an LLM-based system gives an answer tied to Zagreb in one language context, even when no single source on the web clearly supports that exact connection. He contrasts this with Wikidata queries, which return distributions of birthplaces based on structured entity data. The implication is that LLM outputs can reflect learned correlations or prompt-language effects rather than stable, queryable facts.
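The graph-side view can be sketched as a simple aggregation: structured claims are countable, so disagreement surfaces as a distribution rather than one confident sentence. The claim rows below are invented for illustration:

```python
from collections import Counter

# (source, reported birthplace) pairs; all rows are illustrative, standing in
# for what a GROUP BY / COUNT query over structured claims would return.
claims = [
    ("en prompt", "Zagreb"),
    ("hr prompt", "Split"),
    ("wikidata",  "Split"),
    ("wikipedia", "Split"),
]

distribution = Counter(place for _, place in claims)
# The disagreement is explicit and inspectable, source by source.
print(distribution.most_common())
```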

Review Questions

  1. In Vrandečić’s framework, what specific tasks should be handled by knowledge graphs versus by LLMs?
  2. What evidence does he give that LLM answers can vary by language or fail on edge cases, and why does that matter for reliability?
  3. How would an “augmented language model” reduce hallucinations compared with a pure text-generation approach?

Key Points

  1. Knowledge graphs provide factual answers through explicit entity-relation storage, making many lookup-style questions cheaper and faster than token-by-token LLM inference.

  2. LLMs can generate fluent context, but their inference cost and repeated computation make them inefficient for large-scale factual retrieval.

  3. LLM outputs can be inconsistent across systems and languages, and they may cite sources that don’t actually support the claimed facts.

  4. A practical path forward is “augmented language models” that orchestrate tool calls—knowledge-graph queries, math engines, and function libraries—rather than generating everything internally.

  5. Embedding millions of factual statements inside model weights is inefficient when curated, editable symbolic knowledge stores already exist.

  6. Knowledge graphs can improve auditability and explainability by grounding answers in queryable, structured data rather than opaque model memory.

  7. Representing uncertainty explicitly (e.g., a proposed “it’s complicated” value) can help systems handle cases where truth isn’t a single clean value.

Highlights

The cost gap isn’t just about hardware—it follows from the structural difference between graph lookup and deep-network token generation.
Birthplace examples show how LLM answers can shift with language and can conflict with structured sources, undermining reliability.
The proposed architecture turns LLMs into an orchestration/UX layer that calls knowledge-graph queries for ground truth.
Vrandečić argues that facts shouldn’t be repeatedly relearned inside model parameters when knowledge graphs can store, curate, and audit them once.

Topics

Mentioned

  • Denny Vrandečić
  • Jamie Taylor
  • Sam Altman
  • John Hennessy
  • Raphael
  • Anna Begović
  • Idris Elba
  • Helen Mirren
  • Lianda Koon
  • Michael
  • LLMs
  • GPT
  • SPARQL
  • CPU
  • GDP