Building a Healthcare & Life Sciences Knowledge Graph with Synthea & ChatGPT — Franz, Inc | KGC 2023
Based on The Knowledge Graph Conference's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Healthcare and life-sciences teams can move from “ask-and-forget” chat answers to usable, queryable intelligence by pairing a hospital-ready knowledge graph with large language models—then validating and correcting outputs with web search and structured prompts. The core pitch is a plug-and-play healthcare knowledge graph built for hospitals, plus a set of workflows that let ChatGPT-like systems extract causal chains, generate clinical-style notes from patient context, and populate a graph with evidence-backed links.
The talk frames a practical bottleneck: hospitals rarely build knowledge graphs because the work is too hard to start and too risky to deploy. To address that, the team created a plug-and-play knowledge graph preloaded with realistic healthcare data and a library of example questions and methods, delivered in a way that can be explored quickly via notebooks. Under the hood is an “entity-event” knowledge graph model: a patient is the core entity, and everything that happens over time—demographics, encounters, procedures, observations, claims-related events—becomes typed event objects with start/end times and key-value attributes. Those events are organized in taxonomies so the graph can support consistent querying and downstream analytics.
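The entity-event shape described above can be sketched in plain Python. This is a minimal illustration, not Franz, Inc.'s actual schema: the class names, field names, and sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    """A typed, time-stamped event attached to a patient entity."""
    event_type: str                  # e.g. "Encounter", "Procedure", "Observation"
    start: str                       # ISO-8601 start time
    end: Optional[str] = None        # optional end time
    attributes: dict = field(default_factory=dict)  # key-value payload

@dataclass
class Patient:
    """The core entity; everything that happens over time hangs off it."""
    patient_id: str
    demographics: dict
    events: list = field(default_factory=list)

    def events_of_type(self, event_type: str) -> list:
        """Taxonomy-style lookup: all events of one type, in start-time order."""
        return sorted((e for e in self.events if e.event_type == event_type),
                      key=lambda e: e.start)

# Illustrative synthetic patient (IDs and values are made up).
p = Patient("synthea-p-001", {"gender": "F", "birth_year": 1980})
p.events.append(Event("Encounter", "2021-03-01", "2021-03-01", {"reason": "checkup"}))
p.events.append(Event("Observation", "2021-03-01",
                      attributes={"code": "blood-pressure", "value": "120/80"}))
print(len(p.events_of_type("Observation")))  # → 1
```

Because every event carries its own type, timestamps, and attribute map, "everything about a patient" is a single traversal from the patient entity rather than a join across many tables.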
The knowledge graph’s value proposition is operational. Queries become far simpler to write and interpret, and retrieving “everything about a person” stops being a scavenger hunt across many databases. The talk also emphasizes security and governance: because personal health information can’t be freely shared, the team builds the graph using Synthea, a synthetic patient dataset, as a safe stand-in that hospitals can demonstrate internally to leadership and colleagues.
Beyond structure, the graph is designed to integrate unstructured medical text—an important point given that most medical data is not neatly formatted. The system links clinical notes and literature by mapping entities into UMLS, then uses entity extraction (including BioWordVec-based extraction) and an internal UMLS mapping approach to connect PubMed articles, clinical-trial text, and Synthea’s structured outputs into one system. That enables cross-domain queries such as linking patients to relevant clinical trials, or connecting medical records to conditions and evidence in the literature.
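The linking step above can be illustrated with a toy lexicon-based entity linker. This is a deliberately naive stand-in for the BioWordVec-based extraction and internal UMLS mapping the talk describes; the mention table and the concept identifiers below are placeholders, not verified UMLS CUIs.

```python
# Hypothetical mention-to-concept table; the "C…" identifiers are
# placeholders, NOT real UMLS CUIs.
MENTION_TO_CUI = {
    "diabetes mellitus": "C0000001",
    "hypertension": "C0000002",
    "metformin": "C0000003",
}

def map_to_umls(text: str) -> list:
    """Naive substring-match entity linking over a small lexicon."""
    lowered = text.lower()
    return [(mention, cui)
            for mention, cui in MENTION_TO_CUI.items()
            if mention in lowered]

note = "Patient with hypertension started on metformin for diabetes mellitus."
links = map_to_umls(note)
print(len(links))  # → 3
```

Once a clinical note, a PubMed abstract, and a Synthea condition code all resolve to the same concept identifier, cross-domain queries (patient-to-trial, record-to-literature) become simple graph joins on that identifier.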
Large language models enter as both a productivity layer and a reasoning layer, with heavy emphasis on validation. For causal-chain extraction, the workflow prompts an LLM to produce structured outputs (JSON or Python lists) and then checks whether the referenced links actually exist; the speaker notes that naive generation can fabricate or break roughly half the links, and that web search plus correction can repair many of those failures. A second use case combines graph context with prompting to generate narrative clinical reports from a patient's recent history (last conditions, procedures, medications, observations, and encounters), producing clinician-style notes and recommendations, while acknowledging that some outputs may be artificial and still require human oversight.
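The validate-then-correct loop for causal chains can be sketched as follows. The edge set and the sample "LLM output" are invented for illustration; in the real workflow the edges would come from the knowledge graph and the JSON from a structured-output prompt.

```python
import json

# Edges already known to the graph (illustrative only).
graph_edges = {
    ("smoking", "lung cancer"),
    ("obesity", "type 2 diabetes"),
}

# Simulated structured LLM output: one real link, one fabricated link.
llm_output = json.dumps([
    {"cause": "smoking", "effect": "lung cancer"},
    {"cause": "smoking", "effect": "hair loss"},
])

def validate_links(raw_json: str, edges: set) -> tuple:
    """Split LLM-proposed causal links into verified and unverified sets."""
    verified, unverified = [], []
    for link in json.loads(raw_json):
        pair = (link["cause"], link["effect"])
        (verified if pair in edges else unverified).append(link)
    return verified, unverified

ok, suspect = validate_links(llm_output, graph_edges)
# Unverified links would be routed to a web-search / correction step
# rather than written into the graph as-is.
print(len(ok), len(suspect))  # → 1 1
```

Asking for JSON rather than free-form prose is what makes this check mechanical: each proposed link is a pair that either does or does not match an edge, so fabricated links can be quarantined instead of silently entering the graph.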
Finally, the talk extends the same pattern to a wine knowledge graph built automatically from an ontology and then enriched with LLM-generated descriptions, taste profiles, and (optionally) web-verified information. The closing message is a workflow principle: intelligent applications work best as a collaboration between humans, an LLM, and a knowledge graph—because the graph stores structured knowledge that can be queried repeatedly, while LLM answers alone are fleeting and harder to audit.
Cornell Notes
A hospital-ready knowledge graph can make healthcare questions easier, safer, and more reusable when it’s paired with large language models. The system uses an “entity-event” model: patients are core entities, and time-stamped events (demographics, encounters, procedures, observations, claims-related data) are attached to them and organized through taxonomies. To handle privacy and adoption barriers, it starts with Synthea synthetic data and provides plug-and-play notebooks plus a library of queries for common hospital workflows. Unstructured text is integrated by mapping entities into UMLS and extracting entities from clinical and literature sources, enabling cross-domain queries like patient-to-trial matching. LLMs then generate structured outputs (e.g., causal chains or clinical-style notes), but web search and validation steps are used to reduce fabricated links and other errors.
Why does an “entity-event” knowledge graph matter for healthcare use cases?
How does the approach handle privacy when hospitals can’t share personal health information?
What’s the strategy for connecting unstructured medical text to the graph?
How are causal chains extracted from LLMs without relying on free-form text?
What does “LLM + knowledge graph” look like in a clinical-note generation workflow?
What’s the wine-graph example meant to demonstrate beyond healthcare?
Review Questions
- How does the entity-event model change the way healthcare questions are written and answered compared with querying many separate databases?
- What validation steps are used to reduce fabricated or broken references when extracting causal chains with LLMs?
- Why is UMLS mapping central to connecting unstructured clinical and literature text to a single knowledge graph?
Key Points
1. Hospitals adopt knowledge graphs faster when they start with a plug-and-play graph plus ready-made notebooks and example queries rather than building from scratch.
2. An entity-event knowledge graph models healthcare timelines by attaching time-stamped event objects to a patient entity and organizing them with taxonomies.
3. Synthea enables safe demonstrations of knowledge-graph capabilities without exposing protected personal health information.
4. UMLS mapping and entity extraction connect unstructured clinical notes and literature (e.g., PubMed and clinical-trial text) to the same graph concepts as structured data.
5. LLMs work best when prompted for structured outputs (JSON/Python lists) and paired with web search and validation to limit fabricated links.
6. Clinical-style narratives can be generated by pulling recent patient context from the graph and prompting an LLM to produce recommendations and follow-ups, but human oversight remains essential.
7. A knowledge graph turns LLM answers into reusable, auditable knowledge by storing structured facts that can be queried repeatedly.