
Building a Healthcare & Life Sciences Knowledge Graph with Synthea & ChatGPT — Franz, Inc | KGC 2023

5 min read

Based on The Knowledge Graph Conference's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

Hospitals adopt knowledge graphs faster when they start with a plug-and-play graph plus ready-made notebooks and example queries rather than building from scratch.

Briefing

Healthcare and life-sciences teams can move from “ask-and-forget” chat answers to usable, queryable intelligence by pairing a hospital-ready knowledge graph with large language models—then validating and correcting outputs with web search and structured prompts. The core pitch is a plug-and-play healthcare knowledge graph built for hospitals, plus a set of workflows that let ChatGPT-like systems extract causal chains, generate clinical-style notes from patient context, and populate a graph with evidence-backed links.

The talk frames a practical bottleneck: hospitals rarely build knowledge graphs because the work is too hard to start and too risky to deploy. To address that, the team created a plug-and-play knowledge graph preloaded with realistic healthcare data and a library of example questions and methods, delivered in a way that can be explored quickly via notebooks. Under the hood is an “entity-event” knowledge graph model: a patient is the core entity, and everything that happens over time—demographics, encounters, procedures, observations, claims-related events—becomes typed event objects with start/end times and key-value attributes. Those events are organized in taxonomies so the graph can support consistent querying and downstream analytics.
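The entity-event shape described above can be sketched in a few lines. This is a minimal illustration, not Franz's actual schema: the class names, field names, and the LOINC-style observation code are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Event:
    event_type: str                  # e.g. "Encounter", "Procedure", "Observation"
    start: date                      # events are time-bounded
    end: Optional[date] = None
    attributes: dict = field(default_factory=dict)  # key-value payload

@dataclass
class Patient:
    patient_id: str                  # the patient is the core entity
    events: list = field(default_factory=list)

p = Patient("patient-001")
p.events.append(Event("Encounter", date(2022, 3, 1), date(2022, 3, 1),
                      {"reason": "annual wellness visit"}))
p.events.append(Event("Observation", date(2022, 3, 1),
                      attributes={"code": "8867-4", "value": "72", "unit": "bpm"}))

# Retrieving "everything about a patient" is a single traversal of the
# event list, not a join across many databases.
print(len(p.events))  # → 2
```

The point of the shape is that every kind of happening, clinical or administrative, fits the same typed, time-stamped mold, which is what makes downstream querying uniform.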

The knowledge graph’s value proposition is operational. Queries become far simpler to write and interpret, and retrieving “everything about a person” stops being a scavenger hunt across many databases. The talk also emphasizes security and governance: because personal health information can’t be freely shared, the team builds the graph using Synthea, a synthetic patient dataset, as a safe stand-in that hospitals can demonstrate internally to leadership and colleagues.

Beyond structure, the graph is designed to integrate unstructured medical text—an important point given that most medical data is not neatly formatted. The system links clinical notes and literature by mapping entities into UMLS, then uses entity extraction (including BioWordVec-based extraction) and an internal UMLS mapping approach to connect PubMed articles, clinical-trial text, and Synthea’s structured outputs into one system. That enables cross-domain queries such as linking patients to relevant clinical trials, or connecting medical records to conditions and evidence in the literature.

Large language models enter as both a productivity layer and a reasoning layer, but with heavy emphasis on validation. For causal-chain extraction, the workflow prompts an LLM to produce structured outputs (JSON or Python lists) and then checks whether the referenced links actually exist; the speaker notes that naive generation can fabricate or break half the links, and that web search plus correction can repair many of those failures. Another use case uses the graph plus prompting to generate narrative clinical reports from patient context (last conditions, procedures, medications, observations, and encounters), producing clinician-style notes and recommendations—while acknowledging that some outputs may be artificial and still require human oversight.
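The structured-output step above can be sketched as follows. The exact JSON shape is an assumption (the talk only specifies "JSON or Python lists"), and the reference strings are placeholders, not real citations.

```python
import json

# Example of what a structured causal-chain prompt might ask an LLM to emit:
# a JSON object rather than free-form narrative, so each link can be checked.
llm_output = """
{"chain": [
  {"cause": "smoking", "effect": "chronic bronchitis",
   "reference": "PMID-placeholder-1"},
  {"cause": "chronic bronchitis", "effect": "reduced lung function",
   "reference": "PMID-placeholder-2"}
]}
"""

# Parse into (cause, effect, reference) triples ready for validation
# and insertion into the graph.
chain = json.loads(llm_output)["chain"]
triples = [(link["cause"], link["effect"], link["reference"]) for link in chain]
print(len(triples))  # → 2
```

Because each link arrives as a discrete object with its own reference, the validation step can accept or reject links individually instead of re-reading a paragraph of prose.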

Finally, the talk extends the same pattern to a wine knowledge graph built automatically from an ontology and then enriched with LLM-generated descriptions, taste profiles, and (optionally) web-verified information. The closing message is a workflow principle: intelligent applications work best as a collaboration between humans, an LLM, and a knowledge graph—because the graph stores structured knowledge that can be queried repeatedly, while LLM answers alone are fleeting and harder to audit.

Cornell Notes

A hospital-ready knowledge graph can make healthcare questions easier, safer, and more reusable when it’s paired with large language models. The system uses an “entity-event” model: patients are core entities, and time-stamped events (demographics, encounters, procedures, observations, claims-related data) are attached to them and organized through taxonomies. To handle privacy and adoption barriers, it starts with Synthea synthetic data and provides plug-and-play notebooks plus a library of queries for common hospital workflows. Unstructured text is integrated by mapping entities into UMLS and extracting entities from clinical and literature sources, enabling cross-domain queries like patient-to-trial matching. LLMs then generate structured outputs (e.g., causal chains or clinical-style notes), but web search and validation steps are used to reduce fabricated links and other errors.

Why does an “entity-event” knowledge graph matter for healthcare use cases?

It turns the messy reality of healthcare timelines into a consistent structure: one core entity (the patient) and many event objects that describe what happens over time. Each event carries a type plus start/end times and key-value attributes, and the events are logically grouped in taxonomies. That structure makes it easier to query “everything about a patient” and to support tasks like event prediction, feature extraction, and reporting without stitching together thousands of records across separate databases.
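As a small illustration of why the uniform event structure helps feature extraction, grouping a patient's event stream by type becomes a one-pass operation. The event tuples here are purely illustrative.

```python
from collections import defaultdict

# A patient's timeline as (event_type, year, attributes) tuples;
# in the real graph these would be typed event objects.
events = [
    ("Encounter", 2021, {"reason": "flu symptoms"}),
    ("Procedure", 2021, {"code": "influenza test"}),
    ("Encounter", 2022, {"reason": "follow-up"}),
]

# Group by event type, the same way the taxonomy organizes events,
# so per-type features (counts, recency, etc.) fall out directly.
by_type = defaultdict(list)
for etype, year, attrs in events:
    by_type[etype].append((year, attrs))

print(len(by_type["Encounter"]))  # → 2
```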

How does the approach handle privacy when hospitals can’t share personal health information?

Instead of building the demonstration graph directly from protected data, it uses Synthea to create a realistic synthetic dataset that can be loaded into the knowledge graph. That lets hospitals safely show leadership and colleagues what a knowledge graph enables—simpler queries, one-click retrieval, and structured JSON outputs—while keeping personal health information protected.

What’s the strategy for connecting unstructured medical text to the graph?

The system links unstructured sources (clinical notes and literature) by mapping entities into UMLS and using entity extraction to connect text to graph concepts. It integrates PubMed articles and clinical-trial text alongside Synthea’s linked content, so the resulting graph can connect patients, conditions, and evidence across structured and unstructured data in one system.
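A toy sketch of the entity-linking idea: map surface forms in text to shared concept identifiers so that notes, trials, and literature all point at the same graph nodes. Real pipelines use trained extractors (the talk mentions BioWordVec); the mini-lexicon here is illustrative, and the UMLS-style CUIs are examples rather than a vouched-for mapping.

```python
# Minimal surface-form lexicon mapping terms to UMLS-style concept IDs (CUIs).
# Synonyms ("heart attack", "myocardial infarction") resolve to one concept,
# which is what lets structured and unstructured sources meet in the graph.
lexicon = {
    "myocardial infarction": "C0027051",
    "heart attack": "C0027051",
    "aspirin": "C0004057",
}

def link_entities(text: str) -> set:
    """Return the set of concept IDs whose surface forms appear in the text."""
    lowered = text.lower()
    return {cui for term, cui in lexicon.items() if term in lowered}

note = "Patient presented with a heart attack; started on aspirin."
print(sorted(link_entities(note)))
```

Once a clinical note and a PubMed abstract both resolve to the same CUI, a single graph query can traverse from patient to condition to evidence.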

How are causal chains extracted from LLMs without relying on free-form text?

The workflow uses structured prompting that forces outputs into a specific format such as a JSON object or a Python list, rather than narrative text. It then validates the generated references: the speaker reports that naive generation can fabricate or produce broken links, and that adding web search to verify and correct those links improves reliability. The validated causal chain can then be represented in the graph with link strengths and attached references.
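The validation step can be sketched as a filter over the generated chain. `reference_resolves` below is a hypothetical stand-in for the real web-search/HTTP check; the talk does not specify its implementation.

```python
# Stand-in for the web-search verification step: here we only treat
# well-formed PubMed-style URLs as resolvable. A real check would fetch
# the reference and confirm it supports the claimed link.
def reference_resolves(url: str) -> bool:
    return url.startswith("https://pubmed.ncbi.nlm.nih.gov/")

# A generated chain of (cause, effect, reference) triples, one of which
# carries a fabricated reference — the failure mode the speaker describes.
chain = [
    ("smoking", "lung damage", "https://pubmed.ncbi.nlm.nih.gov/placeholder"),
    ("lung damage", "reduced capacity", "not-a-url"),
]

validated = [link for link in chain if reference_resolves(link[2])]
dropped = len(chain) - len(validated)
print(dropped)  # → 1
```

Links that fail the check can either be dropped or, as the talk suggests, sent back through web search for correction before the chain is written to the graph.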

What does “LLM + knowledge graph” look like in a clinical-note generation workflow?

A SPARQL query pulls patient-specific context from the graph (e.g., last five conditions, procedures, medications, observations, and encounters). A prompt then asks the LLM to produce a detailed clinical report in narrative or structured style, including observations, recommendations, labs/ER visits, prescriptions, and follow-ups. The speaker notes that outputs may include recommendations not explicitly present in the original observations, so human review remains important.
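The two steps above, query the graph for recent context and then assemble the prompt, might look like this. The SPARQL prefix and predicate names are assumptions for illustration, not the actual AllegroGraph schema.

```python
# Illustrative SPARQL for the "last five conditions" context pull;
# predicates (ex:hasEvent, ex:startTime, ...) are invented for this sketch.
PATIENT_CONTEXT_QUERY = """
PREFIX ex: <http://example.org/healthcare#>
SELECT ?condition ?start WHERE {
  ex:patient-001 ex:hasEvent ?e .
  ?e a ex:Condition ;
     ex:label ?condition ;
     ex:startTime ?start .
}
ORDER BY DESC(?start)
LIMIT 5
"""

def build_prompt(context_rows: list) -> str:
    """Turn query results into the narrative-report prompt for the LLM."""
    bullet_list = "\n".join(f"- {row}" for row in context_rows)
    return ("Write a detailed clinical report with observations, "
            "recommendations, and follow-ups for a patient with:\n" + bullet_list)

prompt = build_prompt(["hypertension (2023-01-10)", "type 2 diabetes (2022-06-02)"])
```

Keeping the context pull in the graph query (rather than asking the LLM to recall the patient) is what grounds the generated note in actual recorded events.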

What’s the wine-graph example meant to demonstrate beyond healthcare?

It shows the same pattern—ontology + knowledge graph + structured prompting—can automatically generate and enrich domain knowledge. The wine graph is built from an ontology and then queried to retrieve grapes, taste profiles, and other attributes; LLM prompting can generate structured descriptions (and optionally “fantasy” reviews when real reviews aren’t available). The example also highlights the need to constrain outputs to predicates and validate anything that depends on web information.

Review Questions

  1. How does the entity-event model change the way healthcare questions are written and answered compared with querying many separate databases?
  2. What validation steps are used to reduce fabricated or broken references when extracting causal chains with LLMs?
  3. Why is UMLS mapping central to connecting unstructured clinical and literature text to a single knowledge graph?

Key Points

  1. Hospitals adopt knowledge graphs faster when they start with a plug-and-play graph plus ready-made notebooks and example queries rather than building from scratch.
  2. An entity-event knowledge graph models healthcare timelines by attaching time-stamped event objects to a patient entity and organizing them with taxonomies.
  3. Synthea enables safe demonstrations of knowledge-graph capabilities without exposing protected personal health information.
  4. UMLS mapping and entity extraction connect unstructured clinical notes and literature (e.g., PubMed and clinical-trial text) to the same graph concepts as structured data.
  5. LLMs work best when prompted for structured outputs (JSON/Python lists) and paired with web search and validation to limit fabricated links.
  6. Clinical-style narratives can be generated by pulling recent patient context from the graph and prompting an LLM to produce recommendations and follow-ups, but human oversight remains essential.
  7. A knowledge graph turns LLM answers into reusable, auditable knowledge by storing structured facts that can be queried repeatedly.

Highlights

The “entity-event” model treats every patient-related happening—demographics, encounters, procedures, observations—as typed, time-bounded events attached to a single patient entity.
Synthea is used as a privacy-preserving way to let hospitals demonstrate knowledge-graph value without sharing protected health data.
Naive LLM causal-chain generation can fabricate or break many links, and web search plus correction is used to improve reference integrity.
Structured prompting plus graph context enables LLMs to generate clinician-style reports based on the last conditions, procedures, medications, observations, and encounters pulled from the graph.

Topics

  • Healthcare Knowledge Graph
  • Entity-Event Modeling
  • UMLS Integration
  • LLM Prompting
  • Causal Chain Extraction

Mentioned

  • UMLS
  • LLMs
  • JSON
  • ICU
  • CMS
  • ER
  • PubMed
  • GPT
  • Synthea
  • SPARQL