Briefing
This paper asks whether generative AI can reduce the documentation burden on clinicians by automatically producing structured clinical notes from patient–clinician interactions, specifically in SOAP (Subjective, Objective, Assessment, Plan) and BIRP (Behavior, Intervention, Response, Plan) formats. The motivation is practical and urgent: clinical documentation is widely reported to consume a large fraction of clinicians’ working time, contributing to burnout, medical errors, and downstream patient-safety risks. The authors position generative AI as a way to shift clinicians’ effort away from administrative writing and toward direct patient care, while still producing documentation that is structured enough to support clinical workflows.
The significance of the work is twofold. First, it targets a concrete, high-friction task—turning conversational encounters into standardized note templates—rather than generic text generation. Second, it emphasizes patient-centered documentation by preserving the “subjective” or “behavioral” content from the patient while organizing clinical reasoning and next steps into consistent sections. In the broader context of health NLP, the paper aligns with ongoing efforts to automate or assist clinical note writing, but it also acknowledges known failure modes from prior studies, such as omissions, variability in quality, and reliability gaps when using large language models (LLMs) without careful prompting and structure constraints.
Methodologically, the paper is presented as a case study and system design exercise rather than a controlled clinical evaluation. The pipeline includes: (1) data collection from publicly available synthetic/educational therapy sessions, (2) speech-to-text transcription using automatic speech recognition (ASR), (3) speaker diarization to separate patient vs. clinician utterances, and (4) LLM-based generation of SOAP/BIRP notes using prompting strategies. For data, the authors collaborate with a University of Leeds researcher/clinical psychologist (Judith Johnson) and use experimental therapy sessions posted on YouTube. They state that the videos do not contain personally identifiable information (PII) or protected health information (PHI), and they treat the sessions as realistic examples of patient–clinician dialogue.
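As a concrete illustration of the transcription stage, the sketch below uses the open-source whisper package; the model size and file name are placeholders rather than details from the paper.

```python
# Minimal transcription sketch, assuming the open-source openai-whisper
# package; "base" and the file name are illustrative choices.
import whisper

model = whisper.load_model("base")  # larger checkpoints trade speed for accuracy
result = model.transcribe("therapy_session.mp3")

# Whisper returns the full text plus timestamped segments, but no speaker
# labels, which is why a separate diarization step is needed downstream.
for segment in result["segments"]:
    print(f"[{segment['start']:7.1f}s] {segment['text'].strip()}")
```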
For transcription, the authors use OpenAI Whisper and note a key limitation: Whisper does not provide speaker diarization out of the box. They therefore test two diarization approaches. The first uses an alternative setup based on insanely-fast-whisper, which supports plugging in a diarization model (pyannote/speaker-diarization), but they report that it failed to produce diarization suitable for note generation. The second uses GPT-3.5 for utterance classification: after Whisper produces a plain transcript, GPT-3.5 labels each utterance as patient or clinician. They frame this as a binary sequence labeling problem with softmax-normalized class probabilities and a cross-entropy loss. They evaluate diarization using accuracy, precision, recall, F1-score, and confusion matrices (figures are provided), but the paper does not report explicit numeric diarization performance in the provided text.
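A minimal sketch of the second approach, assuming the openai Python SDK; the prompt wording and the label_utterance helper are illustrative, not the authors' exact prompt.

```python
# Sketch of GPT-3.5 utterance classification for diarization; the prompt
# text below is an assumption, not the paper's actual prompt.
from openai import OpenAI

client = OpenAI()

def label_utterance(utterance: str, context: str) -> str:
    """Label one transcript utterance as CLINICIAN or PATIENT."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic labels for a classification task
        messages=[
            {"role": "system", "content": (
                "You label utterances from a therapy-session transcript. "
                "Reply with exactly one word: CLINICIAN or PATIENT."
            )},
            {"role": "user", "content": f"Previous turns:\n{context}\n\nUtterance:\n{utterance}"},
        ],
    )
    return response.choices[0].message.content.strip()
```

Iterating this labeler over the Whisper transcript yields the diarized patient/clinician turns that the note-generation prompts consume.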
The core generation component evaluates multiple LLMs for SOAP/BIRP note generation from diarized transcripts: GPT-3.5 Turbo, GPT-4 Turbo, Claude V3, Mixtral 8x7B Instruct, and Llama-3 70B Instruct. Prompting is varied along two main axes: basic prompting (direct instructions plus the transcript) versus advanced prompting (zero-shot and one-shot prompting, plus structured prompting with a JSON schema), and prompt refinement strategies such as iterative refinement, prompt chaining, and prompt ensembling. For GPT-4 Turbo, the authors use a function-calling feature to produce programmatically structured SOAP/BIRP notes in JSON format, which supports downstream validation and consumption.
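A sketch of what the schema-constrained path could look like with GPT-4 Turbo's function calling, again assuming the openai Python SDK; the emit_soap_note function name and the schema fields (which simply mirror the four SOAP sections) are our assumptions.

```python
# Sketch of structured SOAP output via function calling; the function name
# and schema are illustrative, mirroring the four SOAP sections.
import json
from openai import OpenAI

client = OpenAI()

SOAP_SCHEMA = {
    "type": "object",
    "properties": {
        "subjective": {"type": "string"},
        "objective": {"type": "string"},
        "assessment": {"type": "string"},
        "plan": {"type": "string"},
    },
    "required": ["subjective", "objective", "assessment", "plan"],
}

def generate_soap_note(diarized_transcript: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Draft a SOAP note from this diarized transcript."},
            {"role": "user", "content": diarized_transcript},
        ],
        tools=[{"type": "function",
                "function": {"name": "emit_soap_note", "parameters": SOAP_SCHEMA}}],
        tool_choice={"type": "function", "function": {"name": "emit_soap_note"}},
    )
    # The tool-call arguments arrive as a JSON string conforming to the
    # schema, which makes downstream validation and consumption simple.
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)
```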
For evaluation, the paper reports a comparative analysis across 20 SOAP and BIRP note samples, each graded for quality by humans and assessed using ROUGE-1 F1 scores. The most explicit quantitative results concern ROUGE-1 F1 ranges by model. For SOAP notes, GPT-4 achieves ROUGE-1 F1 scores between 0.90 and 0.95, described as consistently superior. Claude and Llama show similar performance with ROUGE-1 F1 fluctuating between 0.70 and 0.80. Mixtral ranges between 0.65 and 0.75, described as the least performant but still “reasonably accurate” with lower consistency and precision. For BIRP notes, the paper includes a corresponding figure (fig. 6) but, in the provided excerpt, does not list the exact numeric ranges in text beyond the SOAP discussion.
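For reference, ROUGE-1 F1 (the metric behind these ranges) can be computed with the rouge-score package; the reference/candidate pair below is a toy example, not data from the paper.

```python
# ROUGE-1 F1 compares unigram overlap between a reference note and a
# generated note; F1 is the harmonic mean of unigram precision and recall.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

reference = "Patient reports low mood and poor sleep. Plan: weekly CBT sessions."
candidate = "Patient describes low mood with poor sleep. Plan is weekly CBT."

score = scorer.score(reference, candidate)["rouge1"]
print(f"P={score.precision:.2f} R={score.recall:.2f} F1={score.fmeasure:.2f}")
```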
Beyond one-shot generation, the authors propose an “iterative note improvement” mechanism to keep notes aligned with evolving patient care. The idea is to update existing SOAP/BIRP notes using new encounter data (audio/transcripts/documents) via conditional note generation (provide the prior note plus new transcript and ask for an updated note) or iterative refinement (extract relevant new information first, then integrate it into the existing note). They also discuss continuous adaptation across multiple encounters, version control/auditing, and timestamping of note revisions.
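A sketch of the conditional variant (prior note plus new transcript in, updated note out), assuming the openai Python SDK; the prompt wording and the revision timestamp are our illustration of the idea, not the authors' implementation.

```python
# Conditional note update sketch: feed the existing note and the new
# encounter transcript, ask for a revised note, and timestamp the revision
# to support the paper's version-control/auditing suggestion.
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

def update_note(prior_note: str, new_transcript: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": (
                "Update the existing SOAP note with information from the new "
                "encounter transcript. Keep unchanged sections intact."
            )},
            {"role": "user", "content": (
                f"EXISTING NOTE:\n{prior_note}\n\nNEW TRANSCRIPT:\n{new_transcript}"
            )},
        ],
    )
    return {
        "note": response.choices[0].message.content,
        "revised_at": datetime.now(timezone.utc).isoformat(),
    }
```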
Limitations are acknowledged primarily in the form of challenges and risks rather than in a formal limitations section. The paper highlights: dependence on data quality and representativeness (risk of bias), privacy/security concerns (ensuring generated notes do not leak PII/PHI), lack of interpretability/transparency (difficulty understanding model rationale), reliability/robustness issues (hallucinations and factual errors), regulatory compliance and liability, and the necessity of human oversight and validation. Additionally, judging from the methodology described, the study does not appear to include a rigorous clinical outcome evaluation (e.g., whether generated notes improve diagnosis accuracy, reduce errors, or affect patient outcomes). It also relies on educational/simulated therapy sessions rather than real clinical encounters, which may limit generalizability.
Practically, the results suggest that structured prompting, especially with a JSON schema and strong models like GPT-4, can produce higher-quality SOAP/BIRP drafts from diarized transcripts, potentially saving clinician time and improving documentation consistency. Those who should care include healthcare organizations exploring clinical documentation automation, developers building health NLP pipelines (ASR + diarization + LLM note generation), and clinical informatics teams responsible for governance, privacy, and workflow integration. The paper's emphasis on iterative updates and auditing implies that any deployment would need robust human-in-the-loop review and careful system design to ensure safety and compliance.
Cornell Notes
The paper proposes a generative-AI pipeline that transcribes patient–clinician conversations, diarizes speakers, and uses LLM prompting (including JSON schema) to generate structured SOAP and BIRP clinical notes. In a case-study evaluation using ROUGE-1 F1 on 20 note samples, GPT-4 produced the highest-quality outputs (ROUGE-1 F1 between 0.90 and 0.95), while other models scored lower.
What research question does the paper address?
Can generative AI streamline clinical documentation by generating structured SOAP and BIRP notes from patient–clinician interactions while improving quality and reducing clinician burden?
Why does the paper matter for healthcare delivery?
Clinical documentation time contributes to burnout and can increase medical errors; automating note drafting could free clinician time and improve consistency, but must be done safely with privacy and bias controls.
What study design or evaluation approach is used?
A case-study/system design approach: build an end-to-end pipeline (ASR → diarization → LLM note generation) and compare model outputs using human-graded note samples and ROUGE-1 F1.
What data source is used for patient–clinician interactions?
Publicly available educational therapy sessions from a University of Leeds researcher (Judith Johnson) on YouTube, described as containing no PII/PHI.
How is diarization handled, and what happens with the first diarization approach?
Whisper transcription is followed by speaker diarization. The authors test an insanely-fast-whisper + pyannote diarization plug-in but report that it did not yield diarization suitable for note generation.
What diarization method ultimately works in the pipeline?
Utterance classification: GPT-3.5 labels each Whisper utterance as patient or clinician using a binary sequence labeling formulation.
Which LLMs are evaluated for SOAP/BIRP generation?
GPT-3.5 Turbo, GPT-4 Turbo, Claude V3, Mixtral 8x7B Instruct, and Llama-3 70B Instruct.
What prompting strategies are compared?
Basic prompting versus advanced prompting, including zero-shot and one-shot prompting, plus structured prompting using JSON schema; they also explore iterative refinement, prompt chaining, and prompt ensembling.
What is the primary quantitative result reported for model quality?
Using ROUGE-1 F1 on 20 SOAP/BIRP samples: GPT-4 achieves ROUGE-1 F1 between 0.90 and 0.95, while Claude and Llama fluctuate between 0.70 and 0.80 and Mixtral between 0.65 and 0.75.
How does the paper propose keeping notes up to date over time?
An iterative note improvement approach that updates existing SOAP/BIRP notes using new encounter transcripts/documents via conditional note generation or a two-step extract-then-integrate refinement process, with version control/auditing.
Review Questions
Reconstruct the end-to-end pipeline from audio/video to structured SOAP/BIRP output, including where diarization occurs and why it is necessary.
Compare basic prompting vs. advanced prompting (zero-shot/one-shot vs. JSON schema). What does structured prompting buy you in this system?
Interpret the ROUGE-1 F1 ranges: what do they imply about model robustness and consistency across note complexity?
List at least four deployment risks the authors emphasize (privacy, hallucinations, bias, interpretability, regulatory/liability) and explain how human oversight fits into mitigation.
Key Points
1. The paper targets automated, structured clinical note generation (SOAP and BIRP) from patient–clinician conversations to reduce documentation burden and improve consistency.
2. An end-to-end pipeline is proposed: Whisper transcription, speaker diarization (ultimately via GPT-3.5 utterance classification), and LLM-based note drafting.
3. The authors report that a diarization approach using an insanely-fast-whisper setup with pyannote diarization failed to produce usable diarization for note generation.
4. For note generation quality (human-graded samples evaluated with ROUGE-1 F1), GPT-4 is best for SOAP notes with ROUGE-1 F1 between 0.90 and 0.95; Claude and Llama are lower (0.70–0.80) and Mixtral lowest (0.65–0.75).
5. Advanced prompting, including JSON schema structured prompting and (for GPT-4 Turbo) function calling to emit JSON, aims to improve adherence to note structure.
6. The paper proposes iterative note updating across encounters using conditional update or extract-then-integrate refinement, plus version control/auditing.
7. Key limitations and risks are discussed: data representativeness/bias, privacy/security, interpretability, hallucination/reliability, regulatory compliance and liability, and the need for human-in-the-loop validation.