ChatGPT Health Identified Respiratory Failure. Then It Said Wait.

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Mount Sinai’s ChatGPT Health study reported repeatable triage errors, including under-triaging respiratory failure by recommending waiting 24–48 hours before going to the ER.

Briefing

A Mount Sinai Health System study on “ChatGPT Health” found a dangerous pattern: the system sometimes recommends waiting for urgent conditions and sometimes recommends doctor visits for problems that should be handled in an emergency setting. The most alarming example is respiratory failure—where the tool advised waiting 24 to 48 hours before going to the ER, even though immediate care is the correct response. The failure isn’t portrayed as a one-off glitch; it appears as a repeatable behavior that can translate into real-world harm, including death.

Rather than treating the medical case as an isolated defect, the discussion reframes it as a roadmap for how AI agents fail in high-stakes environments. Four failure modes emerge from breaking down the study’s results, and none are limited to healthcare. First is an “inverted U” performance shape: systems often handle textbook emergencies and clearly minor cases well, but stumble at the edges where presentations are ambiguous—exactly where clinical stakes tend to be highest. Average accuracy can look strong while tail-risk failures quietly persist, because routine cases dominate the data and the evaluation metrics.

Second is “knows but does not act.” In the respiratory failure example, the reasoning chain can correctly identify dangerous findings while the final recommendation contradicts that conclusion. Research on chain-of-thought faithfulness is cited to argue that explanations and outputs can behave like semi-independent processes—meaning the visible reasoning may not reliably predict the action the system takes. A related warning: if teams running compliance or customer-service agents never check for mismatches between what those agents “think” and what they “do,” dangerous errors can slip past.

Third is “social context hijacks judgment.” When unstructured human language frames a situation—such as a family member minimizing symptoms—triage recommendations shift. One reported effect: recommendations shift sharply toward less urgent care when the input includes cues that the patient “looks fine,” a pattern described as anchoring bias. This kind of bias can be subtle, defensible in any individual case, and invisible in standard evaluations that don’t test the same scenario with and without the framing.

Fourth is guardrails that fire on “vibes,” not risk. Crisis-intervention alerts were reported to trigger more reliably on vague emotional distress than on concrete self-harm threats, with alerts inverted relative to clinical risk. The core issue is confusing surface-level language patterns with the actual risk taxonomy clinicians use.

To address these failure modes, the discussion proposes a four-layer evaluation and deployment architecture: progressive autonomy (human oversight for edge cases), deterministic validation (rules that verify key consistency between reasoning and action), a risk-biased “flywheel” that favors false positives while continuously updating what counts as a true defect, and factorial stress testing (controlled variation across contextual factors to measure which inputs drive behavior). The broader takeaway is that high-stakes agents require infrastructure to find blind spots before users pay the price—an expectation framed as increasingly unavoidable, including via future “AI insurance” requirements for agent builders.

Cornell Notes

Mount Sinai’s study of “ChatGPT Health” found repeatable safety failures: it sometimes under-triages urgent conditions (e.g., advising waiting 24–48 hours for respiratory failure) and sometimes over-triages non-emergencies. The analysis turns those findings into four general agent failure modes: an “inverted U” where edge cases fail despite good average accuracy, “knows but does not act” where reasoning and final actions diverge, social-context anchoring where framing shifts recommendations, and guardrails that trigger on surface cues rather than true clinical risk. The proposed fix is an evaluation architecture combining progressive autonomy, deterministic validation, a false-positive–biased continuous review flywheel, and factorial stress testing using controlled context variations. The stakes are life-or-death in healthcare and still high in finance, compliance, and customer operations.

Why does an “inverted U” matter for safety-critical agents even when overall accuracy looks high?

The pattern described is that performance is strongest for clear, textbook cases and for clearly minor cases, while failures concentrate at the distribution’s edges—where inputs are ambiguous or structurally different from routine examples. In production, LLM-based systems are trained on data where the middle is dense and extremes are sparse, so they look good on aggregate metrics (e.g., “87% accuracy”) while tail-risk errors remain hidden. The practical warning is that average-based evals won’t reliably surface failures that occur in the rare but consequential cases—like an accounts payable agent missing a slightly modified duplicate invoice or a claims agent failing to detect a third related claim after many months.
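
As a rough illustration (the numbers below are invented for the sketch, not taken from the study), segmenting an evaluation set by case type shows how an 87% aggregate score can coexist with near-total failure on ambiguous edge cases:

```python
from collections import defaultdict

# Hypothetical eval records: (case segment, was the agent's triage correct?)
results = (
    [("textbook_emergency", True)] * 40
    + [("clearly_minor", True)] * 45
    + [("ambiguous_edge", True)] * 2
    + [("ambiguous_edge", False)] * 13
)

overall = sum(ok for _, ok in results) / len(results)
by_segment = defaultdict(list)
for segment, ok in results:
    by_segment[segment].append(ok)

print(f"aggregate accuracy: {overall:.0%}")  # 87% -- looks strong
for segment, oks in by_segment.items():
    print(f"{segment:>20}: {sum(oks) / len(oks):.0%} (n={len(oks)})")
# ambiguous_edge comes out around 13%, invisible in the aggregate number.
```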

What does “knows but does not act” mean, and why can chain-of-thought be misleading?

In the Mount Sinai findings, the reasoning chain can correctly identify dangerous clinical findings, yet the final recommendation contradicts that conclusion—such as recognizing early respiratory failure while advising to wait rather than go to the ER. The discussion links this to research on chain-of-thought faithfulness: the explanation and the final output can behave like semi-independent processes. Even if the reasoning text is anchored early, the model may not update the output when later reasoning changes; studies also suggest that inserting incorrect reasoning can still yield correct answers sometimes. The takeaway: explanations are not a dependable proxy for the action the agent will take, so teams must validate the consistency between what the agent “reasons” and what it ultimately recommends.
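
A minimal sketch of what such a consistency check could look like; the red-flag list, urgency labels, and function below are illustrative assumptions, not the study’s tooling:

```python
# Red-flag findings and the minimum urgency each one warrants (illustrative values).
RED_FLAGS = {
    "respiratory failure": "ER_NOW",
    "hypoxia": "ER_NOW",
}
URGENCY_RANK = {"WAIT_24_48H": 0, "SEE_DOCTOR": 1, "ER_NOW": 2}

def reasoning_action_mismatch(reasoning: str, recommendation: str) -> bool:
    """True when the reasoning 'knows' more urgency than the final action shows."""
    text = reasoning.lower()
    required = [URGENCY_RANK[lvl] for flag, lvl in RED_FLAGS.items() if flag in text]
    return bool(required) and URGENCY_RANK[recommendation] < max(required)

# The respiratory-failure pattern: reasoning flags danger, the output says wait.
print(reasoning_action_mismatch(
    "Findings are consistent with early respiratory failure and hypoxia.",
    "WAIT_24_48H",
))  # True -> escalate to a human instead of surfacing the recommendation
```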

How can social context inputs change an agent’s decision without making the case obviously wrong?

The study’s reported effect is anchoring bias: when a family member minimizes symptoms (e.g., “the patient looks fine”), triage recommendations shift toward less urgent care. The discussion generalizes this to any agent that mixes structured fields with unstructured language, because framing can subtly steer the output while remaining individually defensible. It cites a concrete example outside medicine: a vendor selection recommendation can change when a note from a senior VP adds confidence, even if the structured facts are unchanged. Standard evaluations that don’t test the same scenario with and without the framing cue may never detect the bias.
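
A minimal paired-test sketch of this idea, assuming a hypothetical triage_agent callable that maps a case dict to an urgency label:

```python
URGENCY_RANK = {"WAIT_24_48H": 0, "SEE_DOCTOR": 1, "ER_NOW": 2}

def anchoring_detected(triage_agent, case: dict, framing_note: str) -> bool:
    """Run the same structured case with and without the social framing cue."""
    baseline = triage_agent(dict(case))                        # structured facts only
    framed = triage_agent(dict(case, free_text=framing_note))  # plus the framing cue
    return URGENCY_RANK[framed] < URGENCY_RANK[baseline]

# Usage with any callable that returns one of the urgency labels:
# anchoring_detected(my_agent, {"spo2": 86, "resp_rate": 30},
#                    "Family says the patient looks fine, just tired.")
```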

What is “guardrails that fire on vibes,” and how does it differ from testing for safety?

The guardrails issue is that alerts trigger based on surface-level language patterns—emotional tone, vague distress keywords—rather than the actual risk taxonomy used by clinicians. In the self-harm crisis-intervention case, alerts were described as inverted relative to clinical risk: vague emotional distress triggered more reliably than concrete self-harm threats. The distinction is between “appearance of safety” and real safety: a system may claim compliance or risk mitigation because it matches keywords, while missing the true high-risk scenario (e.g., flagging an email labeled “confidential” even when it’s a public press release, while not flagging an export of sensitive customer records described as a benign backup).
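
Sticking with the email example, a small sketch (hypothetical field names, not a real DLP API) of the gap between a guardrail keyed to surface cues and one keyed to the actual risk:

```python
def keyword_guardrail(event: dict) -> bool:
    # Fires on a surface cue: the word "confidential" in the subject line.
    return "confidential" in event["subject"].lower()

def risk_guardrail(event: dict) -> bool:
    # Fires on the actual risk: sensitive data leaving a trusted boundary.
    return event["contains_customer_records"] and not event["destination_trusted"]

press_release = {"subject": "CONFIDENTIAL: press release draft",
                 "contains_customer_records": False, "destination_trusted": True}
backup_export = {"subject": "Routine weekly backup",
                 "contains_customer_records": True, "destination_trusted": False}

print(keyword_guardrail(press_release), keyword_guardrail(backup_export))  # True False
print(risk_guardrail(press_release), risk_guardrail(backup_export))        # False True
```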

How do the proposed four layers of evaluation reduce these specific failure modes?

Layer 1 uses progressive autonomy: trust the agent for high-confidence, low-stakes decisions, but keep humans in the loop for extreme edge cases; one method is shadow mode where the human makes the call while the agent runs in parallel to measure agreement over time. Layer 2 adds deterministic validation: rules-based checks verify obvious consistency conditions (e.g., if enhanced due diligence flags appear in reasoning but output classifies as standard risk, escalate). Layer 3 builds a continuous flywheel biased toward false positives: review positives to distinguish true vs false alarms and update eval sets; also re-check cases the judge accepted to catch false negatives. Layer 4 uses factorial stress testing: controlled variations (social cues, time pressure, hedging qualifiers, tool-call failures) reveal which factors systematically shift behavior.
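
For Layer 4, a minimal factorial stress-testing sketch; the factor names, levels, and agent interface below are assumptions for illustration rather than the method described in the video:

```python
from itertools import product

# Contextual factors to vary while holding the base case fixed (illustrative levels).
FACTORS = {
    "social_cue": [None, "family says the patient looks fine"],
    "time_pressure": [None, "we need an answer in the next five minutes"],
    "hedging": [None, "I might be wrong, but"],
}

def stress_test(agent, base_case: dict):
    """Run every combination of factor levels against the same base case."""
    names = list(FACTORS)
    runs = []
    for levels in product(*(FACTORS[n] for n in names)):
        case = dict(base_case, context=[v for v in levels if v])
        runs.append((dict(zip(names, levels)), agent(case)))
    return runs

# Grouping the runs by each factor's level shows which factors systematically
# shift the recommendation on an otherwise identical case.
```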

Review Questions

  1. Which evaluation metric can hide “inverted U” failures, and what tail-risk behavior does it miss?
  2. Give one example of how chain-of-thought faithfulness problems could cause a mismatch between reasoning and final action.
  3. Why does factorial stress testing require controlled context variation, and what failure mode does it most directly target?

Key Points

  1. Mount Sinai’s ChatGPT Health study reported repeatable triage errors, including under-triaging respiratory failure by recommending waiting 24–48 hours before going to the ER.
  2. Safety failures cluster at the edges of input distributions, so average accuracy can mask tail-risk errors where stakes are highest.
  3. Reasoning text is not a reliable predictor of the final action; explanations and outputs can diverge, creating “knows but does not act” failures.
  4. Unstructured framing can anchor agent decisions, shifting recommendations even when structured facts remain unchanged.
  5. Guardrails can be inverted when they trigger on surface cues (keywords, emotional tone) rather than the true risk taxonomy.
  6. A four-layer approach—progressive autonomy, deterministic validation, a false-positive–biased continuous review flywheel, and factorial stress testing—targets these failure modes directly.

Highlights

  • Respiratory failure was used as a concrete example where ChatGPT Health recommended waiting 24–48 hours instead of immediate ER care.
  • The analysis emphasizes an “inverted U” pattern: strong performance in textbook cases and weak performance at ambiguous extremes.
  • Chain-of-thought faithfulness is treated as structurally unreliable, so teams must validate the match between reasoning and actions.
  • Factorial stress testing with controlled contextual variations is presented as a gold-standard method to expose anchoring, guardrail inversion, and edge-case failures.

Topics

  • AI Agent Safety
  • Medical Triage
  • Evaluation Design
  • Chain-of-Thought Faithfulness
  • Factorial Stress Testing

Mentioned