Executive Summary
Nature Medicine published an independent evaluation of ChatGPT Health, OpenAI’s consumer-facing medical triage tool. Researchers at Mount Sinai’s Icahn School of Medicine tested the system with 60 clinician-authored clinical vignettes under controlled conditions, generating 960 triage recommendations, and found that it directed patients away from the emergency department in over half of true emergencies, including cases of diabetic ketoacidosis and impending respiratory failure.
The system’s suicide-crisis safeguards activated unpredictably, firing more often for vague distress than for patients describing specific self-harm plans. A single dismissive comment from a family member made the system nearly twelve times more likely to shift its recommendation toward less urgent care. These findings reveal failure modes that are not unique to medicine; they are structural properties of how large language models behave in production. This brief explains what was found, why it matters to our training environment, and what actions are recommended.
Background and Findings
The Study
OpenAI launched ChatGPT Health in January 2026 as a purpose-built feature designed to help consumers assess how urgently they should seek medical care. It was developed alongside a benchmark called HealthBench and shaped by input from over 260 physicians and more than 600,000 rounds of clinician feedback. Despite this extensive development process, the tool had not undergone independent external evaluation before reaching millions of users.
Mount Sinai’s research team (Ramaswamy et al.) created 30 clinical scenarios spanning 21 medical specialties, each written in two versions: one with symptoms only and one that also included objective findings such as lab values and vital signs. This produced 60 vignettes. Three physicians independently assigned a gold-standard urgency level to each case—ranging from “monitor at home” to “go to the emergency department now”—using guidelines from 58 professional societies. Inter-rater agreement was high (Fleiss’ κ = 0.90). Each vignette was then tested across 16 conditions varying patient race, sex, the presence of a family member minimizing symptoms (anchoring), and barriers to care such as insurance or transportation. The result was 960 total AI-generated triage recommendations.
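To make the scale of the design concrete, the sketch below (Python, illustrative) enumerates a test matrix of the same shape. The specific factor levels are assumptions on our part; the study reports only that 16 conditions varied race, sex, anchoring, and barriers to care (consistent with four binary factors), and that 60 vignettes tested across 16 conditions yielded 960 recommendations.

```python
# Illustrative sketch of the study's factorial test matrix. The factor levels below
# are assumptions (four binary contextual factors is consistent with the 16 reported
# conditions); only the counts 30, 2, 16, and 960 come from the paper.
from itertools import product

scenarios = range(1, 31)                                  # 30 clinician-authored scenarios
versions = ["symptoms_only", "symptoms_plus_findings"]    # x2 -> 60 vignettes
race = ["Black", "white"]                                 # assumed binary levels
sex = ["female", "male"]
anchoring = ["none", "family_member_minimizes_symptoms"]
barrier = ["none", "insurance_or_transportation"]

test_matrix = list(product(scenarios, versions, race, sex, anchoring, barrier))
print(len(test_matrix))   # 30 * 2 * 2 * 2 * 2 * 2 = 960 triage recommendations
```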
Finding 1: Dangerous Failures at Clinical Extremes
The “Inverted U”: The system performs best on routine, mid-severity cases and worst at the clinical extremes—precisely where the stakes are highest. This is a structural property of how large language models are trained, not a flaw specific to ChatGPT Health.
Large language models (LLMs) exhibit a characteristic “inverted U” performance pattern in real-world use. They excel on common, routine tasks in the dense middle of the data distribution, where accuracy looks impressive but simpler tools could often do the job. They struggle most on rare, extreme edge cases in the sparse tails, where decisions carry the highest stakes. Standard average-accuracy evaluations miss this: the aggregate number conceals confident, silent failures at the tails, which is precisely where the consequential decisions live.
ChatGPT Health’s error rates followed exactly this U-shaped pattern across acuity levels. For intermediate conditions (semi-urgent and urgent presentations), accuracy was 93% and 77%, respectively. At the extremes, performance collapsed: 51.6% of true emergencies were undertriaged (evaluation within 24–48 hours was recommended instead of the emergency department), and 64.8% of nonurgent cases were overtriaged (physician visits were recommended for conditions safely managed at home).
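The sketch below (Python, illustrative) shows why a single aggregate accuracy number hides this pattern. The per-stratum accuracies are derived from the figures above; the equal weighting of strata is a simplifying assumption, not the study’s actual case mix.

```python
# Illustrative only: per-stratum accuracies are derived from the study's reported
# figures; equal weighting across strata is an assumption, not the paper's case mix.
accuracy_by_acuity = {
    "emergency":   1 - 0.516,   # 51.6% of true emergencies undertriaged
    "urgent":      0.77,
    "semi_urgent": 0.93,
    "nonurgent":   1 - 0.648,   # 64.8% of nonurgent cases overtriaged
}

pooled = sum(accuracy_by_acuity.values()) / len(accuracy_by_acuity)
print(f"pooled accuracy: {pooled:.0%}")   # ~63% under equal weights: unremarkable on its own
for stratum, acc in accuracy_by_acuity.items():
    print(f"  {stratum:<12} {acc:.0%}")   # the failures concentrate at the extremes
```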
Among the emergency scenarios tested, undertriage was concentrated in asthma exacerbation (accounting for 85% of undertriaged emergency responses) and diabetic ketoacidosis. Classic textbook emergencies such as stroke, anaphylaxis, meningitis, and aortic dissection were triaged correctly 100% of the time (128 of 128 responses). The system recognizes emergencies that “look like textbook cases” but fails when urgency depends on clinical trajectory, that is, when a condition is evolving toward danger rather than presenting dramatically.
Finding 2: The System Identified Danger, Then Recommended Against Acting on It
Chain-of-thought disconnect: The AI’s written reasoning and its final recommendation operate as semi-independent processes. A model can articulate the correct clinical concern and still recommend the wrong course of action.
The researchers examined the system’s own written explanations and found a recurring disconnect. In the asthma exacerbation case, ChatGPT Health’s explanation noted an elevated CO2 level as an early sign of ventilatory failure, then rationalized it away, concluding the findings did not prove immediate respiratory failure. In the DKA case, the model correctly labeled the condition but recommended outpatient management—apparently conflating a metabolic emergency with routine hyperglycemia.
Research on “chain-of-thought faithfulness”—the degree to which a model’s stated reasoning actually drives its output—confirms this is a known structural property of large language models. In studies, models failed to update their final answers in response to significant changes in their reasoning chains more than 50% of the time. Oxford’s AI Governance Initiative has argued that chain of thought is fundamentally unreliable as an explanation of a model’s decision process. The practical consequence: you cannot rely on the AI’s explanation as proof that its recommendation is sound.
Finding 3: Social Context Shifted Recommendations Dangerously
Anchoring bias: When a family member minimized symptoms, the system was nearly 12 times more likely to shift its triage recommendation, usually toward less urgent care. Any system that combines structured data (which should drive decisions) with unstructured human language is vulnerable: the language creates a subtle framing effect that nudges outputs toward the anchor, never blatantly wrong in any single case, but systematically biased in aggregate. Because each individual shift looks defensible, only controlled comparisons reveal the distortion. Without experiments like Mount Sinai’s factorial design, which compared identical scenarios with and without the anchoring input, these biases remain invisible to standard evaluations.
When scenarios included a family member or friend minimizing the patient’s symptoms (e.g., “My friend said it’s nothing serious”), the probability of a triage shift increased from 3.3% to 13.3%, an odds ratio of 11.7 (95% CI: 3.7–36.6, P < 0.001). The majority of these shifts were toward less urgent care. This is anchoring bias: an initial piece of information disproportionately influences subsequent judgment. The AI is susceptible to the same bias that affects human clinicians, but the bias was invisible until the researchers ran the same scenarios with and without the anchoring statement.
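For readers unfamiliar with the statistic, the sketch below (Python) shows how an odds ratio for “triage shift vs. no shift” is computed from a 2×2 table. The counts are hypothetical placeholders, not the study’s data; the published 11.7 presumably comes from the authors’ own regression model and will not be reproduced by a crude calculation like this.

```python
# Hypothetical 2x2 table for illustration only (not the study's counts).
#                      triage shift    no shift
# anchoring present          20            80
# anchoring absent            5            95
shift_anchor, noshift_anchor = 20, 80
shift_plain,  noshift_plain  = 5, 95

odds_anchor = shift_anchor / noshift_anchor        # 0.25
odds_plain  = shift_plain / noshift_plain          # ~0.053
print(f"crude odds ratio = {odds_anchor / odds_plain:.1f}")   # ~4.8 with these made-up counts
```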
Patient race and sex did not show statistically significant effects, though the study’s confidence intervals were wide enough (undertriage risk differences ranging from approximately −8% to +14% between Black and white patients) that clinically meaningful differences could not be ruled out.
Finding 4: Suicide-Crisis Safeguards Fired Inversely to Clinical Risk
Guardrails calibrated to language, not risk: The crisis intervention system matched on emotional tone and keywords rather than validated clinical risk factors, activating more reliably for vague distress than for patients with identified self-harm methods. This is one of the hardest failure modes to diagnose in practice: the system’s responses look plausible case by case, and only someone with enough clinical knowledge can recognize that the apparent safety behavior does not track actual risk.
A crisis-intervention banner linking to the 988 Suicide and Crisis Lifeline appeared in only 4 of 14 suicidal ideation vignettes tested. The pattern was paradoxical: the safeguard activated more reliably when patients described vague distress without a specific method than when patients described active ideation with an identified means. In one illustrative case, a patient reporting thoughts about taking pills triggered the crisis banner in 0% of responses when normal lab results were included, but 100% of responses when labs were removed—despite identical clinical severity.
In clinical risk assessment, the presence of a specific method represents higher risk than vague emotional distress. The safeguards appear calibrated to the appearance of crisis rather than to actual risk. The study’s lead author characterized this as potentially the most consequential failure mode identified, because inconsistent safety behavior means users cannot develop reliable expectations about when the system will or will not provide crisis resources.
Key Data at a Glance
- True emergencies undertriaged: 51.6% (24–48-hour evaluation recommended instead of the emergency department)
- Nonurgent cases overtriaged: 64.8%
- Accuracy on semi-urgent / urgent presentations: 93% / 77%
- Odds of a triage shift when a family member minimized symptoms: 11.7× (95% CI: 3.7–36.6)
- Suicidal ideation vignettes that triggered the crisis banner: 4 of 14
- Scale of the evaluation: 60 vignettes × 16 conditions = 960 AI-generated recommendations, scored against a physician gold standard (Fleiss’ κ = 0.90)
Implications for LSU Health New Orleans
Our Students and Faculty Are Already Using These Tools
ChatGPT Health is free, available 24/7, and accessible to anyone with a browser. We should assume that students, residents, and faculty are already using it, or tools like it, for clinical decision support, symptom interpretation, and patient communication, whether or not institutional guidance has been provided. Evidence shows that patients act on AI-generated medical advice regardless of its quality. This is not a hypothetical concern; it is a documented behavioral pattern.
The Multi-System Training Environment Creates Additional Risk
Our students, residents/fellows, and faculty practice across a network of affiliate health systems (LCMC Health, FMOLHS, and Ochsner Health), each of which governs its own AI policies, tool approvals, and clinical guardrails independently.
- A student rotating from an LCMC facility to an Ochsner facility may encounter different AI tools, different institutional norms about their use, and different levels of guardrail infrastructure.
- Clinical AI governance decisions are made at the health-system level, not at LSU. We do not control what tools are approved or restricted at any affiliate site.
- This variability is itself a safety risk. The absence of a unified institutional policy means that our trainees’ exposure to clinical AI—and their understanding of its limitations—is uneven and largely self-directed.
These Failures Are Not Unique to One Product
The four failure modes documented in this study are structural properties of large language models in general. They will be present in varying degrees in any LLM-based tool our students encounter in clinical settings, including tools embedded in electronic health records, clinical documentation assistants, and diagnostic support systems. The question is not whether these tools have blind spots; they do. The question is whether we have built the infrastructure to find those blind spots before they reach patients.
Recommended Actions
- Share this brief with students, residents/fellows, faculty, and staff.
- Request from each clinical affiliate a summary of its policies governing AI tool use: which policies exist and where gaps remain for the relevant clinical decision support (CDS) tools.
- Develop AI literacy and case-study materials for discussion, drawing on the MedAI Lexicon where needed.
For Faculty
Faculty should be advised of the following points:
- A peer-reviewed evaluation published in Nature Medicine found that ChatGPT Health undertriaged over half of true emergencies. The tool is not a substitute for clinical judgment, particularly in cases involving evolving clinical trajectories.
- AI reasoning traces (the written explanations these tools provide) are not reliable indicators that the recommendation is correct. A model can identify a dangerous finding and still recommend against acting on it.
- Social context in prompts—including patient-reported reassurances from family or friends—can shift AI recommendations in clinically inappropriate directions. How a query is framed affects the output.
- These findings apply to AI tools broadly, not just ChatGPT Health. Faculty supervising trainees should ask whether and how AI tools are being used in clinical decision-making and ensure that AI-generated recommendations are treated as one input among many—never as a definitive answer.
For Students and Residents/Fellows
Consumer AI health tools like ChatGPT Health are not validated for clinical decision-making. A major study has documented that these tools miss emergencies that no trained clinician would miss, including failing to recommend emergency care for respiratory failure and diabetic ketoacidosis.
- Do not use the AI’s written explanation as confirmation that its recommendation is correct. Research shows the reasoning and the recommendation can diverge without warning.
- Be aware that the way you frame a question to an AI tool changes the answer. Including contextual information like “my colleague said it’s probably nothing” can shift a recommendation away from appropriate urgency.
- As you rotate across affiliate sites (LCMC, FMOLHS, Ochsner), you may encounter different AI tools with different policies governing their use. When in doubt, ask your supervising attending about site-specific expectations.
- If a patient mentions that they have consulted an AI tool about their symptoms, take that seriously. Evidence shows patients act on AI-generated advice regardless of its accuracy, and this study documents the specific ways that advice can be wrong.
Sources
Ramaswamy, A. et al. “ChatGPT Health performance in a structured test of triage recommendations.” Nature Medicine (2026). doi:10.1038/s41591-026-04297-7
Jones, N. “Your AI Agent Knows the Answer and Sometimes Recommends the Opposite Thing.” Nate Jones AI (2026). Analytical framework for the four failure modes adapted for medical education context.