OpenEvidence Assessment
Clinical Education & Technology Analysis · LSU Health

OpenEvidence: A Clinical and Strategic Assessment for Academic Medical Practice

Evidence synthesis, failure modes, governance, and deployment considerations across Louisiana health systems
Ram Paragi · [email protected]
April 9, 2026 · Prepared for clinical department chairs, program directors, faculty, residents, fellows & students · Ochsner · LCMC Health · FMOL Health / OLOLRMC · Lake Charles Memorial
Abstract
NOTICE: PERSONAL EDUCATIONAL ASSESSMENT — NOT AN INSTITUTIONAL MANDATE

OpenEvidence is the most widely adopted AI-powered clinical decision support platform among U.S. physicians as of April 2026, reporting daily use by more than 40% of the nation's physician workforce and over 20 million clinical consultations per month. The platform is valued at $12 billion following a $250 million Series D round in January 2026. This report synthesizes available evidence on its technical architecture, clinical performance, business model, competitive positioning, and failure taxonomy. It integrates current AI implementation data from Ochsner Health, LCMC Health, FMOL Health / Our Lady of the Lake, and Lake Charles Memorial Hospital — the four health systems across which readers of this report rotate. The report concludes with specific guidance for medical educators, program directors, residents, and students on appropriate use, required verification practices, PHI compliance, and the risks of automation bias and diagnostic deskilling.

Key terms: clinical decision support, artificial intelligence, large language models, retrieval-augmented generation, medical education, physician burnout, prior authorization, coding intelligence
Section 1

Platform Overview

Section at a glance
OpenEvidence started as a clinical search engine and has grown into a six-product clinical operating suite in under two years. Understanding what it actually is — versus what people think it is — matters for how you use it.
What works for you
  • Free, immediate access — no procurement wait
  • Covers the full clinical day: search, documentation, coding, communications
  • Dotflows let you customize for your specialty
  • Prior auth automation reduces administrative time
What works against you
  • More functions = more failure points now chained together
  • Coding Intelligence errors propagate into billing claims
  • The tool you used last month is different from the tool today
  • Free access creates HIPAA blind spots for trainees
What you should do
  • Know which product you are using at any given moment
  • Never enter PHI unless your institution has an active BAA
  • Treat each product function separately — the search engine has different reliability than Coding Intelligence

OpenEvidence is an AI-powered medical search and evidence synthesis platform that answers point-of-care clinical questions by synthesizing peer-reviewed literature from licensed sources and returning responses with inline citations. Access is free for NPI-verified U.S. healthcare professionals. The platform was founded in 2021 by Daniel Nadler — who previously built Kensho, a financial data AI company acquired by S&P Global for $700 million — and Zachary Ziegler (CTO). The company was incubated through the Mayo Clinic Platform Accelerate program, which remains an investor.

The platform's primary clinical use case is real-time evidence retrieval at the point of care. A physician types a natural-language question; the platform returns a synthesized, cited response within seconds. But OpenEvidence in 2026 is considerably more than a search engine. Six distinct clinical functions now run from the same platform:

2022–2024 · Core product
Evidence search engine
Natural-language queries answered with citations from 35 million licensed peer-reviewed publications. Deterministic citation linking — answers rejected if not properly sourced.
July 2025
DeepConsult
Agentic reasoning layer that cross-references hundreds of studies in parallel. Each run requires approximately 100× the compute of a standard search query. Free to all verified U.S. clinicians.
August 2025
Visits (ambient documentation)
Transcribes patient encounters and generates clinical notes with evidence integration inline, including assessment and plan enrichment with current guidelines. Supports custom note templates.
February 2026
Doctor Dialer
HIPAA-secure communications layer: calling, messaging, faxing, voicemail with clinical AI integrated throughout. As of the wide release, over 37 million minutes of doctor-patient communication logged.
March 26, 2026
Coding Intelligence
Surfaces ICD-10, E&M level, and CPT code suggestions inline within the Visits documentation workflow. Writes Medical Decision-Making rationale directly into the clinical note.
April 2–3, 2026
Tandem prior authorization integration
Partnership with Tandem automates prior authorization from prescription generation through payer submission, denial appeal, and pharmacy routing. Announced April 2; confirmed live April 3.
April 7, 2026
Dotflows
Reusable natural-language prompt templates that customize how OpenEvidence responds. Clinicians invoke them by typing "." in the search bar. A community marketplace allows clinicians to share vetted templates.
Section 2

Scale and Adoption

Section at a glance
OpenEvidence is genuinely dominant among U.S. physicians — not because hospitals deployed it, but because physicians chose it voluntarily. That tells you something real about its clinical utility. It also means usage outpaced governance.
What works for you
  • 40%+ of U.S. physicians use it daily — peer validation at scale
  • Accounts for 44.9% of all physician AI usage — dominant in a fragmented market
  • Adopted equally across experience levels and specialties
  • Usage peaks during clinical hours — behavior matches intended use
What works against you
  • Viral adoption bypassed institutional review at most hospitals
  • 21% of physicians surveyed are highly skeptical — the concerned minority often has valid concerns
  • Adoption speed ≠ safety validation speed
  • Psychiatrists raised the most pointed concerns around bias and liability
What you should do
  • Don't let peer adoption substitute for your own critical evaluation
  • Ask your program director whether your rotation site has a formal AI use policy
  • If no policy exists, operate as if you are on a non-BAA individual account
40%+
of U.S. physicians using platform daily
Apr 2026
760K+
NPI-verified registered healthcare professionals
Dec 2025
20M+
clinical consultations per month
Jan 2026
1M
consultations in a single 24-hour period
Mar 10, 2026
10,000+
hospitals and medical centers
Ongoing
$12B
valuation · Series D · Jan 2026
~120× ARR
Data quality note

The 40% daily physician penetration figure is cited by OpenEvidence in press communications; no published methodology specifies how "daily use" is defined or sampled. It is broadly consistent with independent sources (the 2025 Physicians AI Report across 1,000+ physicians; a 2026 hospitalist survey at a large urban academic center), but it should be understood as an approximation, not a precisely validated census figure.

What the independent physician data shows

A 2026 survey of hospitalists at a large urban academic tertiary care center found that 66.7% of respondents used AI in clinical practice, with OpenEvidence used by 51.9% of the total cohort — more than any other tool by a wide margin. The survey found no significant differences in AI usage by years of practice, shift type, sex, or provider designation. The assumption that younger physicians drive adoption disproportionately did not hold.

The 2025 Physicians AI Report (1,000+ physicians, 106 specialties) identified 71 unique AI applications in use. OpenEvidence alone accounted for 44.9% of all reported physician usage, more than any other single tool and nearly as much as the remaining 70 tools combined.

Physician sentiment

A Sermo poll found 20% of physicians described themselves as very supportive of OpenEvidence, 54% as cautiously open, and 21% as highly skeptical or concerned. Primary care physicians most frequently cited it as a major time-saver. Psychiatrists raised the most concerns — specifically around database preparation, inherent biases, and liability implications.

Figure 1 — Physician AI tool usage distribution (2025 Physicians AI Report, N=1,000+)
Fig. 1. OpenEvidence accounts for 44.9% of reported AI usage across 106 medical specialties. The remaining 55.1% is distributed across 70 other applications.

Funding trajectory

Figure 2 — Funding rounds and valuation, 2025–2026
Fig. 2. OpenEvidence raised approximately $700M over 12 months. The $12B Series D valuation represents ~120× ARR, pricing in a platform outcome not yet fully secured.
Section 3

Technology and Architecture

Section at a glance
The licensed journal content is the real moat — not the AI itself. The technology is sophisticated, but the relationships with NEJM, JAMA, and NCCN are what no competitor can quickly replicate. The RAG architecture has real limitations that the benchmarks don't fully expose.
What works for you
  • Trained on licensed NEJM/JAMA full-text — not the open internet or Wikipedia
  • Refuses to answer when it cannot source a response (vs. hallucinating)
  • NCCN treatment algorithms are retrievable as structured decision logic
  • Computer vision models can parse figures and tables from papers
What works against you
  • Graph RAG claim is unverified — MedXpertQA performance suggests limits
  • Chunking strategy for structured medical documents not publicly described
  • Citation presence does not guarantee citation accuracy or applicability
  • No published recall, precision, or retrieval quality metrics
What you should do
  • Always click the citation — verify the source actually supports the claim
  • Check the publication date of the cited paper — guidelines evolve
  • For NCCN queries, verify the platform is citing the current guideline version
  • Treat complex multi-system queries with additional skepticism

OpenEvidence's core AI stack is trained exclusively on licensed medical texts — not the public internet. The architecture is a multi-agent hub-and-spoke system: a central "conductor" AI performs intent analysis and routes each physician query to the most relevant subspecialty model before assembling a final response. The platform describes 160+ subspecialty models. All responses are rejected if they cannot be linked to a verified source citation — the system refuses to answer rather than hallucinate an unsourced response.
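
The routing-plus-guardrail behavior described above can be pictured with a short sketch. This is an illustration of the stated design, not OpenEvidence's code: the subspecialty keyword table, the Response type, and the retrieve callable are all hypothetical.

```python
# Illustrative sketch of conductor-style routing plus a citation-level guardrail.
# Nothing here is OpenEvidence's actual implementation; the routing table,
# Response type, and retrieval callable are invented for demonstration.
from dataclasses import dataclass

@dataclass
class Response:
    answer: str
    citations: list          # identifiers of licensed source documents

SUBSPECIALTY_KEYWORDS = {    # hypothetical "spokes" the conductor can route to
    "cardiology": ("troponin", "heart failure", "pcsk9"),
    "oncology": ("nccn", "chemotherapy", "metastatic"),
}

def route_query(query: str) -> str:
    """Conductor step: pick the subspecialty model whose keywords match the query."""
    q = query.lower()
    for specialty, keywords in SUBSPECIALTY_KEYWORDS.items():
        if any(k in q for k in keywords):
            return specialty
    return "general"         # fallback spoke

def answer_with_guardrail(query: str, retrieve) -> Response | None:
    """Refuse to answer rather than return an unsourced response."""
    draft = retrieve(query, route_query(query))
    if not draft.citations:  # citation presence check only -- not a quality check
        return None
    return draft
```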

The backend runs on Google Cloud Platform; the frontend on Next.js with Vercel Fluid compute (a production infrastructure detail confirmed by internal audit records, not marketing claims). The system is multimodal and multicloud, as stated by CEO Daniel Nadler at the JP Morgan Healthcare Conference in January 2026.

Content licensing: the primary structural moat

Partner | Content scope | Strategic significance
NEJM Group | Full text, figures, tables from NEJM, NEJM Evidence, NEJM AI, NEJM Catalyst, NEJM Journal Watch — back to 1990 | NEJM named OE "best AI tool for medical information." Formal licensed agreement, not a web scrape.
JAMA Network | Full text from JAMA + all 11 specialty journals (Oncology, Neurology, Cardiology, etc.) | Covers the most-cited specialty journals in clinical medicine.
NCCN | Treatment algorithms, flowcharts, pathways — including oncological reasoning agents built around guideline structure | NCCN guidelines are the standard of care for oncology. No other AI has licensed the algorithm structure.
Cochrane | Full-text systematic reviews and meta-analyses, figures, tables | Highest level of published evidence synthesis. Differentiates OE from PubMed-only competitors.
ACC, ADA, AAFP, ACEP, ASAM, AAOS, GINA, NORD, SSO | Clinical guidelines, specialty society standards | Specialty society partnerships require individual negotiation. Breadth matters for cross-specialty queries.
Wiley, AMA | Broad peer-reviewed biomedical literature | Corpus expansion — 35M+ publications total.
Table 1. OpenEvidence content licensing partnerships (as of April 2026). Each agreement required individual institutional negotiation. Competitors — including OpenAI, Google, and Anthropic — must replicate each separately.
Why this moat matters

These institutions have a structural interest in OpenEvidence succeeding. Unlike OpenAI, Anthropic, or Google — which train competing frontier models and sell to hospital enterprise competitors — OpenEvidence uses licensed content for retrieval, not as training data for a competing AI platform. This distinction makes the licensing relationship less conflicted and more durable than it would be with a general-purpose AI lab.

Graph RAG and multi-hop reasoning: the claim and the gap

OpenEvidence describes its retrieval system — called SystemAI — as a graph-based retrieval-augmented generation architecture. Medical knowledge graphs map relationships between diseases, symptoms, drugs, and biological pathways. The system traverses these relational pathways to aggregate evidence across multiple documents before the generative phase. The intended capability: answering queries that require connecting a genetic marker to a drug's metabolic pathway to a secondary comorbidity — connections that are not explicitly stated in any single source document.
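
To make the multi-hop idea concrete, here is a toy traversal over a three-node knowledge graph. The graph, relations, and example pathway are invented; this illustrates the kind of evidence chain a graph RAG system is described as assembling, not the SystemAI implementation.

```python
# Toy multi-hop traversal over a hypothetical medical knowledge graph.
from collections import deque

GRAPH = {  # node -> list of (relation, neighbor); all edges are illustrative
    "CYP2C19 poor metabolizer": [("reduces activation of", "clopidogrel")],
    "clopidogrel": [("indicated after", "coronary stenting")],
    "coronary stenting": [("risk if antiplatelet underdosed", "stent thrombosis")],
}

def evidence_chains(start: str, max_hops: int = 3):
    """Breadth-first walk returning every path (evidence chain) up to max_hops."""
    chains, queue = [], deque([(start, [start], 0)])
    while queue:
        node, path, hops = queue.popleft()
        if hops == max_hops:
            continue
        for relation, neighbor in GRAPH.get(node, []):
            extended = path + [relation, neighbor]
            chains.append(extended)           # each chain could seed retrieval
            queue.append((neighbor, extended, hops + 1))
    return chains

for chain in evidence_chains("CYP2C19 poor metabolizer"):
    print(" -> ".join(chain))
```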

Technical gap — not independently verified

The graph RAG claim is the company's own description of SystemAI. No published independent technical audit of the architecture exists. Critically, the clinical performance data from MedXpertQA (see Section 4) shows the system fails on precisely the multi-system, multi-document reasoning this architecture implies it should handle — suggesting the graph traversal capability may be limited to certain query types, or that it does not close the gap on complex subspecialty reasoning.

Additional technical questions the available evidence does not answer: how the licensed corpus is chunked across document types (NCCN algorithms structured as decision trees require different chunking than NEJM trial reports), whether embedding models are domain-specific or general-purpose, and what the failure recovery architecture looks like for a system processing 20+ million consultations per month.

The Alexandria / Atropos integration

When published literature cannot answer a clinical question — which is the case for an estimated 80% of daily decisions in some specialties — OpenEvidence queries Alexandria, a real-world evidence repository from Atropos Health containing over 10 million observational studies generated from EHR and claims data. A pipeline analysis of approximately 3,000 complex physician questions found that PubMed-based retrieval answered approximately 44% of queries; the Alexandria integration provided actionable answers for an additional 50.1%. These figures come from Atropos Health's own published research and represent the best available evidence, not independent third-party validation of OpenEvidence's specific implementation.
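
A minimal sketch of that two-tier retrieval pattern, with placeholder search functions, looks like this; the tier labels and threshold are assumptions, not the documented pipeline.

```python
# Literature-first, real-world-evidence-fallback retrieval pattern (sketch).
def answer_question(query, search_literature, search_alexandria, min_hits=1):
    """Try the licensed literature corpus first; fall back to observational RWE."""
    hits = search_literature(query)
    if len(hits) >= min_hits:
        return {"tier": "peer-reviewed literature", "evidence": hits}
    rwe = search_alexandria(query)      # observational studies from EHR/claims data
    if rwe:
        return {"tier": "real-world evidence (lower certainty)", "evidence": rwe}
    return {"tier": None, "evidence": []}  # explicit "cannot answer"
```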

Section 4

Clinical Performance: What the Evidence Actually Shows

Section at a glance
The 100% USMLE score is real but irrelevant to how you will actually use this tool. The number that matters for clinical practice is 34% — the accuracy on complex subspecialty board questions. The tool reinforces what you already think. It rarely catches what you missed.
What works for you
  • Very high scores on clarity (3.75/4.0) and relevance (3.75/4.0) in prospective study
  • Excellent for validating a hypothesis you've already formed
  • Fast retrieval of evidence-based support for documentation
  • Strong for common conditions with abundant published evidence
What works against you
  • Impact on altering clinical decision: 1.95/4.0 — it confirms, rarely redirects
  • 34% accuracy on complex subspecialty board questions
  • Never outputs "I don't know" — generates confident answers regardless
  • Only 25% agreement with a comparator AI on the same cases
What you should do
  • Form your differential first, then use OE to check it — not the other way around
  • For subspecialty or complex presentations, require primary source verification
  • Do not use OE output as a substitute for a specialist consult in your blind spot areas
  • Teach and document this hierarchy to your residents

OpenEvidence achieved 100% on the United States Medical Licensing Examination using the Kung et al. dataset — a benchmark drawn from publicly available USMLE Step 1, Step 2, and Step 3 questions in multiple-choice format. The system not only answered correctly but generated accurate reasoning chains explaining the underlying physiology.

This benchmark evaluates encyclopedic recall of established medical facts and recognition of classic presentations. It does not evaluate multi-step heuristic reasoning under diagnostic ambiguity, performance on atypical presentations, or the kinds of clinical judgment exercised by attending physicians managing complex inpatients. No study has independently replicated this result with a different question set.

MedXpertQA: where performance falls

The more informative evaluation used the MedXpertQA dataset — drawn from specialty board examinations, with ten possible answer choices (A through J) to eliminate guessing. Two independent physicians evaluated responses manually.

Figure 3 — OpenEvidence accuracy by body system (MedXpertQA benchmark)
Fig. 3. OpenEvidence highest overall accuracy: 34%. Best performance: muscular system (42.8%). Worst: skeletal system (21.9%). Both models occasionally fabricated an answer not among the ten choices (OpenEvidence on roughly 2% of questions, the comparator on 4–6%), and neither ever responded "I don't know."
Metric | OpenEvidence | DeepConsult (comparator)
Highest overall accuracy | 34% | 41%
Best subsystem performance | 42.8% (muscular) | 55% (digestive)
Worst subsystem performance | 21.9% (skeletal) | 30% (respiratory)
Evaluator concordance (repeatability) | 77% | 72%
Discordance between the two AI models | 75% (they agreed on only 25% of cases)
Fabricated "K" answer (not among choices) | ~2% | 4–6%
"I don't know" responses | 0% | 0%
Table 2. MedXpertQA evaluation — complex subspecialty board-level questions with 10 answer choices. N not specified in the preprint source.
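
For readers who want to reproduce this style of evaluation on their own question sets, a toy scoring harness for a ten-option item is sketched below. The answers fed to it are invented; it simply shows how fabricated options and abstentions get counted.

```python
# Toy scoring for a 10-option (A-J) benchmark item; flags fabricated answers
# ("K"-style responses) and explicit abstentions. Example data is invented.
VALID_CHOICES = set("ABCDEFGHIJ")

def score_item(model_answer: str, correct: str) -> dict:
    a = model_answer.strip().upper()
    return {
        "correct": a == correct,
        "fabricated": a not in VALID_CHOICES and a != "I DON'T KNOW",
        "abstained": a == "I DON'T KNOW",
    }

graded = [score_item(a, c) for a, c in [("B", "B"), ("K", "C"), ("D", "A")]]
n = len(graded)
print(f"accuracy={sum(g['correct'] for g in graded)/n:.0%}, "
      f"fabricated={sum(g['fabricated'] for g in graded)/n:.0%}")
```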

Point-of-care impact study

A prospective observational cohort study (NCT07199231) at Cambridge Health Alliance enrolled PGY-1 through PGY-6 residents in internal medicine, family medicine, adult psychiatry, and child psychiatry. Complementary retrospective analyses graded OpenEvidence outputs across five clinical domains:

Figure 4 — Cambridge Health Alliance resident cohort: output quality ratings (scale 0–4.0)
  • Relevance to clinical query: 3.75 / 4.0
  • Clarity of response: 3.55 / 4.0
  • Evidence-based support: 3.35 / 4.0
  • Overall physician satisfaction: 3.30 / 4.0
  • Impact on altering clinical decision: 1.95 / 4.0
Fig. 4. OpenEvidence functioned primarily as a high-speed validation engine — reinforcing pre-existing physician hypotheses — rather than substantially redirecting clinical reasoning. Four blinded physicians reviewed cases retrospectively.
What "impact score of 1.95" means clinically

The tool reinforced correct diagnoses and provided citable evidence efficiently. It rarely caught overlooked diagnoses or redirected a physician toward a substantially different management approach. For experienced clinicians, this is appropriate use — rapid validation with sourced backup. For trainees who have not yet formed an independent differential, it removes the cognitive effort that builds clinical reasoning skill over time.

Section 5

Failure Mode Taxonomy

Section at a glance
Four specific failure patterns explain most of the clinical risk. These are not random glitches — they are predictable, structural, and detectable if you know what to look for. The most dangerous is FM-1: the model is most confident precisely where it is most likely to be wrong.
What is predictable
  • Failures cluster at distribution tails — rare codes, complex cases, recent approvals
  • Citation guardrail works reliably for source presence
  • Common conditions are well-handled and failures are infrequent
  • Failure patterns are consistent enough to build guardrails around
What creates danger
  • FM-1: Highest confidence = highest error risk at the tails
  • FM-2: Model generates a prior auth letter that will get denied
  • FM-3: What you say in the exam room shapes the note and the claim
  • FM-4: A hallucinated MDM rationale looks identical to a real one
What you should do
  • Be most skeptical when the answer sounds most authoritative
  • Audit coding suggestions for complex encounters before submission
  • Review prior auth letters against payer criteria, not just clinical logic
  • Consider adversarial testing: give the tool a known-difficult case and see how it fails

OpenEvidence's six-product suite creates a chained failure surface: search outputs inform Visits notes, which feed Coding Intelligence suggestions, which populate Tandem prior auth letters, so an error in one function can propagate downstream through the rest. The four failure modes below apply across this integrated system.

FM-1 · Inverted U: tail-case overconfidence

The platform performs well on the middle of the training distribution — common chronic disease management (hypertension, type 2 diabetes, hyperlipidemia), standard E&M coding (99213/99214), routine prior auth letters for formulary-tier drugs. At the tails — rare presentations, uncommon CPT codes, recently approved treatments, complex subspecialty presentations — confidence does not decrease. The model generates confidently phrased, well-cited responses even when its training signal is thin.

Concrete example: A cardiologist queries a PCSK9 inhibitor combination approved eight months ago. OpenEvidence returns a confident, well-cited response based on pre-approval trial data, missing two post-marketing safety signals published after the training cutoff.

Why USMLE benchmarks mask this: USMLE tests the middle of the distribution. MedXpertQA tests the tails — and accuracy there is 21–34%.

Detection difficulty: Hard. Standard benchmarks actively hide this failure mode. Requires specialty-specific adversarial testing with tail cases.
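
One way a program or department could operationalize that adversarial testing is sketched below. The confidence field, threshold, and case format are assumptions; any real harness would need specialty-curated tail cases and clinician adjudication.

```python
# Sketch of FM-1 adversarial testing: run curated tail cases through the tool
# and flag answers that are both wrong and confidently worded. The case format,
# confidence field, and 0.8 threshold are illustrative assumptions only.
def tail_case_audit(query_tool, tail_cases, confidence_cutoff=0.8):
    """tail_cases: list of {'question': str, 'expected': str} dicts."""
    flagged = []
    for case in tail_cases:
        result = query_tool(case["question"])     # assumed: {'answer', 'confidence'}
        wrong = result["answer"] != case["expected"]
        confident = result.get("confidence", 0.0) >= confidence_cutoff
        if wrong and confident:
            flagged.append(case["question"])      # confident + wrong = FM-1 signature
    return flagged
```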

FM-2 · Reasoning-output gap

The model's internal chain of thought correctly identifies a risk signal, but the final output overrides it with the statistically dominant response. In coding: the reasoning trace flags multiple comorbidities managed, data reviewed from external records, and high MDM complexity — but the final E&M code lands at 99214 rather than 99215 because 99215 is the statistical minority in training data. In prior authorization: the reasoning chain identifies that the requested biologic has limited evidence for the patient's specific indication variant, but the prior auth letter is confidently written because generating a supporting letter is the task the model was trained to do. The payer will likely deny it.

Detection difficulty: Hard. Requires logging and auditing the reasoning trace separately from the output. If only evaluating outputs — denial rates, coding accuracy — FM-2 is nearly invisible.
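
A sketch of what separate trace-versus-output auditing could look like is below. The signal phrases and code sets are placeholders; the point is that the comparison has to run on logged reasoning traces, not on the final codes alone.

```python
# Sketch of FM-2 auditing: compare logged reasoning traces against final E&M
# codes and flag encounters where the trace saw complexity the code ignored.
HIGH_COMPLEXITY_SIGNALS = (
    "multiple comorbidities", "external records reviewed", "high mdm",
)
MID_LEVEL_CODES = {"99213", "99214"}

def reasoning_output_gap(encounters):
    """encounters: iterable of {'trace': str, 'em_code': str} dicts."""
    discordant = []
    for enc in encounters:
        trace = enc["trace"].lower()
        saw_complexity = any(s in trace for s in HIGH_COMPLEXITY_SIGNALS)
        if saw_complexity and enc["em_code"] in MID_LEVEL_CODES:
            discordant.append(enc)   # the trace justified more than the output claimed
    return discordant
```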

FM-3 · Social context hijack and the pharmaceutical advertising vector

The Visits system ingests the full physician-patient encounter transcript. What the physician says in the room — how they frame the problem, what they emphasize or dismiss — shapes the note and coding. A physician who says "I think this is her anxiety again" while the patient's troponin is elevated steers the note toward a lower-acuity encounter. Prior notes in the FHIR-integrated chart — characterizations like "drug-seeking" or "frequent flier" — can anchor the current note's framing.

The pharmaceutical advertising vector is a structural version of this: A physician who viewed a diabetes drug advertisement during a previous search session carries that exposure into the next patient encounter. The encounter transcript may reflect the drug's marketed clinical positioning, which then propagates into the prior auth letter. This is not a clinical accuracy benchmark failure — it is a systematic prior-shifting mechanism that no citation guardrail detects.

Detection difficulty: Hard. Requires adversarial test cases where verbal framing contradicts structured lab/vital data.
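
A toy version of such a test case is shown below: a transcript framed as low acuity paired with structured data that contradicts it. The phrase list and the troponin cutoff are placeholders, not clinical rules.

```python
# Sketch of an FM-3 adversarial case: low-acuity verbal framing vs. a
# contradictory structured value. Phrases and cutoff are illustrative only.
LOW_ACUITY_PHRASES = ("probably anxiety", "her anxiety again", "low concern")

def framing_contradicts_data(transcript: str, labs: dict) -> bool:
    framed_low = any(p in transcript.lower() for p in LOW_ACUITY_PHRASES)
    red_flag = labs.get("troponin_ng_l", 0) > 52     # example threshold, not guidance
    return framed_low and red_flag

print(framing_contradicts_data("I think this is her anxiety again",
                               {"troponin_ng_l": 180}))   # -> True
```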

FM-4 · Guardrail miscalibration

OpenEvidence's safety architecture enforces citation grounding — responses are rejected if they cannot be sourced. This is a citation-level guardrail, not a risk-level guardrail. A response that cites a superseded guideline, accurately summarizes a methodologically flawed study, or presents a case-report-level drug interaction as equivalent to a well-replicated severe interaction will pass every citation check while still being clinically dangerous.

The MDM rationale written by Coding Intelligence into clinical notes is itself the guardrail: if the rationale sounds clinically coherent, it passes. A hallucinated MDM rationale that reads like a real one clears every surface-level review.

The BiPAP documented case: A hospitalist queried OpenEvidence for standard BiPAP settings for respiratory failure. The platform retrieved a specific clinical trial that used a narrow pressure range for its particular cohort, and presented those settings as the universal clinical recommendation. The response had citations. It looked authoritative. The settings were inappropriate for the general patient population.

Detection difficulty: Moderate. Citation presence is auditable; citation accuracy and clinical applicability require human clinical review.

Failure mode | Severity | Detection difficulty | Priority
FM-1 · Inverted U (tail-case overconfidence) | Critical | Hard | 1 (tie)
FM-2 · Reasoning-output gap | Critical | Hard | 1 (tie)
FM-3 · Social context hijack + pharma ad vector | Critical / High | Hard | 2
FM-4 · Guardrail miscalibration | High | Moderate | 3
Table 3. Failure mode priority matrix. FM-1 and FM-2 are tied for first because they co-occur in the Coding Intelligence workflow: a model may parse complexity correctly (FM-2 awareness) but assign a base-rate code (FM-2 action failure) with high confidence (FM-1 tail error).
Section 6

Business Model and Conflict of Interest

Section at a glance
The tool is free to you because pharmaceutical companies pay $70–$1,000+ CPM to reach you at the moment you are deciding what to prescribe. That structural fact is not an accusation — it is the business model. Understanding it is the minimum requirement for using the tool responsibly in an academic setting.
What is defensible
  • Company states content and ad systems are "fully unconnected"
  • Free access enables use in under-resourced settings
  • Advertising revenue cross-subsidizes features that benefit clinicians directly
  • No evidence of direct content manipulation has been published
What is concerning
  • No independent audit of the content-ad separation claim exists
  • Amaro acquisition brought contextual ad targeting in-house (diabetes query → diabetes drug ad)
  • Practice Fusion precedent: DOJ paid $145M for undisclosed pharma-influenced CDS
  • Longitudinal prescribing behavior data exists inside OE and has not been published
What you should do
  • Notice when ads appear — log the drug category and the query context
  • If your institution has a P&T committee, flag the advertising model for review before any enterprise deployment
  • Ask: would I trust this answer the same way if I knew who advertised on this query?

Pharmaceutical advertising is the revenue engine. CPMs range from $70 to over $1,000, targeting 760,000 NPI-verified U.S. prescribers at the precise moment they are answering a clinical question that may inform a prescribing decision. This is the most precisely targeted physician ad inventory in existence.

Figure 5 — Advertising CPM comparison: OpenEvidence vs. general digital platforms
Fig. 5. The CPM premium reflects verified prescriber identity at point-of-care — not just demographics. This premium is the business model's structural foundation and its primary liability.
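
To see how the CPM figures could plausibly connect to the revenue estimates in Table 4, here is a back-of-envelope calculation. Every input except the reported consultation volume and CPM range is an assumption (one ad impression per consultation, a blended CPM of $450), so treat the output as an order-of-magnitude check, not a disclosed figure.

```python
# Back-of-envelope: how a CPM model could reach the reported $100-150M ARR range.
monthly_consultations = 20_000_000      # reported platform volume (Jan 2026)
impressions_per_consultation = 1.0      # assumption
blended_cpm_usd = 450                   # assumption within the $70-$1,000+ range

annual_impressions = monthly_consultations * impressions_per_consultation * 12
implied_ad_revenue = annual_impressions / 1_000 * blended_cpm_usd
print(f"~${implied_ad_revenue / 1e6:.0f}M implied annual ad revenue")  # ~$108M
```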

The structural conflict

OpenEvidence states that "the OpenEvidence information system and the ad display system are fully unconnected systems" and that "advertisements shall not be considered an endorsement." This is a self-attestation. No independent audit of this claim has been published. The Amaro acquisition in September 2025 — an ad-tech startup focused on advertising infrastructure and automation — brought contextual targeting in-house: a diabetes query triggers a diabetes drug advertisement.

When a doctor searches "treatment options for Type 2 diabetes," pharmaceutical companies can surface their FDA-approved treatments right there in the results — Google AdWords meets clinical decision support at the exact moment of prescribing consideration. — Repositioning analysis, April 2026
Regulatory precedent: Practice Fusion

Practice Fusion, a clinical decision support company, paid a $145 million DOJ settlement for undisclosed pharmaceutical-influenced clinical decision support alerts. OpenEvidence is not accused of any comparable misconduct. But health system legal and compliance teams are aware of this precedent, and it is the lens through which institutional legal review of OpenEvidence deployments will occur. Any academic medical center deploying OpenEvidence enterprise-wide should document its analysis of the advertising-content separation claim before deployment.

The advertising-enterprise contradiction

The ad-supported free model that enabled 40%+ physician adoption is structurally incompatible with enterprise-level institutional deployment. Health system compliance teams routinely require ad-free environments as a contracting standard for clinical AI tools. This creates a structural fork: the two business models — pharma media and enterprise clinical AI — cannot both be primary. OpenEvidence has not publicly resolved this tension with a formally separated product architecture, though enterprise per-seat pricing exists for health systems like Mount Sinai.

Revenue stream | Estimated size | Structural durability | Risk
Pharma advertising (primary) | $100–150M ARR | Moderate — depends on physician trust staying intact | High regulatory/reputational
Enterprise EHR subscriptions | Emerging (Mount Sinai model) | High — if ad-free version available and Epic cooperates | Moderate — Epic gating risk
Veeva Open Vista (pharma commercial) | Pilot — first revenue expected 2026 | Potentially high via Veeva channel | Deepens pharma conflict exposure
API licensing | Early / stated future stream | High if developed | Low
Table 4. OpenEvidence revenue streams as of April 2026. The $100M figure is from January 2026 Series D disclosures; the $150M figure appears in one strategic analysis using a different methodology. Both are based on public reporting, not audited financials.
Section 7

Product Suite: Coding Intelligence and Prior Authorization

Section at a glance
Coding Intelligence and Tandem are OpenEvidence's pitch to hospital CFOs: the tool now generates revenue, not just saves time. For clinicians, this creates a new responsibility — AI-generated codes and prior auth letters enter the legal and financial record. A wrong code carries billing liability. A letter that misses a payer criterion delays patient care.
What works for you
  • Captures commonly missed CPT codes that physicians undercode out of habit
  • CCI rules engine reduces claim denials for incompatible code pairs
  • Prior auth automation reduces the most universally hated administrative task
  • RVU sequencing maximizes reimbursement within compliance rules
What works against you
  • MDM rationale is written into the permanent clinical note — errors become part of the legal record
  • High-complexity E&M code assignments are exactly where FM-1 and FM-2 collide
  • Prior auth letters may be well-written clinically but miss payer-specific denial criteria
  • Physicians bear the compliance liability for AI-suggested codes they approve
What you should do
  • Never auto-sign AI-generated coding without reading the MDM rationale
  • For complex encounters, validate E&M level against AMA MDM complexity criteria independently
  • Before submitting a prior auth letter, verify it addresses the payer's specific step therapy requirements
  • Know that approval of the code is your professional and legal responsibility, not the AI's

OpenEvidence has reoriented from a reference tool into a revenue-generating enterprise asset. Hospital CFOs are more willing to pay for AI that demonstrably captures missed billing revenue than for AI that saves physician time. This is the stated industry logic behind both Coding Intelligence and the Tandem partnership.

Coding Intelligence (launched March 26, 2026)

Feature | Mechanism | Financial impact
E&M leveling & MDM rationale | Analyzes visit transcript; suggests E&M level; writes MDM rationale directly into the clinical note | Ensures documentation supports the selected level; reduces successful payer audit challenges
CPT code suggestions | Surfaces context-dependent CPT codes based on documented actions; catches uncommon procedural codes | Captures missed reimbursement from habitually under-coded visits
RVU-optimized sequencing | Sequences multiple CPT codes by expected RVU impact | Maximizes revenue under Medicare's Multiple Procedure Payment Reduction rules
CCI compliance engine | Filters suggested codes through Correct Coding Initiative rules to remove incompatible procedure pairs | Reduces claim denials and compliance flags
Table 5. Coding Intelligence feature functions. The MDM rationale is written into the permanent clinical note — see FM-4 (Section 5) for the guardrail limitation this creates.
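
The two mechanical steps in Table 5 that lend themselves to illustration are the CCI pair filter and RVU-ordered sequencing. The sketch below uses invented code pairs and RVU values, not real NCCI edits or fee-schedule data, and is not OpenEvidence's implementation.

```python
# Sketch: drop CCI-incompatible code pairs, then sequence by descending RVU.
CCI_INCOMPATIBLE_PAIRS = {("99213", "99214")}             # placeholder edit pair
WORK_RVU = {"99214": 1.92, "20610": 0.94, "96372": 0.17}  # placeholder values

def filter_and_sequence(suggested_codes):
    kept = []
    # Walk codes from highest to lowest RVU, keeping a code only if it does not
    # form a prohibited pair with a higher-RVU code already kept.
    for code in sorted(suggested_codes, key=lambda c: WORK_RVU.get(c, 0), reverse=True):
        compatible = all((code, k) not in CCI_INCOMPATIBLE_PAIRS and
                         (k, code) not in CCI_INCOMPATIBLE_PAIRS for k in kept)
        if compatible:
            kept.append(code)
    return kept  # highest-RVU first, the order multiple-procedure reductions favor

print(filter_and_sequence(["96372", "99214", "20610"]))   # -> ['99214', '20610', '96372']
```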

Tandem prior authorization (live April 3, 2026)

The Tandem integration automates the prior authorization workflow in four steps: (1) the physician generates a prescription within the EHR; (2) Tandem's system identifies the required criteria from the OpenEvidence-supported clinical notes and auto-populates the payer's required form, flagging missing information; (3) on denial, the system auto-generates an evidence-backed appeal; (4) the system routes the approved prescription to the preferred pharmacy and enrolls the patient in applicable manufacturer savings programs.

Prior auth failure mode (FM-2)

The prior auth letter generation is particularly vulnerable to FM-2. The model may identify in its reasoning chain that the requested medication has limited evidence for the patient's specific indication variant — but the prior auth letter it generates is confidently written because generating a supporting letter is the trained task. The letter may be well-constructed clinically and still fail payer review because the model did not address the specific denial criteria for that payer and drug combination. Physicians should review AI-generated prior auth letters against payer-specific criteria before submission.
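
One practical way to act on that advice is a pre-submission checklist run against the drafted letter, as sketched below. The criteria strings are invented examples; real criteria come from the specific payer's policy for the specific drug.

```python
# Sketch: confirm an AI-drafted prior auth letter mentions each payer criterion.
EXAMPLE_PAYER_CRITERIA = (
    "step therapy",        # documented failure of the first-line agent
    "a1c",                 # required lab threshold
    "contraindication",    # why covered alternatives are unsuitable
)

def missing_criteria(letter_text: str, criteria=EXAMPLE_PAYER_CRITERIA):
    text = letter_text.lower()
    return [c for c in criteria if c not in text]

draft = "Patient failed metformin (step therapy); most recent A1c 9.2%."
print(missing_criteria(draft))    # -> ['contraindication']  (letter needs revision)
```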

Section 8

Competitive Landscape

Section at a glance
OpenEvidence leads on physician adoption and internet search volume, but trails UpToDate on independent clinical reasoning scores and editorial independence. The more important competitive dynamic is not UpToDate — it is Epic, which is building natively inside the EHR where OE needs to go.
OE's real advantages
  • 98.7% of all AI clinical reference searches — usage dominance is real
  • Free vs. $530/year for UpToDate Expert AI
  • Licensed content competitors (ChatGPT Health, Claude) cannot access without equivalent negotiations
  • Faster synthesis of newly published guidelines than editorially curated tools
Structural vulnerabilities
  • UpToDate Expert AI scores 71/100 vs. OE's 62/100 on clinical reasoning depth
  • Epic's Art agent — natively inside the EHR — is a direct threat to OE's embedding strategy
  • FMOL Health (your OLOLRMC rotation site) already adopted Epic's native AI scribe
  • Dragon Copilot lists OE as a content vendor, not a platform partner
What you should do
  • Use the tool that best fits the clinical moment — not just the most familiar one
  • For high-stakes management decisions with established standards of care, UpToDate Expert AI provides stronger editorial provenance
  • Pay attention to which AI your rotation system has embedded — the tool in the EHR is the tool you'll actually use
Figure 6 — Clinical decision support platform comparison: 2026 independent scoring
Fig. 6. Composite scores from 2026 independent platform evaluation. UpToDate Expert AI leads on total aggregate score (71/100) and clinical reasoning depth. OpenEvidence leads on internet search volume (98.7% of all searches among AI-enabled clinical reference tools). These serve different use cases.
Platform | Revenue / scale | OE advantage | OE vulnerability
UpToDate (Wolters Kluwer) | $595M revenue, $500/seat, 30 years of physician habit, no advertising | Free; faster synthesis; current guidelines; AI-native interface | UpToDate has no advertising conflict; 30+ years of institutional trust; human editorial curation
ChatGPT Health (OpenAI) | 800M weekly users; HIPAA-compliant; institutional distribution | Exclusive journal licensing (NEJM/JAMA) not available to OpenAI; physician behavioral dataset | OpenAI scale; improving clinical capabilities; no advertising conflict
Claude for Healthcare (Anthropic) | $19B ARR; 80% enterprise revenue; CMS, ICD-10, PubMed integrations | Licensed private content (NEJM/JAMA vs. public PubMed) | Anthropic's enterprise relationships and Cowork momentum; $380B valuation
Doximity (Doximity GPT) | $570M TTM revenue; 80%+ physician penetration; NYSE listed | Deeper clinical decision support; journal licensing; evidence synthesis | Doximity has larger physician network; acquired Pathway Medical ($63M); active litigation with OE
AMBOSS | Education-focused; knowledge depth; learning science | Broader workflow integration; real-time evidence | AMBOSS has deeper knowledge structure for learning; complementary not competitive
Platform | Threat vector | Risk rating
Epic (Art / Cosmos AI) | Epic's native AI scribe (Art) released February 2026 with ambient documentation and order suggestions. Cosmos AI trained on 8+ billion patient encounters. Over 200 AI features in development for 2026. Epic is the EHR for Mount Sinai (OE's flagship integration) — if Epic builds native evidence synthesis, OE becomes optional rather than embedded. | Critical
Microsoft / Dragon Copilot | OE listed as one of three content reference partners alongside Elsevier and UpToDate. Content vendor position inside Microsoft's platform is replaceable. Microsoft is building native clinical decision support capabilities and has deep Epic integration via Nuance. | High
Google Ventures / MedLM | GV is OE's Series B and C lead investor while Google builds a directly competing physician workflow AI. GV board access creates potential information asymmetry around OE's most sensitive asset — the physician behavioral query dataset. This is the most structurally unresolved risk in the entire analysis. | High + governance tension
Veeva (Open Vista) | Aligned — not a threat. Veeva is a monetization partner for behavioral data via 1,500+ pharma customers. This relationship converts OE from a margin compressor (between physician and hyperscaler) to a token multiplier for Veeva's infrastructure. | Aligned

See Section 9 for detailed discussion of Ochsner, LCMC Health, FMOL/OLOLRMC, and Lake Charles Memorial Hospital AI implementations and their relationship to OpenEvidence deployment.

Section 9

Louisiana Health Systems: What's Deployed and What It Means

Section at a glance
Your four rotation sites are each at a different point in their AI implementation journey. All run Epic. None has publicly announced an enterprise OpenEvidence contract. That means most physician use at your sites is individual, free-tier, non-BAA access — which has direct PHI and compliance implications for you as a trainee.
What the landscape offers
  • All four systems on Epic — creating interoperability infrastructure for future OE enterprise deployment
  • Ochsner's DeepScribe showing real deskilling prevention via ambient documentation (75% adoption, 3–4 min/note)
  • LCMC's Nabla and FMOL's Epic Art scribe are reducing documentation burden across your clinical environments
  • Louisiana MyChart Central statewide launch shows coordinated health IT investment
What requires your attention
  • No confirmed OE enterprise BAA at any of your four sites = individual accounts only = HIPAA exposure
  • LCMC's Nabla deployment raised patient consent and transparency concerns (reported Jan 2026)
  • Multiple AI tools across rotation sites creates inconsistent training and risk environments
  • LCMHS is in early AI exploration — least infrastructure support for safe AI use
What you should do at each site
  • Ochsner/LCMC/FMOL: Ask your supervisor if OE is covered under an institutional BAA before querying with clinical context
  • LCMHS: Assume no enterprise coverage — de-identify all queries
  • All sites: Ask your CMIO or informatics team what the AI governance policy is

The four health systems covered by this report encounter different AI technology landscapes. None of the four systems has publicly announced an OpenEvidence enterprise contract as of April 2026. But all are actively deploying AI in clinical workflows — predominantly ambient documentation — and all operate on Epic, creating the EHR infrastructure through which OpenEvidence can be accessed individually or (if a system contract is executed) enterprise-wide.

Louisiana Epic convergence

In October 2025, Ochsner Health, LCMC Health, Baton Rouge General, North Oaks Health System, FMOL Health, and Covington-based St. Tammany Health jointly launched Epic MyChart Central statewide — a unified patient portal login across all participating Epic organizations. This level of Epic integration across Louisiana health systems creates the interoperability infrastructure for enterprise OpenEvidence deployment, if any of these systems pursue it.

Ochsner Health

Ochsner is the largest nonprofit healthcare provider in Louisiana, operating 47 hospitals and 370+ health and urgent care centers, employing approximately 40,000 team members and 5,000 physicians, and treating 1.6 million patients annually. It is the largest academic medical center in Louisiana and the EHR market leader in the region, operating fully on Epic with AI Steering and Data Governance committees that review every AI deployment.

Ochsner's current AI deployment landscape is dominated by ambient documentation and predictive analytics. In July 2024, Ochsner signed an enterprise agreement with DeepScribe to deploy ambient AI documentation across all 4,700 clinicians at 46 hospitals and 370 centers. The pilot generated 75% clinician adoption during the initial launch, with one Ochsner nephrologist reporting documentation time reduced from "two to three hours a day to three to four minutes per note." An oncology NP noted the platform "captures way more than I'm able to, but writes it so succinctly."

Beyond ambient documentation, Ochsner uses AI for predictive sepsis detection, AI-powered radiologist diagnostic prioritization, pharmacy workflow automation for prior authorizations, AI-assisted patient messaging through Epic (piloted with 100+ clinicians), AI-driven appointment scheduling, and a suite of clinical AI agents for real-time health insights. AI tools for clinical use require mandatory training before access — this policy has expanded from voluntary to mandatory as use cases became more complex.

Ochsner and OpenEvidence: the relevant gap

Ochsner's AI Steering Committee reviews every tool against patient privacy, core values, and clinical safety criteria. The DeepScribe ambient documentation platform is integrated with Epic. OpenEvidence is not among Ochsner's publicly announced enterprise AI deployments. Individual physicians may be using it via free NPI-verified access. Any institutional deployment would require Steering Committee review, including an analysis of the pharmaceutical advertising model and the advertising-clinical content separation claim — the same analysis required at any academic medical center.

LCMC Health

LCMC Health is a New Orleans-based, not-for-profit system operating eight hospitals: University Medical Center New Orleans, Children's Hospital New Orleans (Manning Family Children's), East Jefferson General, West Jefferson Medical Center, Touro, Lakeview, Lakeside, and New Orleans East. It serves approximately 1.5 million annual patient visits with 2,800+ employed clinicians and operates in partnership with LSU Health Sciences Center and Tulane University School of Medicine.

LCMC reached HIMSS EMRAM Stage 7 (the highest EHR adoption certification) at University Medical Center and Children's Hospital. In December 2025, LCMC selected Nabla — a French ambient AI company — for a system-wide rollout integrated directly into its Epic EHR. Nabla captures clinician-patient conversations and automatically generates structured clinical documentation, with at-cursor dictation as an additional option. LCMC's CMIO Dr. Damon Dietrich has been explicit about the competitive rationale: "We had to get this to our doctors. We are mission-critical about this. We're going to lose doctors to our competitor."

LCMC's AI adoption was organized in three waves: employed doctors first (November 2025), residents and attending clinicians at affiliated Tulane and LSU academic programs second, and all remaining clinicians (including hesitant users) third. This phased approach means that as a trainee at LCMC Health affiliated with Tulane or LSU, you were likely included in Wave 2 of Nabla deployment.

Patient consent and transparency: the LCMC-Nabla context

In January 2026, Verite News reported that LCMC patients were not being explicitly told that their medical visits were being recorded and analyzed by Nabla's AI. LCMC cited Louisiana's one-party consent recording laws (requiring only provider consent, not patient consent, for recording). Nabla states it does not store audio and uses de-identified data. This episode illustrates a broader issue relevant to OpenEvidence: the gap between technical compliance and patient expectations of transparency. Academic institutions should document their consent and disclosure practices for any clinical AI tool, including OpenEvidence, before deployment.

FMOL Health / Our Lady of the Lake Regional Medical Center (OLOLRMC)

FMOL Health (Franciscan Missionaries of Our Lady Health System) includes Our Lady of the Lake in Baton Rouge — an 850-bed hospital, a primary teaching site for LSU School of Medicine GME programs, and an institution consistently named among the best hospitals nationally. OLOLRMC is LSU's Championship Health Partner and, in March 2026, performed Louisiana's first single-port transabdominal colorectal surgery. In 2022, it upgraded to a Level I trauma center — the only one in the Capital Region and one of three in Louisiana.

FMOL Health's CIO Will Landry told Becker's in August 2025: "FMOL Health has had a lot of success with ambient listening technologies." In early March 2026 — following a one-month pilot — FMOL signed an enterprise license for Epic's native AI Charting (the "Art" agent), making it one of the earliest adopters of Epic's own ambient scribe, released in February 2026. FMOL's ambulatory CMIO Dr. Bobby Dupre cited the native Epic integration ("the linkage with native Epic functionality is just hard to beat"), lower hallucination rates compared to other ambient AI tools, built-in provider note personalization, and lower long-term maintenance cost as the deciding factors.

FMOL previously held individual licenses for two other AI scribes before selecting Epic's native tool. The enterprise license covers FMOL's nine-hospital system.

Strategic implication for OpenEvidence at FMOL/OLOLRMC

FMOL's early adoption of Epic's native AI Charting is the clearest local example of the competitive dynamic this report identifies at the national level: Epic entering ambient documentation directly reduces the space for third-party ambient AI tools. At the same time, Epic's native Art agent handles documentation — it does not provide the evidence synthesis, clinical reference quality, and licensed literature access that OpenEvidence offers. The two tools serve different clinical moments and are likely complementary rather than mutually exclusive at the point-of-care level.

Lake Charles Memorial Hospital (LCMHS)

Lake Charles Memorial is the primary hospital serving southwest Louisiana. It completed an Epic EHR implementation (go-live) and as of 2025 began exploring AI initiatives including automated discharge summaries and care plans built on the Epic infrastructure. This is an earlier AI maturity stage than the larger New Orleans and Baton Rouge systems — the institution is in the "education and exploration" phase rather than the enterprise rollout phase.

Individual physicians at LCMHS likely use OpenEvidence independently through free NPI-verified access, consistent with the national pattern of bottom-up adoption that preceded any institutional contract at comparable facilities nationally. No enterprise OpenEvidence deployment at LCMHS has been publicly announced.

System | EHR | Ambient AI | OpenEvidence enterprise status | AI maturity
Ochsner Health | Epic (full) | DeepScribe (enterprise, 4,700 clinicians) | Not publicly announced — individual use likely | Advanced
LCMC Health | Epic (HIMSS Stage 7) | Nabla (enterprise, system-wide, Epic-integrated) | Not publicly announced — individual use likely; trainees in Wave 2 | Advanced
FMOL / OLOLRMC | Epic | Epic AI Charting "Art" (enterprise license, March 2026) | Not publicly announced | Advanced
Lake Charles Memorial | Epic (recent go-live) | Exploring AI initiatives — not yet enterprise ambient | Not publicly announced | Developing
Table 6. AI implementation status across Louisiana training sites as of April 2026. Data from public announcements and news reporting. No confidential information used.
Section 10

Regulatory, Legal, and HIPAA Considerations

Section at a glance
The legal framework is clear on one point: if you use an AI tool's output in patient care and something goes wrong, the liability is yours. No current legal framework assigns algorithmic malpractice to the software. For trainees, HIPAA compliance on non-BAA accounts is not a technicality — it is federal law.
What protects you
  • OpenEvidence is HIPAA-compliant with BAA available since April 2025
  • Enterprise accounts (if your institution has one) provide HIPAA coverage for PHI input
  • Platform provides citations — making your verification trail documentable
What exposes you
  • Without a BAA, any PHI you enter is your sole legal responsibility
  • No current "algorithmic malpractice" framework — clinical liability rests entirely with you
  • FDA may reclassify expanding agentic features as medical devices requiring clearance
  • Litigation with Doximity/Pathway is unresolved — legal basis for OE's trade secret claims is untested
What you must do
  • Confirm BAA status at every rotation site before entering any clinical context
  • De-identify all queries on free individual accounts — always
  • Document your independent clinical reasoning separately from AI-assisted steps
  • Never represent AI output as your own independent clinical judgment in notes

OpenEvidence achieved full HIPAA compliance in April 2025. Covered entities can securely input protected health information, provided the hospital system has executed a Business Associate Agreement (BAA) with OpenEvidence. For individual physicians and trainees using free NPI-verified accounts without a BAA — the default situation for most users — the Privacy Policy explicitly states that any PHI submitted is deemed unintentional and remains the "sole responsibility of the user, for which OpenEvidence disclaims all liability."

PHI hygiene for trainees — non-negotiable

Unless you are accessing OpenEvidence through an enterprise HIPAA-covered environment with a formal Business Associate Agreement — such as the Mount Sinai Epic integration or a formally contracted equivalent at your rotation site — do not enter any patient-identifying information into OpenEvidence. Entering identifiable patient data through a free individual account invites severe HIPAA violations and organizational liability. De-identify all queries before submission. This is not a preference; it is a legal requirement.
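
As a habit-forming illustration only, a few lines of scrubbing before a query leaves your machine look like the sketch below. A handful of regular expressions is nowhere near a complete de-identification method under HIPAA Safe Harbor; the patterns are examples, and nothing replaces simply leaving identifiers out of the query in the first place.

```python
# Illustrative scrubbing pass for free-tier queries -- NOT a complete or
# compliant de-identification method. Patterns shown are examples only.
import re

PATTERNS = (
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),   # dates (DOB, visits)
    (re.compile(r"\bMRN[:\s]*\d+\b", re.I), "[MRN]"),         # medical record numbers
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # SSN-shaped strings
    (re.compile(r"\b\d{10}\b"), "[10-DIGIT-ID]"),             # phone/NPI-shaped strings
)

def scrub(query: str) -> str:
    for pattern, token in PATTERNS:
        query = pattern.sub(token, query)
    return query

print(scrub("72F, MRN 4821937, DOB 03/14/1953, dyspnea on exertion"))
# -> "72F, [MRN], DOB [DATE], dyspnea on exertion"
```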

FDA regulatory positioning

OpenEvidence currently positions itself as a "support" tool that does not "offer diagnosis or treatment" — a classification that generally avoids FDA premarket notification requirements for higher-risk devices. As the platform expands into DeepConsult agentic reasoning, order-set recommendations, differential diagnosis generation, and Coding Intelligence MDM rationale written into permanent clinical notes, the gap between the regulatory positioning and the actual clinical function narrows. In January 2026, the FDA issued guidance reducing oversight of certain low-risk AI tools while simultaneously requiring clinical decision support tools to be designed so clinicians can evaluate and question AI recommendations rather than accept them automatically. The FDA regulatory ceiling for OpenEvidence's expanded product suite has not been tested.

Litigation: OpenEvidence v. Pathway Medical and Doximity

In February 2025, OpenEvidence sued Pathway Medical (a Canadian company) for trade secret misappropriation, alleging that Pathway used stolen NPI credentials to conduct "prompt injection attacks" to extract OpenEvidence's proprietary system prompts and architecture. In July 2025, Doximity acquired Pathway Medical for $63 million. In June 2025, OpenEvidence filed a separate suit against Doximity, alleging Doximity engineers posed as doctors to extract proprietary code via prompt injection. Doximity counter-sued, alleging false claims used as self-promotion. Bilateral litigation is ongoing.

A federal judge dismissed the original Pathway lawsuit in June 2025; OpenEvidence filed an amended complaint in August 2025 reframing the allegations as "an elaborate conspiracy." The case is now a groundbreaking test of whether prompt injection through a public interface constitutes trade secret misappropriation under the Defend Trade Secrets Act — a question no court has yet resolved.

Medico-legal liability

OpenEvidence's Terms of Use place the entire burden of clinical judgment on the human end-user. There is currently no legal framework for algorithmic malpractice. If a physician or resident relies on a hallucinated or misinterpreted guideline from OpenEvidence and patient harm results, the liability rests on the human physician for failing to meet the standard of care. The software developers and the corporate entity are shielded from clinical liability. This is not a hypothetical risk — it is the current legal reality for every AI tool in clinical use.

Section 11

Strategic Position and Structural Durability

Section at a glance
OpenEvidence has two assets no competitor can quickly replicate — the licensed journal corpus and the physician behavioral dataset. Everything else it calls a moat is either already commoditized or has a closing window of 12–24 months. The Google Ventures board access is the single most important unresolved structural risk.
Genuinely durable assets
  • NEJM/JAMA/NCCN licensing — individually negotiated, institutionally trusted, structurally hard to replicate
  • Physician behavioral query dataset at 20M+ consultations/month — captures what physicians don't know, in real time
  • NPI-verified prescriber identity — a CPM premium that Google and OpenAI cannot manufacture quickly
  • Veeva Open Vista — the clearest example of the company monetizing its data asset via an aligned partner
What is weaker than claimed
  • Citation-grounded RAG — every major competitor already demonstrates this
  • Hallucination reduction methods — GPT-5 class models close this gap within 12–18 months
  • USMLE 100% — a benchmark test, not a clinical performance validation
  • GV is both an investor and a competitor via MedLM — the governance tension is unresolved
What this means for your institution
  • OpenEvidence's content advantage over ChatGPT Health and Claude for Healthcare is real today — but ask vendors to show you how that gap holds in 18 months
  • Any institution considering an enterprise contract should request GV's governance documentation before signing
  • Watch whether Epic's Cosmos AI acquires guideline licensing — that is the signal that the structural moat is narrowing

What OpenEvidence genuinely owns

Two assets are structurally durable in ways that competitors cannot easily replicate:

The physician behavioral query dataset. Twenty million monthly clinical consultations from 760,000 NPI-verified healthcare professionals at the actual point of care generates data on what physicians are uncertain about — in real time, by specialty, by query type, by institution. This is structurally different from PubMed searches, patient health data, or consumer health queries. It captures clinical uncertainty, not clinical knowledge. This dataset cannot be reconstructed retroactively by any competitor who lacks the physician distribution scale.

Exclusive journal licensing agreements. NEJM, JAMA (all 11 specialty journals), AMA, NCCN, ACC, Cochrane, Wiley, and multiple specialty societies — each required individual institutional negotiation. The NEJM's naming of OpenEvidence as "best AI tool for medical information" reinforces the licensing relationship: NEJM now has a reputational stake in OpenEvidence's clinical performance. These institutions have a structural interest in OpenEvidence succeeding specifically because OpenEvidence does not train competing frontier models on their content — unlike OpenAI, Anthropic, and Google, all of which would be potential licensees with direct competitive conflicts.

What is being described as a moat but isn't

Citation-grounded RAG over medical literature is table stakes — every major competitor can demonstrate it. Hallucination reduction methods are differentiated today, but GPT-5 class models will likely close that gap within 12–18 months. Physician brand trust is real but fragile: it holds only as long as no credibility incident occurs. The USMLE 100% benchmark is impressive but tests a fundamentally different capability from the subspecialty reasoning physicians actually need.

Figure 7 — Defensibility heat map: OpenEvidence moat assets
Fig. 7. Asset defensibility tiers. Tier 3 (structural moat): journal licensing, physician behavioral dataset, verified prescriber identity. Tier 2 (time-limited): brand trust, subspecialty routing architecture, EHR workflow integrations. Tier 1 (already commodity): RAG architecture, drug interaction aggregation, hallucination reduction methods, business model structure.

The Google Ventures governance question

Google Ventures led both OpenEvidence's Series B and Series C. Google's MedLM directly targets the same physician workflow. GV board access creates potential information proximity to OpenEvidence's most sensitive and valuable asset — the physician behavioral query dataset. If GV's board materials include meaningful information about how that dataset is structured, queried, or monetized, the structural risk rating shifts from moderate to high. This is the single most important unresolved structural question in any assessment of OpenEvidence's strategic position, and no public information resolves it.

The two races OpenEvidence is currently running

Race 1 — EHR embed
Getting deep enough inside Epic clinical workflows — at 10+ major health systems — before Epic builds native clinical AI equivalents that make OpenEvidence optional. Current state: one confirmed major deployment (Mount Sinai). Timeline pressure: Epic's Art agent was released in February 2026 and has already been adopted enterprise-wide by FMOL Health.
Race 2 — Open Vista revenue
Getting Veeva Open Vista to commercial revenue before hyperscalers achieve institutional physician distribution through their own healthcare products. Current state: announced October 2025, with first products expected in 2026; as of April 2026 it has generated no reported revenue.

Overall strategic rating: Moderate Risk — Improving. The position is not yet durable. It becomes durable if both races resolve favorably. It becomes high risk if the Google Ventures governance question resolves unfavorably, if the EHR race stalls at pilot stage, or if physician consultation growth plateaus before the second revenue stream is material.

How durable is OpenEvidence, really? A structural assessment

As physicians and educators evaluating whether to trust, teach, or institutionally endorse this platform, the question of sustainability is not academic. A tool embedded in clinical workflows that becomes commercially compromised, acquired, or displaced by a better-funded competitor creates real disruption — to your trainees, your programs, and your governance obligations. What follows is an honest assessment of where OpenEvidence is strong, where it is fragile, and what signals to watch.

Why this matters to clinical faculty

The analysis below draws on a structured business evaluation framework used in technology investment. The reason it matters here is not because you are investors — it is because the platform's commercial incentive structure directly determines how it behaves in your clinical environment. A tool with a fragile business model or a compromised advertising relationship does not stay neutral. Understanding where the money comes from, and how durable it is, is part of responsible AI adoption.

Five dimensions of structural strength

A useful way to assess any AI platform's durability is to examine five structural dimensions: how much physicians trust it, how much contextual data it accumulates, how well it distributes to users, how differentiated its content is, and how it manages liability. OpenEvidence scores unevenly across these — and the gaps tell you something important about where the risks concentrate.

Trust — strong but borrowed. OpenEvidence's clinical credibility is high: 40%+ of U.S. physicians use it daily, it is HIPAA BAA-compliant, it is NPI-gated, and it is backed by Mayo Clinic as investor and partner. But the trust physicians place in its answers is largely a transfer from NEJM, JAMA, and NCCN — the sources it cites — rather than trust in OpenEvidence's own editorial judgment. This distinction matters: if a high-profile hallucination surfaces, or if the advertising model becomes publicly visible in a damaging way, the trust has no independent foundation to fall back on. It is strong today and fragile structurally.

Context — wide but shallow. The platform accumulates an extraordinary behavioral dataset — what physicians are uncertain about, at the moment of care, by specialty and query type. This is genuinely valuable. The limitation is that it does not own longitudinal patient data. The deep context moat — the actual EHR data — lives inside Epic and Cerner. OpenEvidence knows what your residents are asking. It does not know what happened to the patient afterward. That limits how far its clinical reasoning can evolve without an EHR partnership.

Distribution — the real asset. This is where OpenEvidence is structurally strongest. It became the default attention layer for U.S. physicians faster than any comparable platform in history. The free-to-physician model bypassed hospital procurement entirely, achieving 65,000+ new verified clinicians per month at zero institutional friction. Pharmaceutical CPMs of $70–$1,000+ confirm that the distribution is genuinely valued — pharma pays that premium because OpenEvidence delivers credentialed prescribers at the precise moment of clinical decision. No social media platform or general consumer health tool achieves this specificity. The vulnerability is that it was built on zero switching cost. Habit is not a contract.

Taste — curated but replicable. The platform's specialty-specific AI architecture and licensed journal curation are meaningfully better than open-internet AI tools. But taste in this context is largely an editorial architecture decision — it reflects the quality of source selection, not proprietary judgment at the point of output. A well-funded competitor with the same licensing agreements could replicate this approach. It is a real differentiator today with a closing window.

Liability — the most structurally thin dimension. OpenEvidence explicitly disclaims clinical responsibility. No licensed professional is accountable for the answers it generates. The platform's legal language is explicit: it "shall not be considered an endorsement" and outputs are "not a substitute for professional medical advice." This is the correct legal posture for the company — but it has a direct implication for clinicians. You carry the liability for any clinical decision made using this tool, regardless of what it told you. The platform's commercial incentive to disclaim liability aligns with your professional obligation to verify — but that alignment is not the same as protection.

Is it at risk of being displaced?

The direct answer is: yes, and the risk is more specific than the market appreciates. OpenEvidence is not a pure middleware play — the distribution position is real and the journal licensing creates genuine friction for replication. But the core product — AI-synthesized answers from medical literature — is exactly the capability that OpenAI Health, Google's MedPaLM successors, and Anthropic's enterprise health offerings are actively building toward. The model is not the moat. The physician habit layer and the journal licensing agreements are the moat.

The critical vulnerability is this: if a foundation model provider signs NEJM and JAMA — or simply negotiates the same content deals — and distributes through a channel physicians already trust (Epic's ambient AI, a GPT-4o health plugin, or a hospital enterprise contract), OpenEvidence's user base is one UX update away from erosion. The free model was brilliant for distribution but left switching costs at zero. That is the core tension.

The Google Ventures problem

Google Ventures led OpenEvidence's Series B and Series C funding rounds. Google's MedLM product directly targets the same physician workflow. GV board access means potential information proximity to OpenEvidence's most sensitive asset — the physician behavioral query dataset. This is the single most important unresolved structural question in any assessment of OpenEvidence's long-term independence. No public information resolves it. If you are at an institution considering an enterprise contract, this governance question should be on your due diligence list.

What OpenEvidence is likely to do next — and what it means for you

The following are the five most probable strategic moves OpenEvidence will make to entrench its position. Each is framed not as investment analysis but as a signal worth watching — because each move changes the nature of your relationship with the platform.

Move 1: Persistent physician profiles. OpenEvidence is likely to build persistent physician profiles — specialty-verified query history, CME integration, peer benchmarking ("how does your prescribing pattern compare to similar oncologists?"). The stated purpose will be clinical utility: continuity, personalization, learning. The structural effect is that leaving the platform becomes costly because you lose your history. What to watch: any feature that makes your query log feel like a professional record. Once that data accumulates, it creates a switching cost that free alternatives cannot match — and it deepens the behavioral dataset OpenEvidence sells to pharma partners.

Move 2: Exclusive subspecialty guideline licensing. OpenEvidence already holds NEJM, JAMA, NCCN, ACC, and AMA licensing. The next tier is subspecialty society guidelines in high-liability fields — oncology, cardiology, nephrology — where the guideline is the standard of care. If these are locked as exclusive or first-look partnerships, a competitor AI cannot answer the same question with the same authority even if its model is better. What to watch: announcements from your specialty society about AI content partnerships. If your society's guidelines go exclusively to one platform, it matters for how you evaluate alternatives.

Move 3: Closed-loop prescriber analytics. The current model shows pharmaceutical ads. The next model tracks whether the physician who queried "second-line EGFR therapy" actually ordered it, and sells that outcomes loop to pharma as closed-loop prescriber marketing intelligence. This is not speculative — it is the same model Doximity built with its prescriber data, generating $570M in annual revenue. What to watch: any feature described as "outcomes tracking," "post-query analytics," or "prescribing pattern benchmarking." If the platform can connect your query to your prescribing behavior, the commercial value of your usage increases dramatically — and the conflict of interest deepens correspondingly.

Move 4: Enterprise institutional embedding. The free model captured individual physicians. The next layer is institutional embedding — where a CMO mandates OpenEvidence as the CDS tool and it appears as a line item in the operating budget. Enterprise contracts create contractual switching costs, audit trails, and institutional reporting that individual habit does not. What to watch: whether your institution is approached about an enterprise contract, and if so, what data-sharing provisions are embedded. An institutional contract with query-level reporting means OpenEvidence can see aggregate clinical uncertainty patterns across your entire physician workforce — which is operationally valuable to you and commercially valuable to them.

Move 5: Specialty-specific verticals with society co-branding. A horizontal product — one interface for all physicians — is easier to displace than 30 specialty-specific surfaces, each co-branded with the relevant professional society. If "OpenEvidence Oncology" is cited in tumor boards as the reference standard, or "OpenEvidence Cardiology" carries ACC co-branding, a competitor must unseat OpenEvidence in 30 subspecialty markets simultaneously rather than once. What to watch: specialty-specific product launches and society co-branding announcements. Each one represents both a genuine clinical improvement (more curated content) and a deeper entrenchment in that specialty's workflow.

The 10x model test: what happens when AI gets dramatically better?

A useful stress test for any AI platform is to ask: what survives when the underlying model gets ten times more capable — for free — from a frontier provider? For OpenEvidence, the answer is uncomfortable and worth understanding before your institution deepens its dependence on the platform.

What survives a 10x model upgrade: The physician identity graph — 40%+ of U.S. physicians verified and habituated — survives because it is about who the platform reaches, not how good the AI is. The pharmaceutical advertising channel survives for the same reason: CPM premium is a function of audience specificity, not model quality. The journal licensing exclusivity survives if it has been locked before competitors negotiate equivalent deals.

What gets threatened: The core query product — DeepConsult, the evidence synthesis engine, the clinical reasoning layer — is exactly what a GPT-5 class model will match or exceed without requiring OpenEvidence's proprietary architecture. Any differentiation OpenEvidence built on top of raw model capability is at risk of commoditization within 18–24 months.

The perverse incentive no one is talking about

OpenEvidence's advertising model creates a structural tension that a 10x model upgrade sharpens rather than resolves. A better model gives cleaner, faster answers — which reduces the surface area for ad placement per query. The product's growth and the business model's health may already be in tension: the more useful the AI becomes, the less time physicians spend browsing, and the fewer impressions pharma pays for. Watch whether OpenEvidence responds to this by increasing ad density, embedding ads more deeply in the answer rather than alongside it, or shifting toward outcomes-based pricing. Any of those moves would represent a material change in how commercial incentives intersect with clinical answers.

The bottom line for clinical educators: OpenEvidence is not going away in the next 12–18 months. Its distribution position is real, its content licensing is genuinely differentiated, and its physician adoption is too deep to unwind quickly. But it is not immune to displacement, and the commercial pressures that will intensify as it pursues profitability are directly relevant to the objectivity of the answers it returns. The appropriate institutional posture is: use it deliberately, verify consistently, and watch the business model as closely as you watch the benchmarks.

Section 12

Guidance for Medical Educators, Program Directors, and Trainees

Section at a glance
The tool is already in your residents' pockets. The question is whether your program has a deliberate framework for using it or is letting individual habit determine how clinical AI enters training. Three behaviors — sequence, verification, and PHI hygiene — determine whether OpenEvidence helps or harms trainee development.
When it helps trainees
  • Used after independent differential formulation — as a gap check, not a first answer
  • Citations traced to primary source — builds evidence appraisal skill rather than shortcutting it
  • Prompt engineering taught explicitly — better queries, better outputs, better learning
  • Used for common conditions where reliability is documented as high
When it harms trainees
  • Queried before the trainee has formed any independent assessment — automation bias in formation
  • Used for subspecialty or complex presentations without mandatory verification
  • PHI entered without confirmed BAA — direct legal and reputational exposure
  • AI-generated MDM rationale signed in coding without independent review
The three program-level requirements
  • Sequence rule: Require trainees to formulate differential first; query second — enforce this in teaching rounds
  • BAA confirmation: Post the institutional BAA status at every rotation site so trainees do not guess
  • Verification requirement: At least one cited source per AI-assisted plan must be read in full — not just the synthesis

For program directors and department chairs

Before endorsing institutional use or integrating OpenEvidence into a formal curriculum, department leadership should document answers to the following questions:

1. Does the advertising system display inside the EHR-embedded version of OpenEvidence at your institution? If you are at a site with an enterprise contract (e.g., if your system follows the Mount Sinai model), confirm whether pharmaceutical advertising appears in the enterprise workflow. This determines whether the conflict-of-interest analysis applies to your institutional deployment.

2. Has your institution executed a Business Associate Agreement with OpenEvidence? Without a BAA, trainees must treat the platform as a non-HIPAA-covered environment and strictly de-identify all clinical queries.

3. What is the data governance position for clinician query data? OpenEvidence's Privacy Policy permits query data from non-BAA users to be used for product improvement. Confirm whether your institutional BAA (if it exists) restricts this use.

4. Has your P&T committee or compliance team reviewed the advertising-content separation claim? At a minimum, document that the claim has been reviewed and the Practice Fusion precedent has been considered.

Why this matters — the technology framework

OpenEvidence is not a passive reference tool. It is a retrieval-augmented generation system that takes your query, searches a licensed corpus of 35 million publications, retrieves semantically relevant chunks, and synthesizes a response using a large language model. That architecture has a specific failure geometry: it is calibrated on the distribution of medical literature, which means it performs well on the center of that distribution — common conditions, well-studied drugs, published guidelines — and poorly at the tails, which is precisely where clinical education is most consequential. Residents encounter tail cases. Board examinations test tail cases. Complex inpatients are tail cases.

The deeper pedagogical issue is that the model does not know when it is at a tail. It generates confident, well-formatted, citation-supported text regardless of whether the retrieval surface was rich or sparse. A trainee who has not yet learned to distinguish a confident synthesis of strong evidence from a confident synthesis of weak evidence cannot detect this from the output alone. They must go to the source. The skill of going to the source — and of knowing when to do so — is exactly what residency training is supposed to build. OpenEvidence, if misused, shortcuts precisely that skill.

There is a second technology-driven concern specific to teaching: automation bias compounds faster in junior learners than in experienced clinicians. An attending physician who encounters an OpenEvidence answer that contradicts their clinical gestalt will push back. A PGY-1 who has not yet developed a clinical gestalt has no internal counterweight. The AI answer becomes the reference point, not the check against one. This is the deskilling mechanism — not that the tool gives wrong answers often, but that it removes the productive uncertainty through which clinical pattern recognition develops.

How to build the framework into teaching

The pedagogical sequence that preserves clinical reasoning while capturing the tool's genuine utility is sequential, not concurrent. Require trainees to formulate a complete differential diagnosis and initial management plan independently before querying the platform. Then use OpenEvidence to audit that plan — checking for guideline updates, rare etiologies the trainee may have omitted, or recent trial data that changes a standard recommendation. This sequence is not a workaround. It is epistemically correct: the AI functions as a fast literature check on a hypothesis already formed, not as the origin of the hypothesis.

For evidence appraisal specifically, require residents to trace at least one OpenEvidence citation per query to the primary source. They should read the methods section and ask: What was the study population? Does it include patients like mine? What was the comparison arm? Was this a pre-specified analysis or a subgroup? This is not a burdensome requirement — it takes five minutes. But it converts the platform from an answer machine into a navigation tool for the primary literature, which is what evidence-based medicine training requires.

For teaching the technology framework itself, consider introducing OpenEvidence explicitly as a RAG system in orientation. Explain that it retrieves from a corpus, synthesizes with a language model, and links citations deterministically — but that citation presence does not guarantee citation accuracy or clinical applicability. Show trainees the BiPAP case: the platform retrieved a real trial, cited it accurately, and presented its cohort-specific parameters as universal recommendations. Ask them to find the gap. That exercise teaches more about clinical AI literacy than any policy document.

Three concrete curriculum structures

Daily rounds structure: Before morning rounds, residents prepare presentations independently — differential, assessment, plan — without AI tools. After presentations, the team uses OpenEvidence together to check for guideline currency on one key question per patient. The attending frames this as "let's see what the literature says" not "let's see what the AI says." This framing matters: it keeps the tool in the role of literature retrieval, not clinical authority.

Journal club integration: Assign residents to query OpenEvidence on the journal club paper's clinical question before reading the paper. Then read the paper. Compare what the AI synthesized to what the actual trial found. This exercise reliably surfaces FM-4 (guardrail miscalibration) — the platform often synthesizes prior literature and misses the nuance of the paper being discussed, even when that paper is in its licensed corpus.

Subspecialty rotation structure: At the start of any subspecialty rotation, have the fellow or attending generate five "tail case" queries — complex, atypical, or rare presentations from their specialty's boards. Run them through OpenEvidence. Review the accuracy together. This calibrates the trainee's trust in the tool for that specialty before they use it independently in clinical decision-making.

For residents, fellows, and students

OpenEvidence is a high-speed medical librarian, not an attending physician. Every output should be treated as a starting point for verification, not a definitive answer. The platform's citation links exist specifically so you can click through to the primary source. Use them, especially for:

  • Any dosing or drug interaction recommendation
  • Any guideline recommendation that will change your management
  • Any query about a recently approved or recently updated treatment
  • Any subspecialty query involving a complex or atypical presentation

Remember: the platform scored 21–34% on complex subspecialty board questions. At the subspecialty level, the answer on the screen is wrong more often than it is right, and the platform will not tell you when that is the case.

Unless you are accessing OpenEvidence through an enterprise-contracted HIPAA-covered environment at your rotation site (confirmed, not assumed), the following rules apply:

  • Do not enter any patient name, date of birth, MRN, or other identifying information into OpenEvidence queries.
  • Do not copy clinical notes or problem lists into the search field without removing all identifiers.
  • Use clinical descriptors only: "63-year-old male with CKD stage 3 and newly started NSAID" — not the patient's name and MRN.

The OpenEvidence Privacy Policy states that PHI submitted through non-BAA individual accounts is deemed unintentional and is the sole responsibility of the user. This is not a technicality. This is your liability as a trainee.

Query phrasing substantially affects output quality. Broad, open-ended questions give the model more room to interpolate — including into domains where evidence is sparse. Constrained, specific queries produce more reliable results.

Better: "Summarize only the 2025 ACC/AHA guideline recommendations for anticoagulation in non-valvular atrial fibrillation in a patient with CKD stage 4"

Worse: "What anticoagulant should I use in a patient with A-fib and kidney disease?"

Including specific temporal parameters (guidelines updated in the last 2 years), specific society names (NCCN, ACC, CHEST), or specific study design preferences (RCT only, systematic review only) constrains the retrieval and reduces the risk of the model synthesizing across evidence of widely varying quality.
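
To make the habit concrete, the constraints above can be treated as a simple template. The sketch below is a hypothetical helper for composing a constrained query before pasting it into the search box; it is not an OpenEvidence feature, and the parameter names are invented for illustration.

```python
def build_constrained_query(question, society=None, years_back=None, study_design=None):
    """Assemble a point-of-care query that narrows retrieval scope.

    All parameters are optional illustrative constraints: a guideline body
    (e.g., "ACC/AHA"), a recency window in years, and a preferred study design.
    """
    parts = [question.strip()]
    if society:
        parts.append(f"Restrict to {society} guideline recommendations.")
    if years_back:
        parts.append(f"Only cite sources published in the last {years_back} years.")
    if study_design:
        parts.append(f"Prefer {study_design} evidence; state explicitly if none exists.")
    return " ".join(parts)

# Example: the constrained phrasing recommended above
print(build_constrained_query(
    "Anticoagulation in non-valvular atrial fibrillation with CKD stage 4",
    society="2025 ACC/AHA",
    years_back=2,
    study_design="randomized controlled trial or systematic review",
))
```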

A practitioner who trained using UpToDate's full-text, manually curated, evidence-cited articles followed PubMed links, read methodology sections, understood why a guideline recommends what it recommends, and developed the habit of sitting with diagnostic uncertainty before resolving it. OpenEvidence compresses that process to 30 seconds.

For an experienced attending managing a common condition, this is efficiency. For a PGY-1 building clinical reasoning frameworks for the first time, this is a potential shortcut through the cognitive work that builds judgment. The Cambridge Health Alliance study found OpenEvidence primarily reinforced pre-existing physician hypotheses rather than redirecting clinical reasoning — which means it is unlikely to catch a wrong differential you've already committed to.

The question to ask yourself every time you open OpenEvidence is: Am I using this to check my thinking, or to replace it? The first is appropriate. The second is where the deskilling risk lives.

Section 13

Unresolved Questions

Section at a glance
Eight questions have no publicly available answer as of April 2026. These are not gaps in this report — they are gaps in what the company and the research community have published. Institutions making deployment decisions must either obtain answers through direct vendor engagement or accept those uncertainties as residual risk.
Questions answerable by the vendor
  • Does pharma advertising display inside the Epic-embedded enterprise version?
  • What are the exact terms of the Mount Sinai and Sutter Health EHR agreements?
  • What is the BAA scope relative to patient-context-aware queries inside EHR workflows?
  • Is there an independent audit of the advertising-content separation claim?
Questions that require independent research
  • Does GV board access create information exposure around the behavioral dataset? (structural, not vendor-disclosable)
  • Does OpenEvidence use shift prescribing behavior in aggregate? (requires longitudinal study)
  • Does OE use impair evidence appraisal skill in trainees over time? (requires prospective GME study)
  • Does Coding Intelligence MDM rationale meet AMA complexity criteria deterministically? (requires compliance audit)
What your institution should do
  • Request written answers to the four vendor-answerable questions before any enterprise contract
  • Flag the prescribing behavior and trainee deskilling questions to your GME research committee — these are publishable studies
  • Document the unresolved questions explicitly in your AI governance review — accepting known unknowns is different from ignoring them

The following questions are not answered by publicly available information as of April 2026:

Question | Why it matters | Gap type
Does pharmaceutical advertising display inside the Epic-embedded version at Mount Sinai — and by extension, at any enterprise deployment? | Determines whether the ad-enterprise contradiction has been resolved in practice or just in theory | Governance
What are the contractual terms of the Mount Sinai and Sutter Health EHR agreements? | Revenue share, exclusivity, feature scope, and term length determine how deep the workflow ownership actually is | Commercial
Has Google Ventures' board access created any information exposure around the physician behavioral query dataset? | Described as the most important unresolved structural question in the analysis. If GV board materials include meaningful dataset information, the risk rating shifts from moderate to high. | Governance
Has the advertising system been independently audited for separation from clinical response generation? | The company's self-attestation is insufficient at $12B valuation and 760K physician users | Regulatory
What is the actual chunking strategy for medical literature — particularly for structured documents like NCCN algorithms? | Chunking is the most consequential RAG design decision and the most common source of retrieval failure. Not publicly described. | Technical
Are Coding Intelligence MDM rationale outputs validated against AMA MDM complexity criteria deterministically? | A hallucinated MDM rationale that reads like a real one will pass every citation guardrail while potentially constituting a compliance violation | Clinical / compliance
What are the longitudinal effects of OpenEvidence use on prescribing behavior? | 20M+ monthly consultations with query-level data on what drugs physicians ask about at the moment of prescribing consideration. This data exists inside OpenEvidence and has not been published. | Public health
Does OpenEvidence use affect evidence appraisal skill development in trainees longitudinally? | At 40%+ of U.S. physicians using it daily — many of whom are residents — this is a medical education infrastructure question with no published prospective data | Medical education
Table 7. Material unresolved questions as of April 9, 2026. These are not gaps in this report — they are gaps in publicly available information.
Section 14

RAG Architecture: A Technical Assessment

Section at a glance
OpenEvidence looks stronger than most competitors on raw retrieval quality — but low-80s on a medical QA benchmark is not a number to cite uncritically in a clinical setting. The key RAG metrics — recall, citation precision, source-selection quality — have not been published. The graph RAG claim is architecturally interesting but not validated by performance data. Without those numbers, "the AI is grounded in peer-reviewed literature" is a design claim, not a clinical guarantee.
Where the architecture helps you
  • Domain-specific training means clinical terminology is interpreted correctly (e.g., "significant" = statistical, not colloquial)
  • Computer vision for figures and tables — can retrieve forest plots and treatment flowcharts, not just prose
  • Deterministic citation linking prevents unsourced generation for common queries
  • Graph RAG — if working as claimed — handles multi-concept clinical queries better than vector-only retrieval
Where the architecture has gaps
  • Chunking strategy for structured documents (NCCN decision trees, Cochrane GRADE tables) is not publicly described
  • No published recall@k, citation precision, or source-selection quality metrics
  • Graph RAG performance on MedXpertQA (34%) does not match the multi-hop capability claimed
  • Both OE and competitors are closed systems — independent technical audits do not exist
What this means when you use it
  • Treat architecture claims as design statements, not clinical validation
  • For queries crossing multiple clinical domains (e.g., genetic marker → drug metabolism → comorbidity presentation), require primary source verification regardless of how confident the synthesis reads
  • A well-cited answer is not the same as a correctly sourced and applicable answer — read the cited paper, not just the synthesis

This section applies a RAG evaluation framework to OpenEvidence — not as background reading, but as an evaluative lens. The core question is not whether OpenEvidence uses RAG; every major competitor does. The question is how it is implemented across chunking strategy, embedding design, retrieval depth, and citation verification, and where the public evidence is too thin to support confident claims.

Analyst framing

Judged on raw retrieval quality, OpenEvidence looks stronger than most alternatives in this space. Low-80s accuracy on an end-to-end medical QA benchmark is respectable. But in a clinical setting, that number alone is not sufficient grounds for unqualified confidence — especially without published retrieval metrics such as recall@k, citation precision, or source-selection quality. UpToDate is stronger as a curated editorial reference, but it does not expose a retrieval system in the same way. These are different architectures solving different problems.
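
For readers who want to know what those missing numbers would actually measure, the sketch below computes recall@k and citation precision against a hand-labeled gold set. The document identifiers and labels are invented for illustration; no such labeled evaluation set has been published for OpenEvidence.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the gold-standard relevant documents found in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def citation_precision(cited_ids, supports_claim):
    """Fraction of cited documents that actually support the generated claim.

    `supports_claim` maps a document id to a human adjudication (True/False).
    """
    if not cited_ids:
        return 0.0
    return sum(supports_claim.get(doc_id, False) for doc_id in cited_ids) / len(cited_ids)

# Toy example with invented identifiers
retrieved = ["nejm_2024_001", "jama_2023_114", "cochrane_2022_007", "acc_2025_guid"]
relevant = ["nejm_2024_001", "acc_2025_guid"]
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: one of two relevant docs appears in the top 3

adjudication = {"nejm_2024_001": True, "jama_2023_114": False}
print(citation_precision(["nejm_2024_001", "jama_2023_114"], adjudication))  # 0.5
```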

What RAG actually means in this context

Standard retrieval-augmented generation converts a user query into a high-dimensional vector embedding, searches a database for semantically similar text chunks, retrieves those chunks, injects them into a large language model prompt, and generates a synthesized response. The quality of the output depends almost entirely on three design decisions that happen before the language model sees anything: how the source documents are chunked, what embedding model converts text to vectors, and how the retrieval step selects which chunks to surface.
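
A minimal sketch of that generic pipeline is shown below, assuming an off-the-shelf embedding step and a toy in-memory corpus. It illustrates the standard retrieve-then-generate pattern described above, not OpenEvidence's proprietary implementation; all sources and text are invented.

```python
import numpy as np

def embed(text):
    """Stand-in embedding function. A production system would call a trained
    embedding model; here tokens are hashed into a fixed-size vector for illustration."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query by cosine similarity."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: float(np.dot(q, embed(c["text"]))), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved_chunks):
    """Inject the retrieved chunks (with citation ids) into the generation prompt."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in retrieved_chunks)
    return (
        "Answer the clinical question using only the cited context.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer with inline citations:"
    )

# Toy corpus with invented sources; a real system retrieves from millions of licensed chunks.
corpus = [
    {"source": "NEJM-2024-Afib", "text": "Apixaban reduced stroke risk versus warfarin in non-valvular atrial fibrillation."},
    {"source": "ACC-2025-AC", "text": "Guideline recommends DOACs over warfarin for most patients with non-valvular AF."},
    {"source": "JAMA-2023-CKD", "text": "Dose adjustment of apixaban is required in advanced chronic kidney disease."},
]
question = "Anticoagulation choice in AF with CKD stage 4"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)  # This prompt would then be sent to a large language model for synthesis.
```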

In medical literature, each of these decisions is non-trivial. A NEJM randomized controlled trial report mixes methods, baseline characteristics, results tables, subgroup analyses, and discussion in a structure that does not naturally align with fixed-size chunking. An NCCN treatment algorithm is a decision tree, not prose — chunking it by token count destroys the conditional logic ("if HER2+ and prior anthracycline exposure, then...") that makes the guideline clinically useful. A Cochrane systematic review has explicitly graded evidence quality (GRADE methodology) in a structured summary format that carries more clinical weight than the narrative text. Splitting any of these at arbitrary boundaries — the "torn textbook problem" in RAG literature — is the most common source of retrieval failure and subsequent hallucination.
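
The sketch below contrasts naive fixed-size chunking with a structure-aware alternative on a guideline-style conditional, to show how a chunk boundary can separate a recommendation from the condition that qualifies it. The branch text is invented and is not actual NCCN content.

```python
def fixed_size_chunks(text, size=60):
    """Naive chunking: split every `size` characters regardless of structure,
    so conditional logic can be torn across chunk boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def structure_aware_chunks(sections):
    """Structure-aware chunking: keep each labeled decision branch intact and
    carry its heading as metadata, so retrieval preserves the condition."""
    return [{"heading": heading, "text": text} for heading, text in sections]

guideline = (
    "Branch A: If HER2-positive and prior anthracycline exposure, then prefer regimen X. "
    "Branch B: If HER2-negative, then regimen Y is first-line."
)

# The second fixed-size chunk contains "then prefer regimen X." detached from its
# HER2-positive qualifier and mixed with the start of Branch B.
print(fixed_size_chunks(guideline))

print(structure_aware_chunks([
    ("Branch A", "If HER2-positive and prior anthracycline exposure, then prefer regimen X."),
    ("Branch B", "If HER2-negative, then regimen Y is first-line."),
]))
```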

Figure 8 — RAG quality dimensions: what is publicly documented vs. what remains a gap
RAG design dimension | What OE claims | What is independently documented | Gap assessment
Chunking strategy | Graph-based retrieval with knowledge graph traversal (SystemAI); computer vision for figures and tables | Not independently described. The computer vision claim for figures/tables implies structure-aware chunking. | Material gap — chunk boundary design for NCCN decision trees, Cochrane GRADE summaries, and trial subgroup tables is undisclosed
Embedding model | Domain-specific models trained on licensed medical texts | Not independently verified. Claim is plausible given training corpus. | Moderate gap — domain-specific embeddings matter for terms like "significant" (statistical vs. colloquial) and "negative" (test result vs. bad outcome)
Retrieval depth / multi-hop | Graph traversal enables multi-hop reasoning across documents not explicitly linked in any single source | MedXpertQA performance (34% on complex subspecialty) suggests multi-hop capability has real-world limits | Claim-evidence gap — the architecture implies more than performance data supports
Citation verification | Deterministic citation linking — answers rejected if not properly sourced | Confirmed by multiple independent clinical evaluations. Responses consistently include inline citations. | Well documented — but citation presence ≠ citation accuracy (see FM-4)
Recall and precision metrics | Not published | No independent evaluation of recall@k, citation precision, or source-selection quality exists in public literature | Not evaluable — standard retrieval metrics are absent from all public reporting
End-to-end accuracy | 100% USMLE; low-80s on medical QA benchmarks cited by company | Independent: 34% on MedXpertQA (complex subspecialty); high-80s on simpler QA in published peer review | Benchmark-dependent — performance varies significantly by difficulty and domain specificity
Fig. 8. RAG architecture assessment. OpenEvidence and its major competitors are closed systems; public evidence is thin on every dimension except citation behavior and end-to-end accuracy on published benchmarks.

The graph RAG claim examined

The most architecturally interesting element is what OpenEvidence calls SystemAI — a graph-based retrieval layer that maps relationships between biomedical entities (diseases, phenotypes, drugs, biological pathways) and traverses those relational pathways to aggregate evidence across multiple documents. In standard vector RAG, if a physician asks about the clinical significance of a specific CYP2D6 polymorphism on a drug's metabolism and its downstream effect on a comorbidity, the system struggles unless all three concepts appear together in a single source document. Graph RAG is specifically designed to close this multi-hop gap by traversing entity relationships rather than searching for vector similarity alone.
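
The sketch below illustrates the multi-hop idea with a toy entity graph: a breadth-first traversal assembles an evidence path across documents that never mention the first and last concepts together. The entities, relations, and sources are invented and do not represent OpenEvidence's actual SystemAI graph.

```python
from collections import deque

# Toy biomedical knowledge graph: entity -> list of (relation, neighbor, supporting source)
graph = {
    "CYP2D6 poor metabolizer": [("reduces activation of", "codeine", "invented pharmacogenomics source")],
    "codeine": [("inadequate analgesia affects", "chronic pain control", "invented trial A")],
    "chronic pain control": [("modifies management of", "comorbid depression", "invented review B")],
}

def multi_hop_evidence(start, target, max_hops=3):
    """Breadth-first traversal that collects an evidence path across documents,
    even when no single document mentions both the start and target entities."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == target:
            return path
        for relation, neighbor, source in graph.get(entity, []):
            if neighbor not in visited and len(path) < max_hops:
                visited.add(neighbor)
                queue.append((neighbor, path + [(entity, relation, neighbor, source)]))
    return None  # vector-only retrieval would need one document containing both concepts

print(multi_hop_evidence("CYP2D6 poor metabolizer", "comorbid depression"))
```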

This is a real architectural distinction with real clinical value — if implemented correctly. The concern is that complex subspecialty scenarios, where MedXpertQA performance sits at 34%, are precisely the domain where multi-hop graph traversal should provide the most benefit. Either the graph structure is not yet dense enough in the tail-case subspecialty domains tested, or the benefit exists but does not close the gap sufficiently on questions designed to require cross-document reasoning. The publicly available record does not resolve this question, and the company has not published the retrieval-layer architecture in sufficient detail to evaluate it independently.

What the retrieval quality evidence actually supports

The Cambridge Health Alliance prospective study — the most rigorous independent clinical evaluation in the public record — found that OpenEvidence scored well on clarity, relevance, and evidence-based support, but had low impact on altering clinical decision-making. This pattern is consistent with a retrieval system that surfaces the right literature accurately for common queries but does not meaningfully expand clinical reasoning for complex or ambiguous cases. That is not a failure; it is an accurate characterization of what the tool currently does well. The problem arises when the tool is used as if its capabilities extend to the complex-reasoning tier — which is where the architecture claims live.

Where the retrieval quality genuinely shines

For the "long tail" of medical literature — niche queries about rare presentations, recently published guideline updates, drug interactions in uncommon patient populations — OpenEvidence's access to 35 million licensed publications gives it a genuine advantage over curated editorial tools that may not have updated their monographs yet. A UpToDate author team takes weeks to months to incorporate a major new guideline. OpenEvidence can surface the guideline text the day it is published. This is real clinical value for a specific and important use case.

Section 15

OpenEvidence vs. UpToDate Expert AI: Fit-for-Purpose, Not Replacement

Section at a glance
These tools do not compete for the same clinical moment. OpenEvidence is better when you need speed and recent literature. UpToDate Expert AI is better when you need editorial provenance and the assurance that an expert explicitly distinguished evidence from opinion. The cost difference is real: free vs. $530/year. The conflict-of-interest difference is also real: none vs. pharmaceutical advertising at the point of prescribing.
Use OpenEvidence when
  • A major guideline was updated in the last 6 months and you need the new recommendation now
  • The clinical question involves a rare presentation or long-tail literature query unlikely to be in a curated database
  • Speed matters more than editorial provenance — point-of-care rapid synthesis for common clinical decisions
  • Cost is a constraint — free access at LCMHS or in community settings where UpToDate is not available
Use UpToDate Expert AI when
  • You are making a high-stakes management decision where the recommendation needs to survive attending scrutiny
  • You need to know where evidence ends and expert opinion begins — UpToDate authors label this explicitly
  • Your institution has a P&T or compliance concern about pharma advertising adjacent to clinical queries
  • The query is about well-established standards of care for common conditions — UpToDate's editorial depth is an advantage here
The practical rule for LSU trainees
  • If your rotation site provides UpToDate access, use both tools deliberately — not interchangeably
  • Use OE to check whether a guideline has been updated since the UpToDate monograph was last revised
  • Use UpToDate Expert AI for the management plan you will defend to an attending on morning rounds
  • Neither tool eliminates the need to read a primary source for complex or atypical presentations

UpToDate Expert AI is a generative conversational interface built exclusively on UpToDate's own expert-authored, peer-reviewed content repository. It is not trained on the open web, does not draw from raw journal databases, and does not expose a retrieval system in the same way OpenEvidence does. It applies a generative AI layer on top of human-curated clinical summaries that UpToDate has been building for 30 years.

These are two fundamentally different architectures solving the same surface-level problem — answering clinical questions quickly — but with different underlying assumptions about where accuracy lives.

Dimension | OpenEvidence | UpToDate Expert AI
Knowledge base | 35M+ licensed peer-reviewed publications (NEJM, JAMA, Cochrane, NCCN, Wiley, specialty societies) — raw literature, not curated summaries | UpToDate's own expert-authored content library — 30+ years of physician-authored, peer-reviewed clinical summaries explicitly distinguishing evidence from expert opinion
AI architecture | Dynamic RAG over live literature corpus; graph-based retrieval; agentic reasoning (DeepConsult). Answers can surface literature published days ago. | Generative AI layer over a static (update-cycle-dependent) curated corpus. Answers reflect the quality of UpToDate's editorial process, not real-time literature.
Evidence currency | Can surface a new guideline or trial the day it is published in a licensed journal | Currency depends on UpToDate's editorial update cycle — weeks to months for major guideline revisions
Conflict of interest | Pharmaceutical advertising displayed alongside clinical queries. No independent audit of content-ad separation. | No advertising. Subscription model. Human authors explicitly disclose where evidence ends and expert judgment begins.
Cost | Free for NPI-verified U.S. clinicians | ~$530/year individual (U.S.); enterprise institutional pricing
Where it is better | Recent guideline synthesis; niche/rare literature searches; speed; edge cases where evidence exists in literature but not yet in curated databases; free access for under-resourced settings | Standard of care for common conditions; deep clinical reasoning with editorial provenance; institutional governance; no advertising conflict; explicit "expert opinion" labeling; 30 years of physician trust
Hallucination risk | Lower than general-purpose LLMs due to citation grounding; but citation presence does not guarantee clinical accuracy (see FM-4) | Claims elimination of hallucination by confining answers to curated content — plausible but still dependent on whether the curated content covers the query
Who should use it | Any clinician needing fast, cited synthesis at the point of care — especially for recent guidelines or long-tail literature queries. Requires verification for subspecialty or complex presentations. | Attending physicians making high-stakes management decisions; institutions needing editorial accountability; settings where the cost of error is highest
Table 8. OpenEvidence vs. UpToDate Expert AI — head-to-head comparison. Both are closed systems; independent benchmarking of UpToDate Expert AI against OpenEvidence on the same question sets has not been published as of April 2026.

Will dynamic RAG systems replace curated editorial trust?

The question is not which system will win. It is which system is reliable enough to trust for a specific clinical task — point-of-care synthesis, rapid literature exploration, or high-stakes decisions where perfectly governed editorial data matters more than speed.

The honest answer, based on available evidence, is that these tools occupy different reliability zones rather than competing for the same clinical moment. For a hospitalist who needs to know what the most recent ACC guidance says about anticoagulation after left atrial appendage closure in a patient with CKD stage 4, OpenEvidence can surface the answer in seconds with citations. For an attending deciding whether to escalate immunosuppression in a patient with complex inflammatory bowel disease and concurrent infection, UpToDate's expert-authored synthesis — which explicitly separates evidence from editorial judgment — provides a different kind of assurance that RAG over raw literature does not yet replicate.

Very high accuracy in this space always costs something. For UpToDate, the cost is money and editorial lag time. For OpenEvidence, the cost is the pharmaceutical advertising model and the conflict-of-interest architecture that comes with it. Systems that are very good at synthesis will still have edge-case failure modes; the real question is where each one is reliable enough to trust and where human verification remains non-negotiable.

Figure 9 — Use-case reliability zones: where to trust each tool
Fig. 9. Illustrative reliability assessment by use case. Neither tool has published rigorous outcome data in the "high complexity / high stakes" quadrant. Use case fit should guide tool selection, not brand preference.
For LSU trainees specifically

If your program or rotation site provides UpToDate access, use both tools deliberately. OpenEvidence is better when you need fast synthesis of recent literature or want to know if a new guideline has been published since the UpToDate monograph was last updated. UpToDate Expert AI is better when you need the editorial provenance of a recommendation — especially when making a management decision that will require you to defend it to an attending. For high-stakes or complex presentations, verify OpenEvidence outputs against the primary source before acting. This is not a criticism of the tool; it is the appropriate epistemics for any AI-assisted reference at the current state of the technology.

Section 16

The Business Model: The Doximity Playbook Explained

Section at a glance
The tool is free because pharmaceutical companies pay to reach you. That is not a criticism — it is the mechanism. Doximity built the same model and generated $570M in revenue. Understanding it is what separates informed clinical AI adoption from naive adoption. The Doximity Playbook has a ceiling: enterprise health systems will not accept pharma advertising inside their clinical workflows. OpenEvidence must resolve this before it can scale to the institutional revenue the $12B valuation implies.
What the model enables for clinicians
  • Free access for all NPI-verified clinicians — including those in under-resourced community settings
  • No institutional procurement barrier — a physician in a rural Louisiana practice gets the same tool as a Mount Sinai attending
  • Ad revenue cross-subsidizes features (DeepConsult, Visits, Doctor Dialer) that physicians benefit from directly
  • The Veeva/Open Vista move monetizes behavioral data in a direction (pharma commercial) that is somewhat aligned with clinician interests (better trial matching, drug discovery)
What the model creates structurally
  • Pharma advertising at $70–$1,000+ CPM displayed to verified prescribers at point-of-care = structural conflict of interest by design
  • The Practice Fusion precedent means institutional compliance teams will flag this model
  • Free-tier and enterprise-tier models are structurally incompatible at scale — one requires pharma ads, one requires their absence
  • Reading OE's announcements chronologically shows each product launch serving both clinical value and monetization — these are not separable
What LSU clinicians should understand
  • You are the product in the traditional sense — your verified prescriber attention is what is being sold
  • This does not make the answers wrong, but it means you should notice what drug ads appear alongside which queries
  • Ask your department: has anyone documented what advertising we see when querying the platform during rounds?
  • If your program is considering an enterprise contract, ask explicitly whether the enterprise tier removes all pharmaceutical advertising

OpenEvidence follows a strategy that analysts call the "Doximity Playbook." Understanding this model is not optional for clinicians at academic medical centers — it directly affects how you should interpret the platform's incentives, whose interests are being served when you use it, and what the long-term trajectory of the service looks like.

How the model works

Figure 10 — The Doximity Playbook: value flow diagram
Fig. 10. The Doximity Playbook operates as a three-sided market. OpenEvidence builds physician trust and attention via a free, high-quality clinical tool. That verified physician attention is then monetized by selling advertising access to pharmaceutical companies at premium CPMs, with enterprise health-system contracts forming the third side.

The playbook has three moves:

Move 1: Build the audience. Create a genuinely useful free tool for a hard-to-reach, high-value audience — in this case, NPI-verified U.S. prescribers. Distribute it directly to physicians, bypassing the 18-month hospital IT procurement cycle entirely. The tool must be good enough that physicians choose to use it voluntarily, not because their institution told them to. OpenEvidence achieved this — 40%+ of U.S. physicians use it daily because it makes their lives easier, not because of a contract.

Move 2: Monetize the attention. Once you have a verified, credentialed audience at the exact moment of clinical decision-making, pharmaceutical companies will pay extraordinary prices for access. OpenEvidence's CPMs of $70–$1,000+ compare to $5–15 for consumer social media because the context is unique: a verified prescriber is asking a clinical question that may directly inform a prescribing decision within the next 60 seconds. This is not display advertising in the traditional sense — it is advertising at the highest-intent moment in medicine.

Move 3: Use the free tier as a wedge to enterprise. Once physicians love the free tool, hospital Chief Financial Officers and IT administrators are willing to pay for enterprise contracts that embed the tool system-wide with HIPAA-covered enterprise governance. The Mount Sinai deployment is this move — the free tool became the proof of concept; the enterprise contract is the business.

Model component | OpenEvidence | Doximity (original playbook) | UpToDate (traditional model)
Access model | Free for NPI-verified U.S. clinicians | Free for NPI-verified U.S. physicians | ~$530/year individual; institutional enterprise
Primary revenue | Pharma/device advertising at $70–$1,000+ CPM | Pharma advertising at $228 ARPU; $570M TTM revenue | Per-seat subscriptions; $595M revenue
Physician verification | NPI verification — 760K+ registered | NPI verification — 2M+ registered (80%+ of U.S. physicians) | Institutional subscription — user identity less granular
Revenue per user (ARPU) | ~$124 | ~$228 | ~$198 (estimated from $595M / ~3M users)
Enterprise upsell | Health system EHR contracts (Mount Sinai model) | Pharma marketing solutions, telehealth | Core product is enterprise — no upsell required
Advertising conflict | High — pharma ads displayed alongside clinical decision queries | Moderate — pharma ads in professional network context | None — subscription model, no advertising
Table 9. Business model comparison. Doximity's model is the closest historical precedent for OpenEvidence's strategy. Both follow the "build physician trust, monetize physician attention" architecture.

Why this matters for LSU clinicians and trainees

The Doximity Playbook creates a structural reality that is worth stating plainly for trainees: OpenEvidence's revenue depends on pharmaceutical companies paying to reach you at the precise moment you are making clinical decisions. This does not mean the clinical answers are wrong. It does not mean the content is sponsored. The company states the content and advertising systems are separate. What it means is that the business model requires this structural proximity to exist — and that proximity creates a conflict-of-interest architecture that no amount of technical separation fully eliminates.

For an academic medical center clinician, the relevant question is not whether OpenEvidence's individual answers are biased. It is whether the systematic exposure to pharmaceutical advertising at clinical decision moments — repeated hundreds of times per month across 760,000 physicians — shifts prescribing behavior in aggregate, even slightly, even subconsciously. This is not a hypothetical that behavioral economics can easily dismiss.

The Practice Fusion precedent — read this

In January 2020, Practice Fusion — an EHR company with clinical decision support features — agreed to pay $145 million to resolve DOJ criminal and civil investigations after it accepted payments from an opioid manufacturer in exchange for building clinical decision support alerts that recommended extended-release opioids during patient encounters. The alerts were not labeled as sponsored. Physicians did not know their CDS was influencing them toward a specific manufacturer's product. OpenEvidence is not accused of anything comparable. But the Practice Fusion case established the legal and reputational framework within which any academic medical center's compliance and legal teams will review a pharma-advertising-adjacent clinical AI tool. This is not theoretical risk management — it is the direct precedent that institutional lawyers cite.

The announcements page as a business model readout

Reading OpenEvidence's official announcements chronologically from 2023 to April 2026 reveals the business model evolution in real time:

2023–2024 · Phase 1: Build physician trust
Core search engine + free access
First announcement: "Build evidence-based AI for doctors." Series A ($75M). NEJM/JAMA content partnerships announced simultaneously — content quality is the trust signal that makes free access sustainable because it enables the advertising premium.
July 2025 · Phase 2: Deepen the moat
DeepConsult + Series B ($210M)
DeepConsult positions OE as more than a search engine — a PhD-level reasoning agent. This deepens the tool's clinical value and physician stickiness while the advertising revenue compounds.
August–October 2025 · Phase 3: Expand workflow surface area
Visits (ambient scribe) + ACC partnership + Veeva Open Vista
Visits moves OE from search tool to workflow tool — more time in the platform means more advertising inventory. The ACC partnership is both a content moat and a credibility signal. Veeva Open Vista is the first move toward monetizing the behavioral data asset directly to pharma.
Jan–Feb 2026 · Phase 4: Enterprise conversion
Series D ($12B) + Sutter Health EHR + Doctor Dialer
Series D at $12B uses physician scale as leverage for enterprise contracts. Sutter Health is the first major EHR embed. Doctor Dialer adds communications — creating a full-stack physician workflow product that hospitals can license.
March–April 2026 · Phase 5: Revenue cycle monetization
Wiley + NORD + Coding Intelligence + Mount Sinai + Tandem + Dotflows
This is the pivot from "time saved" to "revenue generated." Coding Intelligence and Tandem make the ROI case to hospital CFOs. Mount Sinai is the proof point. Dotflows deepens platform stickiness by making the tool personalizable. Wiley and NORD expand the content moat.
Section 17

Key Risks: Expanded Analysis

Section at a glance
Five risk categories determine whether OpenEvidence remains a safe and durable clinical tool over the next 18–36 months. For clinicians and trainees, the most immediately actionable risks are accuracy (predictable tail-case failures), regulatory (PHI hygiene), and diagnostic deskilling. For program directors and administrators, the business model integrity and competitive durability risks determine whether institutional investment is defensible long-term.
Risks you can actively manage today
  • Accuracy risk: predictable and manageable with the scenario-type table in this section
  • PHI risk: entirely within your control — de-identify all queries on non-BAA accounts
  • Coding compliance risk: review AI-generated MDM rationale before approving any E&M code
  • Deskilling risk: sequence your use (differential first, OE second) — this is a behavioral habit, not a technology fix
Risks that require institutional action
  • Advertising-trust incident risk: requires P&T review and documented institutional position
  • Epic competitive pressure: requires monitoring Epic's Cosmos AI and Art agent roadmap quarterly
  • GV governance risk: requires vendor disclosure before any enterprise contract
  • FDA reclassification: requires legal review if your institution integrates agentic features into clinical workflows
The three things to do this week
  • Trainees: Confirm your BAA status at each rotation site — do not assume enterprise coverage
  • Residents/fellows: Test the tool on a known-difficult case from your specialty boards; document how it fails
  • Program directors: Add OpenEvidence to your next AI governance committee agenda with the advertising-content separation question as the primary item

This section consolidates the key risks identified across this report, organizing them into five categories relevant to different audiences: clinical faculty evaluating the tool, program directors designing curriculum around it, hospital administrators considering enterprise deployment, trainees using it daily, and anyone tracking the platform's long-term sustainability.

Risk 1 — Accuracy and patient safety

The accuracy risk is not uniform. It follows a specific pattern: the platform is reliable for common clinical queries in well-represented domains, and unreliable in specific, predictable ways at the tails. The failure is not random noise — it is systematic overconfidence in domains with thin training signal.

Figure 11 — Accuracy risk by clinical scenario type
Scenario type | Reliability | Why | Verification required?
Common chronic disease management (HTN, T2DM, CAD) | High | Well-represented in training; abundant published evidence; standard guidelines well-indexed | Spot-check citations; acceptable for workflow use
Standard guideline lookup (ACC/AHA, CHEST, NCCN) | High | Licensed source content; guideline structure well-suited to retrieval | Check guideline publication date — verify it is the current edition
Recent literature (published within 6–12 months) | High relative advantage | Licensed corpus updates faster than curated editorial tools; genuine advantage over UpToDate here | Verify the source is the final published version, not a preprint
Drug interactions, common dosing | Moderate | Well-documented interactions are reliable; rare or poorly documented interactions may be missed or misrepresented | Always cross-reference a dedicated drug interaction database (Lexicomp, Micromedex) for high-risk combinations
Complex subspecialty presentations (board-level) | Low (34% on MedXpertQA) | Thin training signal at distribution tails; multi-hop reasoning limited; model does not identify when it is uncertain | Mandatory specialist consultation or primary literature review
Rare disease / orphan condition | Variable | NORD partnership (March 2026) improves rare disease coverage; evidence base inherently sparse | Treat as exploratory — verify with a specialist or disease registry
Pediatric dosing, obstetric management | Moderate-low | Pediatric and obstetric populations are routinely excluded from the RCTs that dominate the training corpus | Always verify against pediatric-specific or obstetric-specific references
Coding (E&M level, CPT) | Moderate — tail risk is high-stakes | Common codes reliable; rare codes and high-complexity MDM assignments are vulnerable to FM-1 and FM-2 | High-confidence coding outputs in rare code territory should route to human review before claim submission
Fig. 11. Accuracy risk by scenario. The pattern is consistent: reliability is highest where training data is densest, lowest where it is sparsest — and the platform does not tell you which situation you are in.

Risk 2 — Competitive pressure and platform durability

Three simultaneous competitive threats could degrade OpenEvidence's position within 18–36 months:

Epic's native AI (Art agent). Epic released AI Charting in February 2026 and FMOL Health signed an enterprise license within weeks. Art provides ambient documentation, note drafting, and order suggestions natively inside Epic Hyperspace — without requiring a separate application. If Epic extends Art's capabilities to include evidence synthesis drawing on its Cosmos dataset (260M+ patient records, 8B+ encounters), OpenEvidence's EHR-embedded value proposition is directly threatened. Epic is simultaneously the primary distribution channel OpenEvidence needs and its most credible long-term competitor.

General-purpose frontier models with healthcare deployments. ChatGPT Health (OpenAI) and Claude for Healthcare (Anthropic) are HIPAA-compliant and targeting physician workflows. They run on public data (PubMed) rather than OpenEvidence's licensed NEJM/JAMA corpus, which is the structural buffer today. The buffer narrows if frontier labs negotiate their own journal licensing agreements — a possibility the source documents flag as a monitored risk but not yet an observed event.

UpToDate Expert AI. UpToDate has 3 million users, deep EHR integration, 30 years of physician trust, no advertising conflict, and has now deployed a generative AI interface on its own curated corpus. For physicians at institutions with UpToDate enterprise licenses, the marginal utility of OpenEvidence narrows — particularly for high-stakes clinical decisions where editorial provenance matters.

Risk 3 — Monetization, physician trust, and the advertising conflict

This is the risk that most institutional compliance teams will flag first and that OpenEvidence's investor narrative addresses least directly. Three sub-risks are worth separating:

3a. Trust erosion from a single incident. The company's $12 billion valuation is priced on physician trust staying intact. If a credible investigative report, regulatory inquiry, or peer-reviewed publication demonstrates a statistically significant association between pharmaceutical advertising exposure on OpenEvidence and prescribing behavior — even a small, directional effect — the trust premium collapses. This is not a theoretical scenario; it is exactly the question that the longitudinal prescribing behavior data inside OpenEvidence could answer and that the company has not published.

3b. The Outcome Health structural parallel. Outcome Health, a point-of-care health media company funded by pharmaceutical advertising, saw its founders face criminal charges for fraudulent ad metrics. OpenEvidence is not accused of comparable conduct. But the structural architecture — pharmaceutical companies paying to reach physicians at clinical decision moments inside a tool physicians trust to be unbiased — is identical. Institutional compliance teams will note this parallel during any contract review.

3c. The advertising-enterprise contradiction. OpenEvidence cannot simultaneously be the pharma-advertising-funded free tool that 40% of physicians use voluntarily AND the enterprise-grade clinical AI that academic medical centers deploy system-wide under governance frameworks. Health system compliance teams will not approve pharma advertising in clinical decision workflows. The company must resolve this structural fork — likely through a tiered architecture with an explicitly ad-free enterprise version — before institutional deployment can scale to the level the $12B valuation implies.

Risk 4 — Diagnostic deskilling in medical education

This is the risk most visible to your GME program directors and least visible to hospital administrators. The academic concern is not that the tool gives wrong answers. It is that providing the right answer too quickly — before a trainee has engaged in the cognitive work of formulating a differential and constructing a management plan — removes the productive difficulty through which clinical reasoning develops.

There is no published prospective longitudinal study of OpenEvidence's effect on evidence appraisal skill development in trainees. Given that 40%+ of U.S. physicians use it daily — many of whom are residents — this gap is a material oversight in the medical education research agenda. The Katz framing is worth repeating: the manual process of following PubMed links, reading methodology sections, understanding evidence hierarchies, and sitting with diagnostic uncertainty before resolving it is inefficient. It is also how clinical judgment forms. OpenEvidence compresses that process. Whether compression aids or impairs the formation of clinical expertise over time is an empirical question that the field has not answered.

Risk 5 — Regulatory trajectory

OpenEvidence currently sits outside the FDA premarket notification pathway by positioning itself as a "support" tool that enables clinicians to independently review recommendations. As the platform expands into agentic reasoning (DeepConsult), automated differential diagnosis, Coding Intelligence MDM rationale written into permanent clinical notes, and prior authorization generation — functions that increasingly "drive" clinical and billing decisions rather than "inform" them — the gap between the regulatory positioning and the actual clinical function narrows.

The FDA in January 2026 issued updated Clinical Decision Support guidance requiring that AI tools be designed so clinicians can evaluate and question AI recommendations rather than accept them automatically. This guidance was a direct response to documented automation bias concerns. Whether OpenEvidence's current design — where confident, well-cited answers are presented without confidence intervals, uncertainty estimates, or domain-specific caveats about reliability — meets this standard has not been tested in an enforcement context.

Figure 12 — Risk heat map: probability vs. institutional impact
Fig. 12. Risk assessment matrix. Bubble size represents estimated time to materialization. The advertising-trust incident (top right) has low probability but would be immediately existential at current valuation. Epic competitive pressure (center) has the highest combined probability-impact score over the 18-month horizon.
Risk | Probability (12–24 mo) | Impact if realized | What to watch
Epic native AI builds evidence synthesis before OE achieves deep embed | Medium-High | High | Epic Art agent feature roadmap; Cosmos AI guideline integration announcements
Advertising-trust incident (study linking ad exposure to prescribing behavior) | Low | Existential | FDA/FTC regulatory activity; investigative journalism; peer-reviewed prescribing behavior research
GV information access to behavioral dataset | Medium | Critical | Google Health/MedLM clinical AI announcements that suggest behavioral data access
Institutional bans at academic medical centers due to COI concerns | Medium | High | P&T committee and compliance team decisions at major AMCs
FDA reclassification requiring premarket notification for agentic features | Low-Medium | High | FDA CDS guidance updates; enforcement actions against comparable tools
Physician consultation growth plateaus before enterprise revenue is material | Low (currently) | Medium | Monthly consultation metrics; enterprise contract announcements
Frontier model labs negotiate parallel journal licensing agreements | Low-Medium | High | NEJM, JAMA, Cochrane licensing announcements with OpenAI, Anthropic, or Google
Table 10. Risk register. Probability and impact assessments are analytical judgments based on publicly available information, not actuarial estimates.
Section 18

How to Evaluate Any Clinical AI Tool: A Decision Framework for Clinicians

Framework · Clinical AI Evaluation
Before you trust it, interrogate it. Before you deploy it, break it.
Most clinical AI tools are adopted because they are free, fast, and useful in the moment — without anyone asking whether they are safe, durable, or institutionally defensible. This framework forces the three questions that matter before any AI tool enters your clinical workflow.
1  Does it kill a measurable pain?
2  Can you integrate and sustain it?
3  What is the worst failure mode?
The clinical scenario used throughout this guide

You are a third-year internal medicine resident at OLOLRMC. Your program director announces that a company is offering your residency program a free enterprise license for "MedSynth AI" — a new clinical decision support tool that answers point-of-care questions and suggests ICD-10 codes from your clinical notes. You are asked to evaluate it before the program commits. Here is how to do that correctly.

1
Does it kill a measurable pain?
Your job: articulate one specific problem with a number attached to it. If you cannot measure the pain, you cannot measure whether the tool fixed it.

Before evaluating any tool, complete this sentence: "[Tool] reduces [specific metric] caused by [concrete problem] in [precise context]."

Resident scenario — good answer
"MedSynth AI reduces time spent on ICD-10 coding after clinic — currently 20–30 minutes per half-day session — by suggesting codes from my transcribed notes. That frees time for chart review or post-call rest."

Specific metric. Concrete context. Plausible mechanism. This is a real pain point.
Resident scenario — vague answer (push back on this)
"It will make us more efficient and improve our clinical decisions."

This is aspiration, not a pain point. What is broken right now? What number is suffering? Rephrase until you have a specific sentence.

Once you have the sentence, drill into three numbers before proceeding (a minimal worked sketch of this arithmetic follows the list):

  • Baseline today: What is the actual current number? (e.g., 25 min/session × 4 sessions/week = 100 min/week on coding)
  • Target: What improvement justifies adopting this tool? Require at least 20–30% (e.g., reduce to 70 min/week)
  • 14-day measurement plan: How will you know by day 14 if it worked? (Track actual coding time with a timer for two pre-pilot weeks and two pilot weeks)
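To make those three numbers concrete, here is a minimal worked sketch in Python using the hypothetical MedSynth AI coding-time scenario. The timer readings, the 30% target, and the kill logic are illustrative assumptions, not data from any real pilot.

```python
# Worked sketch of the three numbers above, using the hypothetical MedSynth AI
# coding-time example. All values are illustrative, not measured data.

BASELINE_MIN_PER_SESSION = 25   # timer-measured average over two pre-pilot weeks
SESSIONS_PER_WEEK = 4
TARGET_REDUCTION = 0.30         # require at least a 20-30% improvement

baseline_weekly = BASELINE_MIN_PER_SESSION * SESSIONS_PER_WEEK   # 100 min/week
target_weekly = baseline_weekly * (1 - TARGET_REDUCTION)         # 70 min/week

# Hypothetical timer readings from the two pilot weeks (minutes per half-day session)
pilot_sessions = [22, 18, 19, 24, 17, 16, 21, 18]
pilot_weeks = len(pilot_sessions) / SESSIONS_PER_WEEK
pilot_weekly = sum(pilot_sessions) / pilot_weeks

print(f"Baseline: {baseline_weekly} min/week | Target: {target_weekly:.0f} min/week"
      f" | Pilot: {pilot_weekly:.1f} min/week")

# Day-14 kill criterion: a number, not a feeling.
if pilot_weekly > target_weekly:
    print("Kill criterion met: target not reached. Stop the pilot and document why.")
else:
    print("Target reached. Proceed to the integration and failure-mode questions.")
```

If the pilot number does not clear the target by day 14, that is the kill criterion firing: document it and stop.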
Stop the evaluation if you hear any of these
  • "It could improve efficiency" — could is potential, not reality. What does it do today for a real user?
  • Multiple problems listed — pick one. Which single pain point does this solve?
  • "We will figure out metrics later" — define the metric now or walk away. No metric = no accountability.
  • No baseline number available — if you cannot measure the pain today, you cannot prove the tool fixed it in 14 days.
2
Can you actually integrate and sustain it?
Most tools fail not because they stop working, but because no one owns them and they break. Name the owner before anything goes wrong.

Assume the tool solves the pain. Now ask: can your environment actually run this in practice?

Resident scenario — the three integration realities
Who owns this tool? At a GME program, the answer cannot be "the residents." Name a faculty champion or program coordinator who is accountable — who trains new interns each July, who fields problems, who decides whether to renew. If no one is named before launch, the tool will be abandoned by October.

What behavior change does this require of you? MedSynth AI requires opening a tab, reviewing suggested codes, and accepting or modifying them before finalizing your note. That is a workflow interruption. Will you genuinely do this at 11pm post-call? Be honest before recommending it to others.

What happens when it breaks? It is 2am Saturday. The tool is down. Notes are due in three hours. Do you code manually? Does the tool save drafts somewhere? Who do you call? Walk through every step. If you cannot answer this, your program is not ready to depend on the tool.
Integration kill criteria
  • "We will need custom code to connect it to our EHR" — how much? Who maintains that code after the vendor rep moves on?
  • "It requires manual checking a few times a day" — if more than twice a week, the maintenance load is too high for a residency environment.
  • "We have not decided who owns it yet" — no owner means dead in three months. Name a person today or do not start.
  • "It touches Epic, our coding software, and our billing system" — three integration points = three failure surfaces. Is the value worth that complexity?
3
What is the worst failure mode — and can you survive it?
Everything fails. The question is whether your patients, your program, and your institution can survive the specific way this tool fails. Name the worst case before deployment, not after.

Do not say "security incident" or "downtime." Name the concrete worst case — what actually breaks, what the damage is, and who gets hurt.

Resident scenario — naming the worst case for MedSynth AI

Vague (wrong): "It might give us wrong information."

Specific (right): "MedSynth AI suggests a 99215 E&M code with a plausible-sounding MDM rationale for a case that is genuinely a 99213. I am post-call, I review it quickly, and I accept it without re-reading the MDM rationale. The claim is submitted. It is audited three months later. I face a billing compliance finding. My attending is named in the review. The hospital pays a retroactive recoupment."

That is concrete. Plausible. Real consequences for real people.

Once you name it, probe three defenses before recommending deployment (an illustrative sketch of the first two follows the list):

  • Architectural safeguard: What prevents this from happening? (e.g., the tool flags high-complexity E&M suggestions for mandatory review; the MDM rationale cannot be auto-accepted without a physician read)
  • Monitoring: What catches it quickly if the safeguard fails? (e.g., coding audit log reviewed weekly; denial rate tracked monthly with anomaly alerting)
  • Survivability: If all safeguards fail simultaneously, can your program survive this? (A single note compliance finding is survivable with documentation; systematic overcoding across 60 residents for six months is not)
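As an illustration only, the sketch below shows what the first two defenses could look like for the E&M coding worst case named above: a rule that routes high-complexity code suggestions to mandatory human review, and a simple weekly denial-rate check. The code set, the threshold, and the function names are assumptions chosen for the example, not features of OpenEvidence or any vendor.

```python
# Illustrative only: neither function describes an actual OpenEvidence or MedSynth
# feature. Code lists and thresholds are assumptions chosen for the example.

HIGH_COMPLEXITY_EM_CODES = {"99205", "99215", "99223", "99233"}  # assumed review set

def requires_human_review(suggested_code: str) -> bool:
    """Architectural safeguard: high-complexity E&M suggestions cannot be
    auto-accepted and are routed to a mandatory physician read."""
    return suggested_code in HIGH_COMPLEXITY_EM_CODES

def denial_rate_alert(weekly_denials: list[int], weekly_claims: list[int],
                      threshold: float = 1.5) -> bool:
    """Monitoring: flag when the most recent week's denial rate exceeds the
    trailing average by an (assumed, tunable) factor."""
    rates = [d / c for d, c in zip(weekly_denials, weekly_claims) if c]
    if len(rates) < 2:
        return False
    trailing = sum(rates[:-1]) / len(rates[:-1])
    return trailing > 0 and rates[-1] > threshold * trailing

# The safeguard fires on a 99215 suggestion; the monitor fires when the weekly
# denial rate jumps from roughly 4% to 9%.
print(requires_human_review("99215"))                          # True
print(denial_rate_alert([8, 9, 7, 18], [200, 210, 190, 200]))  # True
```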
Failure mode kill criteria
  • "That probably will not happen" — probably is not an acceptable risk framework for billing compliance or patient safety.
  • "We will deal with it if it comes up" — design around it now, or do not deploy.
  • The worst case involves a HIPAA violation, fraudulent billing, or patient harm AND you have no architectural safeguard — do not deploy. Full stop.
  • No monitoring that catches silent errors within 72 hours — AI tools that fail silently are the most dangerous class in clinical settings.

Apply this framework directly to what you now know about OpenEvidence:

Worst case scenario | OE architectural safeguard | Monitoring | Survivable?
Hallucinated drug dose cited confidently (BiPAP case) | Citation grounding only — does not verify dose against FDA label maximum | None published; depends on physician catching it | Depends on vigilance
Wrong E&M level + AI-written MDM rationale → audit triggered | CCI rules engine for code compatibility — not MDM accuracy | Standard billing audit; not AI-specific | Survivable but costly
Prior auth letter misses payer's step therapy requirement; patient waits weeks | None described for payer-specific denial criteria | Payer denial — delayed days to weeks | Patient harm risk from delay
Trainee enters patient name + MRN on free account without BAA | Privacy policy disclaims liability for non-BAA PHI input | None — OE cannot detect this in real time | HIPAA violation — not survivable cleanly
Table 11. Applying the worst-case framework to OpenEvidence. Complete this table for any tool you are evaluating before recommending adoption.
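The last row is worth dwelling on: nothing on the platform side catches PHI entered on a non-BAA account, so the safeguard has to live on your side of the keyboard. The sketch below is a purely illustrative local pre-query screen; the patterns are simplified assumptions, will miss many identifiers (names above all), and are no substitute for de-identification habits or an executed BAA.

```python
import re

# Simplified, assumed patterns; a real screen needs institutional review and will
# still miss free-text identifiers such as patient names.
PHI_PATTERNS = {
    "MRN-like number": re.compile(r"\b(?:MRN[:#\s]*)?\d{7,10}\b", re.IGNORECASE),
    "date of birth":   re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "SSN":             re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def phi_warnings(query: str) -> list[str]:
    """Return the names of PHI-like patterns detected in a draft query."""
    return [name for name, pattern in PHI_PATTERNS.items() if pattern.search(query)]

draft = "68yo M, MRN 04512337, DOB 3/14/1957 - max metoprolol dose in HFrEF?"
found = phi_warnings(draft)
if found:
    print("De-identify before submitting on a non-BAA account:", ", ".join(found))
```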

The evaluation scorecard

Before recommending any AI tool to your program director, complete this scorecard. If any field is blank, the evaluation is not complete. Present this scorecard, not a paragraph of impressions, to your leadership.

  • Pain point in one sentence: must name a specific metric and context
  • Primary metric + baseline number today: what is the actual number right now?
  • Named owner — a person, not a team: who gets paged when it breaks at 2am?
  • 2am incident response — step by step: if this is blank, you are not ready to deploy
  • Worst case scenario — concrete, not vague: name the specific harm, not the category
  • Architectural guardrails against the worst case: not "we will be careful" — what is built in?
  • HIPAA / BAA status confirmed in writing: institution name + BAA execution date
  • 14-day kill criterion — a number, not a feeling: what metric on day 14 means we stop?

Questions to ask the vendor before signing

Question | Good answer | Red flag
"Show me observability from a real production deployment — not a demo." | Actual uptime logs, P95 latency, incident history from a comparable health system | "We can set up a demo" or "our platform is very reliable"
"Show me an actual export file of all data my institution generates." | A real export in a documented, portable format with field definitions | "We can discuss data access in our enterprise agreement"
"What broke for your last three customers who churned?" | Specific, honest post-mortems — what failed and what was done about it | "We have not had any churns" or deflection to a reference call
"What are your false positive and negative rates for [specific clinical function]?" | Published rates with confidence intervals, tuned to a similar clinical population | "Our accuracy is very high" without a number attached to it
"What is the cheapest way to get 80% of this value without your product?" | An honest, specific answer — a vendor confident in their differentiation can answer this directly | Offense, deflection, or "nothing else can do what we do"
"How is your advertising model governed relative to clinical content?" (OE-specific) | "The systems are architecturally separated, confirmed by this independent audit." | "We have a strict internal policy" without an external audit
Table 12. Vendor diligence questions. A vendor who cannot answer questions 1, 3, and 4 with specifics is not ready for clinical workflow integration at your institution.
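"Published rates with confidence intervals" is worth making concrete so you recognize a real answer when you see one. The sketch below computes a 95% Wilson score interval around a hypothetical vendor-reported error rate; the audit counts are invented for illustration.

```python
import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion such as a false positive rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = errors / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - margin, center + margin

# Hypothetical vendor audit: 12 incorrect code suggestions found in 400 reviewed notes.
low, high = wilson_interval(12, 400)
print(f"Error rate: {12 / 400:.1%} (95% CI {low:.1%} to {high:.1%})")
```

A vendor answer that looks like that final line, stated for a clinical population similar to yours, is a good answer; a bare percentage is not.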
The final decision gate

Recommend adoption only when you can honestly say yes to all of these. If you hesitate on any one, the default answer is no. Make the tool earn its way in.

  • You passed all three questions with concrete, specific answers — not aspirations
  • There is a named owner with capacity to maintain the tool — a person, not a team
  • You can measure success in 14 days with a clear kill criterion — a number, not a feeling
  • Pricing makes sense at 10× your pilot scale, and any features gated behind the enterprise tier are costed in writing
  • You can export all your data in a usable, portable format — confirmed, not assumed
  • The worst case is survivable — you have documented architectural guardrails and monitoring, not a plan to "be careful"
  • HIPAA / BAA coverage is confirmed in writing for the specific use case you are deploying — not just "the platform is HIPAA-compliant"
  • Your program's AI governance body — or your hospital's AI Steering Committee — has reviewed and approved this deployment
If all conditions are met: Run the 14-day pilot with hard kill criteria. Set a calendar reminder. Measure ruthlessly. Most tools fail — if this one does, kill it fast and document why. The documentation protects your program and benefits the next evaluation team.
Applying this framework to OpenEvidence right now

If you apply this framework to OpenEvidence at your current rotation site, here is the honest scorecard:
  • The pain point is real and measurable.
  • The ownership question is unresolved at most Louisiana sites — no named institutional owner, no confirmed BAA at any of the four systems covered in this report.
  • The worst-case scenarios are documented in Section 17.
  • The guardrails are partial and unaudited.
  • The 14-day metric is trackable.
  • The pricing is zero for individual access.
  • The lock-in risk is low for search; meaningfully higher if your institution embeds Coding Intelligence into its billing workflow.
Use this as your baseline. Update it as your rotation site's AI governance framework matures.

DISCLAIMER & INSTITUTIONAL STANDING

This assessment is provided strictly for educational and informational purposes. The analysis, failure taxonomy, and strategic evaluations contained herein represent the professional observations of the author and do not constitute an official report, mandate, or clinical directive from LSU Health or any institution named in this report.

Usage of any AI tool should follow individual health system policies. This document does not establish institutional policy for Ochsner, LCMC, FMOL, or Lake Charles Memorial.

Source basis and disclosure
This report synthesizes 18 internal analytical documents, publicly available press releases and news reporting, peer-reviewed studies available via PubMed and medRxiv, and current web-searched information on Louisiana health system AI deployments. Financial figures are from public reporting and have not been independently verified. OpenEvidence was contacted for comment; no response was received before publication. This report does not represent the institutional position of any Louisiana health system or GME program. It is prepared for educational purposes and does not constitute clinical, legal, investment, or regulatory advice. The analyst has no financial relationship with OpenEvidence or any competing platform.

Key data sources: OpenEvidence press releases (2024–2026); Sacra equity research; MobiHealthNews; Healthcare IT News; Fierce Healthcare; Becker's Hospital Review; HealthLeaders Media; Verite News; Nabla Technologies press releases; FMOL Health / Ochsner Health / LCMC Health public communications; Cambridge Health Alliance NCT07199231; medRxiv preprint (MedXpertQA evaluation); 2025 Physicians AI Report; 2026 Hospitalist Survey (JMIR); Epic Systems User Group Meeting announcements 2025–2026.