OpenEvidence is the most widely adopted AI-powered clinical decision support platform among U.S. physicians as of April 2026, reporting daily use by more than 40% of the nation's physician workforce and over 20 million clinical consultations per month. The platform is valued at $12 billion following a $250 million Series D round in January 2026. This report synthesizes available evidence on its technical architecture, clinical performance, business model, competitive positioning, and failure taxonomy. It integrates current AI implementation data from Ochsner Health, LCMC Health, FMOL Health / Our Lady of the Lake, and Lake Charles Memorial Hospital — the four health systems across which readers of this report rotate. The report concludes with specific guidance for medical educators, program directors, residents, and students on appropriate use, required verification practices, PHI compliance, and the risks of automation bias and diagnostic deskilling.
Platform Overview
- Free, immediate access — no procurement wait
- Covers the full clinical day: search, documentation, coding, communications
- Dotflows let you customize for your specialty
- Prior auth automation reduces administrative time
- More functions = more failure points now chained together
- Coding Intelligence errors propagate into billing claims
- The tool you used last month is different from the tool today
- Free access creates HIPAA blind spots for trainees
- Know which product you are using at any given moment
- Never enter PHI unless your institution has an active BAA
- Treat each product function separately — the search engine has different reliability than Coding Intelligence
OpenEvidence is an AI-powered medical search and evidence synthesis platform that answers point-of-care clinical questions by synthesizing peer-reviewed literature from licensed sources and returning responses with inline citations. Access is free for NPI-verified U.S. healthcare professionals. The platform was founded in 2021 by Daniel Nadler — who previously built Kensho, a financial data AI company acquired by S&P Global for $700 million — and Zachary Ziegler (CTO). The company was incubated through the Mayo Clinic Platform Accelerate program, which remains an investor.
The platform's primary clinical use case is real-time evidence retrieval at the point of care. A physician types a natural-language question; the platform returns a synthesized, cited response within seconds. But OpenEvidence in 2026 is considerably more than a search engine: the same platform now spans six distinct clinical functions, covering evidence search, clinical documentation, coding, communications, specialty-customized Dotflows, and prior authorization automation.
Scale and Adoption
- 40%+ of U.S. physicians use it daily — peer validation at scale
- Accounts for 44.9% of all physician AI usage — dominant in a fragmented market
- Adopted equally across experience levels and specialties
- Usage peaks during clinical hours — behavior matches intended use
- Viral adoption bypassed institutional review at most hospitals
- 21% of physicians surveyed are highly skeptical — the concerned minority often has valid concerns
- Adoption speed ≠ safety validation speed
- Psychiatrists raised the most pointed concerns around bias and liability
- Don't let peer adoption substitute for your own critical evaluation
- Ask your program director whether your rotation site has a formal AI use policy
- If no policy exists, operate as if you are on a non-BAA individual account
The 40% daily physician penetration figure is cited by OpenEvidence in press communications, but no published methodology specifies how "daily use" is defined or sampled. It is consistent with independent sources (the 2025 Physicians AI Report across 1,000+ physicians; a 2026 hospitalist survey at a large urban academic center), but should be understood as an approximation, not a precisely validated census figure.
What the independent physician data shows
A 2026 survey of hospitalists at a large urban academic tertiary care center found that 66.7% of respondents used AI in clinical practice, with OpenEvidence used by 51.9% of the total cohort — more than any other tool by a wide margin. The survey found no significant differences in AI usage by years of practice, shift type, sex, or provider designation. The assumption that younger physicians drive adoption disproportionately did not hold.
The 2025 Physicians AI Report (1,000+ physicians, 106 specialties) identified 71 unique AI applications in use. OpenEvidence alone accounted for 44.9% of all reported physician usage — more than all other 70 tools combined.
Physician sentiment
A Sermo poll found 20% of physicians described themselves as very supportive of OpenEvidence, 54% as cautiously open, and 21% as highly skeptical or concerned. Primary care physicians most frequently cited it as a major time-saver. Psychiatrists raised the most concerns — specifically around database preparation, inherent biases, and liability implications.
Funding trajectory
OpenEvidence closed a $250 million Series D in January 2026 at a $12 billion valuation. Google Ventures led the Series B and Series C rounds.
Technology and Architecture
- Trained on licensed NEJM/JAMA full-text — not the open internet or Wikipedia
- Refuses to answer when it cannot source a response (vs. hallucinating)
- NCCN treatment algorithms are retrievable as structured decision logic
- Computer vision models can parse figures and tables from papers
- Graph RAG claim is unverified — MedXpertQA performance suggests limits
- Chunking strategy for structured medical documents not publicly described
- Citation presence does not guarantee citation accuracy or applicability
- No published recall, precision, or retrieval quality metrics
- Always click the citation — verify the source actually supports the claim
- Check the publication date of the cited paper — guidelines evolve
- For NCCN queries, verify the platform is citing the current guideline version
- Treat complex multi-system queries with additional skepticism
OpenEvidence's core AI stack is trained exclusively on licensed medical texts — not the public internet. The architecture is a multi-agent hub-and-spoke system: a central "conductor" AI performs intent analysis and routes each physician query to the most relevant subspecialty model before assembling a final response. The platform describes 160+ subspecialty models. All responses are rejected if they cannot be linked to a verified source citation — the system refuses to answer rather than hallucinate an unsourced response.
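A minimal sketch of how such a conductor-and-spokes router could work, assuming a toy keyword-based intent classifier. The specialty map, `route_query`, and the scoring rule are invented for illustration; OpenEvidence has published no equivalent of its actual routing mechanism.

```python
# Hypothetical hub-and-spoke router: a "conductor" scores the query's
# intent against each subspecialty, then dispatches to the best match.
# All names and keywords here are illustrative placeholders.

SPECIALTY_KEYWORDS = {
    "cardiology": {"statin", "pcsk9", "heart failure", "afib"},
    "oncology": {"chemotherapy", "nccn", "metastatic", "biologic"},
    "endocrinology": {"diabetes", "insulin", "thyroid", "a1c"},
}

def route_query(query: str) -> str:
    """Return the subspecialty whose keywords best match the query."""
    tokens = query.lower()
    scores = {
        specialty: sum(kw in tokens for kw in keywords)
        for specialty, keywords in SPECIALTY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a general handler rather than guess when nothing
    # matches, mirroring the stated refuse-over-hallucinate policy.
    return best if scores[best] > 0 else "general"

print(route_query("PCSK9 inhibitor with statin intolerance"))  # cardiology
```

A production conductor would use a trained intent model rather than keyword overlap, but the routing contract is the same: one classification step, then a specialty-specific retrieval path.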
The backend runs on Google Cloud Platform; the frontend on Next.js with Vercel Fluid compute (a production infrastructure detail confirmed by internal audit records, not marketing claims). The system is multimodal and multicloud, as stated by CEO Daniel Nadler at the JP Morgan Healthcare Conference in January 2026.
Content licensing: the primary structural moat
| Partner | Content scope | Strategic significance |
|---|---|---|
| NEJM Group | Full text, figures, tables from NEJM, NEJM Evidence, NEJM AI, NEJM Catalyst, NEJM Journal Watch — back to 1990 | NEJM named OE "best AI tool for medical information." Formal licensed agreement, not a web scrape. |
| JAMA Network | Full text from JAMA + all 11 specialty journals (Oncology, Neurology, Cardiology, etc.) | Covers the most-cited specialty journals in clinical medicine. |
| NCCN | Treatment algorithms, flowcharts, pathways — including oncological reasoning agents built around guideline structure | NCCN guidelines are the standard of care for oncology. No other AI has licensed the algorithm structure. |
| Cochrane | Full-text systematic reviews and meta-analyses, figures, tables | Highest level of published evidence synthesis. Differentiates OE from PubMed-only competitors. |
| ACC, ADA, AAFP, ACEP, ASAM, AAOS, GINA, NORD, SSO | Clinical guidelines, specialty society standards | Specialty society partnerships require individual negotiation. Breadth matters for cross-specialty queries. |
| Wiley, AMA | Broad peer-reviewed biomedical literature | Corpus expansion — 35M+ publications total. |
These institutions have a structural interest in OpenEvidence succeeding. Unlike OpenAI, Anthropic, or Google — which train competing frontier models and sell to hospital enterprise competitors — OpenEvidence uses licensed content for retrieval, not as training data for a competing AI platform. This distinction makes the licensing relationship less conflicted and more durable than it would be with a general-purpose AI lab.
Graph RAG and multi-hop reasoning: the claim and the gap
OpenEvidence describes its retrieval system — called SystemAI — as a graph-based retrieval-augmented generation architecture. Medical knowledge graphs map relationships between diseases, symptoms, drugs, and biological pathways. The system traverses these relational pathways to aggregate evidence across multiple documents before the generative phase. The intended capability: answering queries that require connecting a genetic marker to a drug's metabolic pathway to a secondary comorbidity — connections that are not explicitly stated in any single source document.
The graph RAG claim is the company's own description of SystemAI. No published independent technical audit of the architecture exists. Critically, the clinical performance data from MedXpertQA (see Section 4) shows the system fails on precisely the multi-system, multi-document reasoning this architecture implies it should handle — suggesting the graph traversal capability may be limited to certain query types, or that it does not close the gap on complex subspecialty reasoning.
Additional technical questions the available evidence does not answer: how the licensed corpus is chunked across document types (NCCN algorithms structured as decision trees require different chunking than NEJM trial reports), whether embedding models are domain-specific or general-purpose, and what the failure recovery architecture looks like for a system processing 20+ million consultations per month.
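To make the multi-hop claim concrete, here is a hypothetical sketch of graph-traversal retrieval. The graph, edge types, and document IDs below are invented for illustration, since no equivalent OpenEvidence structure has been published.

```python
# Hypothetical multi-hop retrieval over a medical knowledge graph:
# starting from a query entity, traverse typed edges breadth-first and
# collect documents attached to the intermediate nodes reached.

GRAPH = {
    "BRCA1": [("increases_risk", "breast cancer")],
    "breast cancer": [("treated_by", "olaparib")],
    "olaparib": [("metabolized_by", "CYP3A4")],
}
DOCS = {
    "breast cancer": ["doc_trial_01"],
    "olaparib": ["doc_label_02"],
    "CYP3A4": ["doc_pk_03"],
}

def multi_hop_docs(start: str, max_hops: int = 2) -> list[str]:
    """Collect documents within max_hops edges of the start entity."""
    frontier, seen, found = [start], {start}, []
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for _relation, neighbor in GRAPH.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    found.extend(DOCS.get(neighbor, []))
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return found
```

The point of the sketch: a third hop (gene to drug to metabolic pathway) only surfaces if traversal depth and edge coverage permit it, which is exactly the capability the MedXpertQA results call into question.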
The Alexandria / Atropos integration
When published literature cannot answer a clinical question — which is the case for an estimated 80% of daily decisions in some specialties — OpenEvidence queries Alexandria, a real-world evidence repository from Atropos Health containing over 10 million observational studies generated from EHR and claims data. A pipeline analysis of approximately 3,000 complex physician questions found that PubMed-based retrieval answered approximately 44% of queries; the Alexandria integration provided actionable answers for an additional 50.1%. These figures come from Atropos Health's own published research and represent the best available evidence, not independent third-party validation of OpenEvidence's specific implementation.
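The two-stage retrieval described above can be sketched as a simple fallback pipeline. The stub indexes, relevance scores, and the `answer` function are illustrative assumptions, not Atropos or OpenEvidence internals.

```python
# Hypothetical two-stage fallback: try the published-literature index
# first; if nothing clears a relevance floor, fall back to a
# real-world-evidence store; refuse rather than guess if both miss.

LITERATURE = {"metformin first line": 0.92}  # query -> relevance (stub)
REAL_WORLD_EVIDENCE = {"gabapentin off-label neuropathy dosing": 0.71}

def answer(query: str, floor: float = 0.5) -> str:
    if LITERATURE.get(query, 0.0) >= floor:
        return "literature"
    if REAL_WORLD_EVIDENCE.get(query, 0.0) >= floor:
        return "real_world_evidence"
    return "unanswered"  # explicit refusal, no fabricated synthesis
```

Note the ordering encodes an evidence hierarchy: published literature always outranks observational real-world evidence when both are available.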
Clinical Performance: What the Evidence Actually Shows
- Very high scores on relevance (3.75/4.0) and clarity (3.55/4.0) in prospective study
- Excellent for validating a hypothesis you've already formed
- Fast retrieval of evidence-based support for documentation
- Strong for common conditions with abundant published evidence
- Impact on altering clinical decision: 1.95/4.0 — it confirms, rarely redirects
- 34% accuracy on complex subspecialty board questions
- Never outputs "I don't know" — generates confident answers regardless
- Only 25% agreement with a comparator AI on the same cases
- Form your differential first, then use OE to check it — not the other way around
- For subspecialty or complex presentations, require primary source verification
- Do not use OE output as a substitute for a specialist consult in your blind spot areas
- Teach and document this hierarchy to your residents
OpenEvidence achieved 100% on the United States Medical Licensing Examination using the Kung et al. dataset — a benchmark drawn from publicly available USMLE Step 1, Step 2, and Step 3 questions in multiple-choice format. The system not only answered correctly but generated accurate reasoning chains explaining the underlying physiology.
This benchmark evaluates encyclopedic recall of established medical facts and recognition of classic presentations. It does not evaluate multi-step heuristic reasoning under diagnostic ambiguity, performance on atypical presentations, or the kinds of clinical judgment exercised by attending physicians managing complex inpatients. No study has independently replicated this result with a different question set.
MedXpertQA: where performance falls
The more informative evaluation used the MedXpertQA dataset — drawn from specialty board examinations, with ten possible answer choices (A through J) to eliminate guessing. Two independent physicians evaluated responses manually.
| Metric | OpenEvidence | Deep Consult (comparator) |
|---|---|---|
| Highest overall accuracy | 34% | 41% |
| Best subsystem performance | 42.8% (muscular) | 55% (digestive) |
| Worst subsystem performance | 21.9% (skeletal) | 30% (respiratory) |
| Evaluator concordance (repeatability) | 77% | 72% |
| Discordance between the two models | 75% (agreement on only 25% of cases) | n/a |
| Fabricated "K" answer (not among choices) | ~2% | 4–6% |
| "I don't know" responses | 0% | 0% |
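For readers auditing similar evaluations, the 25% agreement figure reduces to a per-item comparison of the two models' selected options. The graded answers below are invented solely to show the calculation.

```python
# Inter-model agreement: fraction of items where both models chose the
# same answer option. Answer lists here are illustrative, not study data.

def agreement_rate(answers_a: list[str], answers_b: list[str]) -> float:
    assert len(answers_a) == len(answers_b), "answer lists must align"
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)

model_a = ["A", "C", "J", "B"]
model_b = ["A", "D", "H", "E"]
print(agreement_rate(model_a, model_b))  # 0.25
```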
Point-of-care impact study
A prospective observational cohort study (NCT07199231) at Cambridge Health Alliance enrolled PGY-1 through PGY-6 residents in internal medicine, family medicine, adult psychiatry, and child psychiatry. Complementary retrospective analyses graded OpenEvidence outputs across five clinical domains:
- Relevance to clinical query: 3.75 / 4.0
- Clarity of response: 3.55 / 4.0
- Evidence-based support: 3.35 / 4.0
- Overall physician satisfaction: 3.30 / 4.0
- Impact on altering clinical decision: 1.95 / 4.0
The tool reinforced correct diagnoses and provided citable evidence efficiently. It rarely caught overlooked diagnoses or redirected a physician toward a substantially different management approach. For experienced clinicians, this is appropriate use — rapid validation with sourced backup. For trainees who have not yet formed an independent differential, it removes the cognitive effort that builds clinical reasoning skill over time.
Failure Mode Taxonomy
- Failures cluster at distribution tails — rare codes, complex cases, recent approvals
- Citation guardrail works reliably for source presence
- Common conditions are well-handled and failures are infrequent
- Failure patterns are consistent enough to build guardrails around
- FM-1: Highest confidence = highest error risk at the tails
- FM-2: Model generates a prior auth letter that will get denied
- FM-3: What you say in the exam room shapes the note and the claim
- FM-4: A hallucinated MDM rationale looks identical to a real one
- Be most skeptical when the answer sounds most authoritative
- Audit coding suggestions for complex encounters before submission
- Review prior auth letters against payer criteria, not just clinical logic
- Consider adversarial testing: give the tool a known-difficult case and see how it fails
OpenEvidence's six-product suite creates a chained failure surface: search outputs inform Visits notes, which feed Coding Intelligence suggestions, which populate Tandem prior auth letters, so an error in one function can propagate downstream. The four failure modes below apply across this integrated system.
FM-1 · Inverted U (tail-case overconfidence)
The platform performs well on the middle of the training distribution — common chronic disease management (hypertension, type 2 diabetes, hyperlipidemia), standard E&M coding (99213/99214), routine prior auth letters for formulary-tier drugs. At the tails — rare presentations, uncommon CPT codes, recently approved treatments, complex subspecialty presentations — confidence does not decrease. The model generates confidently phrased, well-cited responses even when its training signal is thin.
Concrete example: A cardiologist queries a PCSK9 inhibitor combination approved eight months ago. OpenEvidence returns a confident, well-cited response based on pre-approval trial data, missing two post-marketing safety signals published after the training cutoff.
Why USMLE benchmarks mask this: USMLE tests the middle of the distribution. MedXpertQA tests the tails — and accuracy there is 21–34%.
Detection difficulty: Hard. Standard benchmarks actively hide this failure mode. Requires specialty-specific adversarial testing with tail cases.
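One way to operationalize the recommended adversarial testing is a small tail-case regression suite. Here `query_model` is a stub standing in for the platform under test, and both cases (questions, expected answers, and the CPT string) are invented for illustration.

```python
# Hypothetical tail-case regression harness: run the model on cases known
# to sit at the distribution tails and report any confident answer that
# contradicts vetted ground truth.

TAIL_CASES = [
    {"q": "dosing for drug approved 6 months ago",
     "expect": "insufficient evidence"},
    {"q": "rare procedural code for complex repair", "expect": "28899"},
]

def query_model(question: str) -> str:
    # Stub: a real harness would call the platform under test here.
    return {
        "dosing for drug approved 6 months ago": "insufficient evidence",
        "rare procedural code for complex repair": "28899",
    }.get(question, "?")

def run_tail_suite() -> list[str]:
    """Return the questions the model answered incorrectly."""
    return [c["q"] for c in TAIL_CASES if query_model(c["q"]) != c["expect"]]
```

The suite's value is in its composition: every case should come from a specialty's own known-difficult territory, where the correct answer is often "insufficient evidence" rather than a confident recommendation.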
FM-2 · Reasoning-output gap
The model's internal chain of thought correctly identifies a risk signal, but the final output overrides it with the statistically dominant response. In coding: the reasoning trace flags multiple comorbidities managed, data reviewed from external records, and high MDM complexity — but the final E&M code lands at 99214 rather than 99215 because 99215 is the statistical minority in training data. In prior authorization: the reasoning chain identifies that the requested biologic has limited evidence for the patient's specific indication variant, but the prior auth letter is confidently written because generating a supporting letter is the task the model was trained to do. The payer will likely deny it.
Detection difficulty: Hard. Requires logging and auditing the reasoning trace separately from the output. If only evaluating outputs — denial rates, coding accuracy — FM-2 is nearly invisible.
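A hypothetical sketch of the trace-versus-output audit this detection requires, assuming access to a logged reasoning trace. The marker phrases, the two-marker threshold, and the field shapes are illustrative assumptions, not a validated audit rubric.

```python
# Hypothetical FM-2 audit: flag encounters where the logged reasoning
# trace supports high-complexity MDM but the emitted E&M code is lower.

HIGH_COMPLEXITY_MARKERS = {
    "high mdm", "multiple comorbidities", "external records reviewed",
}

def flag_reasoning_output_gap(trace: str, final_code: str) -> bool:
    """True when the trace supports 99215 but the output landed lower."""
    trace_l = trace.lower()
    supports_high = sum(m in trace_l for m in HIGH_COMPLEXITY_MARKERS) >= 2
    return supports_high and final_code != "99215"

trace = ("Multiple comorbidities addressed; external records reviewed; "
         "overall high MDM complexity.")
print(flag_reasoning_output_gap(trace, "99214"))  # True -> audit this claim
```

The essential design point survives the simplification: the audit consumes the trace and the output as separate inputs, because comparing outputs alone cannot reveal the gap.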
FM-3 · Social context hijack + pharma ad vector
The Visits system ingests the full physician-patient encounter transcript. What the physician says in the room — how they frame the problem, what they emphasize or dismiss — shapes the note and coding. A physician who says "I think this is her anxiety again" while the patient's troponin is elevated steers the note toward a lower-acuity encounter. Prior notes in the FHIR-integrated chart — characterizations like "drug-seeking" or "frequent flier" — can anchor the current note's framing.
The pharmaceutical advertising vector is a structural version of this: A physician who viewed a diabetes drug advertisement during a previous search session carries that exposure into the next patient encounter. The encounter transcript may reflect the drug's marketed clinical positioning, which then propagates into the prior auth letter. This is not a clinical accuracy benchmark failure — it is a systematic prior-shifting mechanism that no citation guardrail detects.
Detection difficulty: Hard. Requires adversarial test cases where verbal framing contradicts structured lab/vital data.
FM-4 · Guardrail miscalibration
OpenEvidence's safety architecture enforces citation grounding — responses are rejected if they cannot be sourced. This is a citation-level guardrail, not a risk-level guardrail. A response that cites a superseded guideline, accurately summarizes a methodologically flawed study, or presents a case-report-level drug interaction as equivalent to a well-replicated severe interaction will pass every citation check while still being clinically dangerous.
The MDM rationale written by Coding Intelligence into clinical notes is itself the guardrail: if the rationale sounds clinically coherent, it passes. A hallucinated MDM rationale that reads like a real one clears every surface-level review.
A documented BiPAP case: A hospitalist queried OpenEvidence for standard BiPAP settings for respiratory failure. The platform retrieved a specific clinical trial that used a narrow pressure range for its particular cohort, and presented those settings as the universal clinical recommendation. The response had citations. It looked authoritative. The settings were inappropriate for the general patient population.
Detection difficulty: Moderate. Citation presence is auditable; citation accuracy and clinical applicability require human clinical review.
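A citation-currency check is one concrete form a risk-level guardrail could take. The guideline-version table, function, and version strings below are assumptions for illustration; no such check is known to exist in the platform.

```python
# Hypothetical risk-level guardrail: a cited source can exist and still be
# superseded. Compare a citation's guideline version against a locally
# maintained table of current versions (contents here are invented).

CURRENT_GUIDELINES = {"NCCN breast": "v2.2026", "GOLD COPD": "2026"}

def citation_is_current(guideline: str, cited_version: str) -> bool:
    current = CURRENT_GUIDELINES.get(guideline)
    # Unknown guideline -> cannot verify -> treat as stale, force review.
    return current is not None and cited_version == current

print(citation_is_current("NCCN breast", "v1.2024"))  # False -> human review
```

Even this trivial check catches the superseded-guideline variant of FM-4; applicability to the patient in front of you still requires human clinical review.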
| Failure mode | Severity | Detection difficulty | Priority |
|---|---|---|---|
| FM-1 · Inverted U (tail-case overconfidence) | Critical | Hard | 1 (tie) |
| FM-2 · Reasoning-output gap | Critical | Hard | 1 (tie) |
| FM-3 · Social context hijack + pharma ad vector | Critical / High | Hard | 2 |
| FM-4 · Guardrail miscalibration | High | Moderate | 3 |
Business Model and Conflict of Interest
- Company states content and ad systems are "fully unconnected"
- Free access enables use in under-resourced settings
- Advertising revenue cross-subsidizes features that benefit clinicians directly
- No evidence of direct content manipulation has been published
- No independent audit of the content-ad separation claim exists
- Amaro acquisition brought contextual ad targeting in-house (diabetes query → diabetes drug ad)
- Practice Fusion precedent: DOJ paid $145M for undisclosed pharma-influenced CDS
- Longitudinal prescribing behavior data exists inside OE and has not been published
- Notice when ads appear — log the drug category and the query context
- If your institution has a P&T committee, flag the advertising model for review before any enterprise deployment
- Ask: would I trust this answer the same way if I knew who advertised on this query?
Pharmaceutical advertising is the company's primary revenue stream. CPMs range from $70 to over $1,000, targeting 760,000 NPI-verified U.S. prescribers at the precise moment they are answering a clinical question that may inform a prescribing decision. This is the most precisely targeted physician ad inventory in existence.
The structural conflict
OpenEvidence states that "the OpenEvidence information system and the ad display system are fully unconnected systems" and that "advertisements shall not be considered an endorsement." This is a self-attestation. No independent audit of this claim has been published. The Amaro acquisition in September 2025 — an ad-tech startup focused on advertising infrastructure and automation — brought contextual targeting in-house: a diabetes query triggers a diabetes drug advertisement.
When a doctor searches "treatment options for Type 2 diabetes," pharmaceutical companies can surface their FDA-approved treatments right there in the results — Google AdWords meets clinical decision support at the exact moment of prescribing consideration. — Repositioning analysis, April 2026
Practice Fusion, a clinical decision support company, paid a $145 million DOJ settlement for undisclosed pharmaceutical-influenced clinical decision support alerts. OpenEvidence is not accused of any comparable misconduct. But health system legal and compliance teams are aware of this precedent, and it is the lens through which institutional legal review of OpenEvidence deployments will occur. Any academic medical center deploying OpenEvidence enterprise-wide should document its analysis of the advertising-content separation claim before deployment.
The advertising-enterprise contradiction
The ad-supported free model that enabled 40%+ physician adoption is structurally incompatible with enterprise-level institutional deployment. Health system compliance teams routinely require ad-free environments as a contracting standard for clinical AI tools. This creates a structural fork: the two business models — pharma media and enterprise clinical AI — cannot both be primary. OpenEvidence has not publicly resolved this tension with a formally separated product architecture, though enterprise per-seat pricing exists for health systems like Mount Sinai.
| Revenue stream | Estimated size | Structural durability | Risk |
|---|---|---|---|
| Pharma advertising (primary) | $100–150M ARR | Moderate — depends on physician trust staying intact | High regulatory/reputational |
| Enterprise EHR subscriptions | Emerging (Mount Sinai model) | High — if ad-free version available and Epic cooperates | Moderate — Epic gating risk |
| Veeva Open Vista (pharma commercial) | Pilot — first revenue expected 2026 | Potentially high via Veeva channel | Deepens pharma conflict exposure |
| API licensing | Early / stated future stream | High if developed | Low |
Product Suite: Coding Intelligence and Prior Authorization
- Captures commonly missed CPT codes that physicians undercode out of habit
- CCI rules engine reduces claim denials for incompatible code pairs
- Prior auth automation reduces the most universally hated administrative task
- RVU sequencing maximizes reimbursement within compliance rules
- MDM rationale is written into the permanent clinical note — errors become part of the legal record
- High-complexity E&M code assignments are exactly where FM-1 and FM-2 collide
- Prior auth letters may be well-written clinically but miss payer-specific denial criteria
- Physicians bear the compliance liability for AI-suggested codes they approve
- Never auto-sign AI-generated coding without reading the MDM rationale
- For complex encounters, validate E&M level against AMA MDM complexity criteria independently
- Before submitting a prior auth letter, verify it addresses the payer's specific step therapy requirements
- Know that approval of the code is your professional and legal responsibility, not the AI's
OpenEvidence has reoriented from a reference tool into a revenue-generating enterprise asset. Hospital CFOs are more willing to pay for AI that demonstrably captures missed billing revenue than for AI that saves physician time. This is the stated industry logic behind both Coding Intelligence and the Tandem partnership.
Coding Intelligence (launched March 26, 2026)
| Feature | Mechanism | Financial impact |
|---|---|---|
| E&M leveling & MDM rationale | Analyzes visit transcript; suggests E&M level; writes MDM rationale directly into the clinical note | Ensures documentation supports the selected level; reduces successful payer audit challenges |
| CPT code suggestions | Surfaces context-dependent CPT codes based on documented actions; catches uncommon procedural codes | Captures missed reimbursement from habitually under-coded visits |
| RVU-optimized sequencing | Sequences multiple CPT codes by expected RVU impact | Maximizes revenue under Medicare's Multiple Procedure Payment Reduction rules |
| CCI compliance engine | Filters suggested codes through Correct Coding Initiative rules to remove incompatible procedure pairs | Reduces claim denials and compliance flags |
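The CCI filtering step can be sketched as a pair-exclusion pass over suggested codes. The prohibited pairs below are placeholders, not actual Correct Coding Initiative edit data.

```python
# Hypothetical CCI-style filter: drop any suggested code that forms a
# prohibited pair with a code already kept, preserving suggestion order.
# The pair set is invented for illustration.

PROHIBITED_PAIRS = {("11042", "97597"), ("20550", "20551")}

def filter_cci(suggested: list[str]) -> list[str]:
    kept: list[str] = []
    for code in suggested:
        clash = any(
            (code, k) in PROHIBITED_PAIRS or (k, code) in PROHIBITED_PAIRS
            for k in kept
        )
        if not clash:
            kept.append(code)
    return kept

print(filter_cci(["11042", "97597", "99213"]))  # ['11042', '99213']
```

Note the design consequence: because the pass is order-dependent, whichever code is suggested first wins the pair, which is itself a decision a real engine must make deliberately (e.g., by RVU, per the sequencing row above).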
Tandem prior authorization (live April 3, 2026)
The Tandem integration automates the prior authorization workflow in four steps:
1. The physician generates a prescription within the EHR.
2. Tandem's system identifies the required criteria, auto-populates the payer's required form from the OpenEvidence-supported clinical notes, and flags missing information.
3. On denial, the system auto-generates an evidence-backed appeal.
4. On approval, the system routes the prescription to the preferred pharmacy and enrolls the patient in applicable manufacturer savings programs.
The prior auth letter generation is particularly vulnerable to FM-2. The model may identify in its reasoning chain that the requested medication has limited evidence for the patient's specific indication variant — but the prior auth letter it generates is confidently written because generating a supporting letter is the trained task. The letter may be well-constructed clinically and still fail payer review because the model did not address the specific denial criteria for that payer and drug combination. Physicians should review AI-generated prior auth letters against payer-specific criteria before submission.
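The recommended payer-criteria review can be partially automated as a checklist match against the generated letter. The payer name, drug, criteria strings, and letter text below are all invented for illustration; real step-therapy criteria are payer- and plan-specific.

```python
# Hypothetical pre-submission check: verify a generated prior auth letter
# addresses each payer-specific criterion (e.g., step therapy) before it
# goes out. All data here is an illustrative placeholder.

PAYER_CRITERIA = {
    ("acme_health", "adalimumab"): ["methotrexate trial", "tb screening"],
}

def missing_criteria(payer: str, drug: str, letter: str) -> list[str]:
    """Return payer criteria the letter never mentions."""
    letter_l = letter.lower()
    return [c for c in PAYER_CRITERIA.get((payer, drug), [])
            if c not in letter_l]

letter = "Patient failed a 12-week methotrexate trial; starting biologic."
print(missing_criteria("acme_health", "adalimumab", letter))
# ['tb screening'] -> likely denial; revise before submission
```

A keyword checklist cannot judge whether the letter's clinical argument is sound, but it cheaply catches the FM-2 signature: a well-written letter that simply never addresses a required criterion.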
Competitive Landscape
- 98.7% of all AI clinical reference searches — usage dominance is real
- Free vs. $530/year for UpToDate Expert AI
- Licensed content competitors (ChatGPT Health, Claude) cannot access without equivalent negotiations
- Faster synthesis of newly published guidelines than editorially curated tools
- UpToDate Expert AI scores 71/100 vs. OE's 62/100 on clinical reasoning depth
- Epic's Art agent — natively inside the EHR — is a direct threat to OE's embedding strategy
- FMOL Health (your OLOLRMC rotation site) already adopted Epic's native AI scribe
- Dragon Copilot lists OE as a content vendor, not a platform partner
- Use the tool that best fits the clinical moment — not just the most familiar one
- For high-stakes management decisions with established standards of care, UpToDate Expert AI provides stronger editorial provenance
- Pay attention to which AI your rotation system has embedded — the tool in the EHR is the tool you'll actually use
| Platform | Revenue / scale | OE advantage | OE vulnerability |
|---|---|---|---|
| UpToDate (Wolters Kluwer) | $595M revenue, $500/seat, 30 years of physician habit, no advertising | Free; faster synthesis; current guidelines; AI-native interface | UpToDate has no advertising conflict; 30+ years of institutional trust; human editorial curation |
| ChatGPT Health (OpenAI) | 800M weekly users; HIPAA-compliant; institutional distribution | Exclusive journal licensing (NEJM/JAMA) not available to OpenAI; physician behavioral dataset | OpenAI scale; improving clinical capabilities; no advertising conflict |
| Claude for Healthcare (Anthropic) | $19B ARR; 80% enterprise revenue; CMS, ICD-10, PubMed integrations | Licensed private content (NEJM/JAMA vs. public PubMed) | Anthropic's enterprise relationships and Cowork momentum; $380B valuation |
| Doximity (Doximity GPT) | $570M TTM revenue; 80%+ physician penetration; NYSE listed | Deeper clinical decision support; journal licensing; evidence synthesis | Doximity has larger physician network; acquired Pathway Medical ($63M); active litigation with OE |
| AMBOSS | Education-focused; knowledge depth; learning science | Broader workflow integration; real-time evidence | AMBOSS has deeper knowledge structure for learning; complementary not competitive |
| Platform | Threat vector | Risk rating |
|---|---|---|
| Epic (Art / Cosmos AI) | Epic's native AI scribe (Art) released February 2026 with ambient documentation and order suggestions. Cosmos AI trained on 8+ billion patient encounters. Over 200 AI features in development for 2026. Epic is the EHR for Mount Sinai (OE's flagship integration) — if Epic builds native evidence synthesis, OE becomes optional rather than embedded. | Critical |
| Microsoft / Dragon Copilot | OE listed as one of three content reference partners alongside Elsevier and UpToDate. Content vendor position inside Microsoft's platform is replaceable. Microsoft is building native clinical decision support capabilities and has deep Epic integration via Nuance. | High |
| Google Ventures / MedLM | GV is OE's Series B and C lead investor while Google builds a directly competing physician workflow AI. GV board access creates potential information asymmetry around OE's most sensitive asset — the physician behavioral query dataset. This is the most structurally unresolved risk in the entire analysis. | High + governance tension |
| Veeva (Open Vista) | Aligned — not a threat. Veeva is a monetization partner for behavioral data via 1,500+ pharma customers. This relationship converts OE from a margin compressor (between physician and hyperscaler) to a token multiplier for Veeva's infrastructure. | Aligned |
See Section 9 for detailed discussion of Ochsner, LCMC Health, FMOL/OLOLRMC, and Lake Charles Memorial Hospital AI implementations and their relationship to OpenEvidence deployment.
Louisiana Health Systems: What's Deployed and What It Means
- All four systems on Epic — creating interoperability infrastructure for future OE enterprise deployment
- Ochsner's DeepScribe showing real deskilling prevention via ambient documentation (75% adoption, 3–4 min/note)
- LCMC's Nabla and FMOL's Epic Art scribe are reducing documentation burden across your clinical environments
- Louisiana MyChart Central statewide launch shows coordinated health IT investment
- No confirmed OE enterprise BAA at any of your four sites = individual accounts only = HIPAA exposure
- LCMC's Nabla deployment raised patient consent and transparency concerns (reported Jan 2026)
- Multiple AI tools across rotation sites creates inconsistent training and risk environments
- LCMHS is in early AI exploration — least infrastructure support for safe AI use
- Ochsner/LCMC/FMOL: Ask your supervisor if OE is covered under an institutional BAA before querying with clinical context
- LCMHS: Assume no enterprise coverage — de-identify all queries
- All sites: Ask your CMIO or informatics team what the AI governance policy is
The four health systems covered by this report encounter different AI technology landscapes. None of the four systems has publicly announced an OpenEvidence enterprise contract as of April 2026. But all are actively deploying AI in clinical workflows — predominantly ambient documentation — and all operate on Epic, creating the EHR infrastructure through which OpenEvidence can be accessed individually or (if a system contract is executed) enterprise-wide.
In October 2025, Ochsner Health, LCMC Health, Baton Rouge General, North Oaks Health System, FMOL Health, and Covington-based St. Tammany Health jointly launched Epic MyChart Central statewide — a unified patient portal login across all participating Epic organizations. This level of Epic integration across Louisiana health systems creates the interoperability infrastructure for enterprise OpenEvidence deployment, if any of these systems pursue it.
Ochsner Health
Ochsner is the largest nonprofit healthcare provider in Louisiana, operating 47 hospitals and 370+ health and urgent care centers, employing approximately 40,000 team members and 5,000 physicians, and treating 1.6 million patients annually. It is the largest academic medical center in Louisiana and the EHR market leader in the region, operating fully on Epic with AI Steering and Data Governance committees that review every AI deployment.
Ochsner's current AI deployment landscape is dominated by ambient documentation and predictive analytics. In July 2024, Ochsner signed an enterprise agreement with DeepScribe to deploy ambient AI documentation across all 4,700 clinicians at 46 hospitals and 370 centers. The pilot generated 75% clinician adoption during the initial launch, with one Ochsner nephrologist reporting documentation time reduced from "two to three hours a day to three to four minutes per note." An oncology NP noted the platform "captures way more than I'm able to, but writes it so succinctly."
Beyond ambient documentation, Ochsner uses AI for predictive sepsis detection, AI-powered radiologist diagnostic prioritization, pharmacy workflow automation for prior authorizations, AI-assisted patient messaging through Epic (piloted with 100+ clinicians), AI-driven appointment scheduling, and a suite of clinical AI agents for real-time health insights. Clinical AI tools require training before access; the requirement shifted from voluntary to mandatory as use cases grew more complex.
Ochsner's AI Steering Committee reviews every tool against patient privacy, core values, and clinical safety criteria. The DeepScribe ambient documentation platform is integrated with Epic. OpenEvidence is not among Ochsner's publicly announced enterprise AI deployments. Individual physicians may be using it via free NPI-verified access. Any institutional deployment would require Steering Committee review, including an analysis of the pharmaceutical advertising model and the advertising-clinical content separation claim — the same analysis required at any academic medical center.
LCMC Health
LCMC Health is a New Orleans-based, not-for-profit system operating eight hospitals: University Medical Center New Orleans, Children's Hospital New Orleans (Manning Family Children's), East Jefferson General, West Jefferson Medical Center, Touro, Lakeview, Lakeside, and New Orleans East. It serves approximately 1.5 million annual patient visits with 2,800+ employed clinicians and operates in partnership with LSU Health Sciences Center and Tulane University School of Medicine.
LCMC reached HIMSS EMRAM Stage 7 (the highest EHR adoption certification) at University Medical Center and Children's Hospital. In December 2025, LCMC selected Nabla — a French ambient AI company — for a system-wide rollout integrated directly into its Epic EHR. Nabla captures clinician-patient conversations and automatically generates structured clinical documentation, with at-cursor dictation as an additional option. LCMC's CMIO Dr. Damon Dietrich has been explicit about the competitive rationale: "We had to get this to our doctors. We are mission-critical about this. We're going to lose doctors to our competitor."
LCMC's AI adoption was organized in three waves: employed doctors first (November 2025), residents and attending clinicians at affiliated Tulane and LSU academic programs second, and all remaining clinicians (including hesitant users) third. This phased approach means that as a trainee at LCMC Health affiliated with Tulane or LSU, you were likely included in Wave 2 of Nabla deployment.
In January 2026, Verite News reported that LCMC patients were not being explicitly told that their medical visits were being recorded and analyzed by Nabla's AI. LCMC cited Louisiana's one-party consent recording laws (requiring only provider consent, not patient consent, for recording). Nabla states it does not store audio and uses de-identified data. This episode illustrates a broader issue relevant to OpenEvidence: the gap between technical compliance and patient expectations of transparency. Academic institutions should document their consent and disclosure practices for any clinical AI tool, including OpenEvidence, before deployment.
FMOL Health / Our Lady of the Lake Regional Medical Center (OLOLRMC)
FMOL Health (Franciscan Missionaries of Our Lady Health System) includes Our Lady of the Lake in Baton Rouge — an 850-bed Level I trauma center and a primary teaching site for LSU School of Medicine GME programs, consistently named among the best hospitals nationally. OLOLRMC is LSU's Championship Health Partner and in March 2026 performed Louisiana's first single-port transabdominal colorectal surgery. In 2022, it upgraded to a Level I trauma center — the only one in the Capital Region and one of three in Louisiana.
FMOL Health's CIO Will Landry told Becker's in August 2025: "FMOL Health has had a lot of success with ambient listening technologies." In early March 2026 — following a one-month pilot — FMOL signed an enterprise license for Epic's native AI Charting (the "Art" agent), making it one of the earliest adopters of Epic's own ambient scribe, released in February 2026. FMOL's ambulatory CMIO Dr. Bobby Dupre cited the native Epic integration ("the linkage with native Epic functionality is just hard to beat"), lower hallucination rates compared to other ambient AI tools, built-in provider note personalization, and lower long-term maintenance cost as the deciding factors.
FMOL previously held individual licenses for two other AI scribes before selecting Epic's native tool. The enterprise license covers FMOL's nine-hospital system.
FMOL's early adoption of Epic's native AI Charting is the clearest local example of the competitive dynamic this report identifies at the national level: Epic entering ambient documentation directly reduces the space for third-party ambient AI tools. At the same time, Epic's native Art agent handles documentation — it does not provide the evidence synthesis, clinical reference quality, and licensed literature access that OpenEvidence offers. The two tools serve different clinical moments and are likely complementary rather than mutually exclusive at the point-of-care level.
Lake Charles Memorial Hospital (LCMHS)
Lake Charles Memorial is the primary hospital serving southwest Louisiana. It completed an Epic EHR implementation (go-live) and in 2025 began exploring AI initiatives, including automated discharge summaries and care plans built on the Epic infrastructure. This is an earlier stage of AI maturity than the larger New Orleans and Baton Rouge systems: the institution is in the "education and exploration" phase rather than the enterprise rollout phase.
Individual physicians at LCMHS likely use OpenEvidence independently through free NPI-verified access, consistent with the national pattern of bottom-up adoption that preceded any institutional contract at comparable facilities nationally. No enterprise OpenEvidence deployment at LCMHS has been publicly announced.
| System | EHR | Ambient AI | OpenEvidence enterprise status | AI maturity |
|---|---|---|---|---|
| Ochsner Health | Epic (full) | DeepScribe (enterprise, 4,700 clinicians) | Not publicly announced — individual use likely | Advanced |
| LCMC Health | Epic (HIMSS Stage 7) | Nabla (enterprise, system-wide, Epic-integrated) | Not publicly announced — individual use likely; trainees in Wave 2 | Advanced |
| FMOL / OLOLRMC | Epic | Epic AI Charting "Art" (enterprise license, March 2026) | Not publicly announced | Advanced |
| Lake Charles Memorial | Epic (recent go-live) | Exploring AI initiatives — not yet enterprise ambient | Not publicly announced | Developing |
Regulatory, Legal, and HIPAA Considerations
- OpenEvidence is HIPAA-compliant with BAA available since April 2025
- Enterprise accounts (if your institution has one) provide HIPAA coverage for PHI input
- Platform provides citations — making your verification trail documentable
- Without a BAA, any PHI you enter is your sole legal responsibility
- No current "algorithmic malpractice" framework — clinical liability rests entirely with you
- FDA may reclassify expanding agentic features as medical devices requiring clearance
- Litigation with Doximity/Pathway is unresolved — legal basis for OE's trade secret claims is untested
- Confirm BAA status at every rotation site before entering any clinical context
- De-identify all queries on free individual accounts — always
- Document your independent clinical reasoning separately from AI-assisted steps
- Never represent AI output as your own independent clinical judgment in notes
OpenEvidence achieved full HIPAA compliance in April 2025. Covered entities can securely input protected health information, provided the hospital system has executed a Business Associate Agreement (BAA) with OpenEvidence. For individual physicians and trainees using free NPI-verified accounts without a BAA — the default situation for most users — the Privacy Policy explicitly states that any PHI submitted is deemed unintentional and remains the "sole responsibility of the user, for which OpenEvidence disclaims all liability."
Unless you are accessing OpenEvidence through an enterprise HIPAA-covered environment with a formal Business Associate Agreement — such as the Mount Sinai Epic integration or a formally contracted equivalent at your rotation site — do not enter any patient-identifying information into OpenEvidence. Entering identifiable patient data through a free individual account risks a serious HIPAA violation and exposes your organization to liability. De-identify all queries before submission. This is not a preference; it is a legal requirement.
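One way to operationalize "de-identify before submission" is a pre-submission screen that flags and redacts obvious identifier patterns before a query leaves your machine. The sketch below is purely illustrative — the pattern names and function are hypothetical, and regex screening cannot satisfy full HIPAA Safe Harbor de-identification, which covers 18 identifier classes including names and geographic subdivisions that pattern matching cannot reliably catch. Treat it as a last-line sanity check, not a compliance mechanism.

```python
import re

# Illustrative only: a naive screen for a few obvious identifier patterns.
# NOT a complete HIPAA Safe Harbor de-identification.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "mrn_like": re.compile(r"\b(?:MRN|mrn)[:#\s]*\d{5,}\b"),
}

def screen_query(text: str) -> tuple[str, list[str]]:
    """Redact matched identifier patterns and report which classes were hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub(f"[{label.upper()}]", text)
    return text, hits

query = "72M, MRN 4829471, seen 03/14/2026 with new-onset AF; rate vs rhythm control?"
clean, flagged = screen_query(query)
print(clean)    # identifiers replaced with placeholders
print(flagged)  # identifier classes detected
```

The safer habit remains rewriting the query from scratch without patient specifics ("rate vs. rhythm control in new-onset AF in an elderly patient") rather than scrubbing a pasted note.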
FDA regulatory positioning
OpenEvidence currently positions itself as a "support" tool that does not "offer diagnosis or treatment" — a classification that generally avoids FDA premarket notification requirements for higher-risk devices. As the platform expands into DeepConsult agentic reasoning, order-set recommendations, differential diagnosis generation, and Coding Intelligence MDM rationale written into permanent clinical notes, the gap between the regulatory positioning and the actual clinical function narrows. In January 2026, the FDA issued guidance reducing oversight of certain low-risk AI tools while simultaneously requiring clinical decision support tools to be designed so clinicians can evaluate and question AI recommendations rather than accept them automatically. The FDA regulatory ceiling for OpenEvidence's expanded product suite has not been tested.
Litigation: OpenEvidence v. Pathway Medical and Doximity
In February 2025, OpenEvidence sued Pathway Medical (a Canadian company) for trade secret misappropriation, alleging that Pathway used stolen NPI credentials to conduct "prompt injection attacks" and extract OpenEvidence's proprietary system prompts and architecture. In June 2025, OpenEvidence filed a separate suit against Doximity, alleging that Doximity engineers posed as doctors to extract proprietary code via prompt injection; the following month, Doximity acquired Pathway Medical for $63 million. Doximity counter-sued, alleging that OpenEvidence's claims were false and made for self-promotion. Bilateral litigation is ongoing.
A federal judge dismissed the original Pathway lawsuit in June 2025; OpenEvidence filed an amended complaint in August 2025 reframing the allegations as "an elaborate conspiracy." The case is now a groundbreaking test of whether prompt injection through a public interface constitutes trade secret misappropriation under the Defend Trade Secrets Act — a question no court has yet resolved.
Medico-legal liability
OpenEvidence's Terms of Use place the entire burden of clinical judgment on the human end-user. There is currently no legal framework for algorithmic malpractice. If a physician or resident relies on a hallucinated or misinterpreted guideline from OpenEvidence and patient harm results, the liability rests on the human physician for failing to meet the standard of care. The software developers and the corporate entity are shielded from clinical liability. This is not a hypothetical risk — it is the current legal reality for every AI tool in clinical use.
Strategic Position and Structural Durability
- NEJM/JAMA/NCCN licensing — individually negotiated, institutionally trusted, structurally hard to replicate
- Physician behavioral query dataset at 20M+ consultations/month — captures what physicians don't know, in real time
- NPI-verified prescriber identity — a CPM premium that Google and OpenAI cannot manufacture quickly
- Veeva Open Vista — the clearest example of the company monetizing its data asset via an aligned partner
- Citation-grounded RAG — every major competitor already demonstrates this
- Hallucination reduction methods — GPT-5 class models close this gap within 12–18 months
- USMLE 100% — a benchmark test, not a clinical performance validation
- GV is both an investor and a competitor via MedLM — the governance tension is unresolved
- OpenEvidence's content advantage over ChatGPT Health and Claude for Healthcare is real today — but ask vendors to show you how that gap holds in 18 months
- Any institution considering an enterprise contract should request GV's governance documentation before signing
- Watch whether Epic's Cosmos AI acquires guideline licensing — that is the signal that the structural moat is narrowing
What OpenEvidence genuinely owns
Two assets are structurally durable in ways that competitors cannot easily replicate:
The physician behavioral query dataset. Twenty million monthly clinical consultations from 760,000 NPI-verified healthcare professionals at the actual point of care generates data on what physicians are uncertain about — in real time, by specialty, by query type, by institution. This is structurally different from PubMed searches, patient health data, or consumer health queries. It captures clinical uncertainty, not clinical knowledge. This dataset cannot be reconstructed retroactively by any competitor who lacks the physician distribution scale.
Exclusive journal licensing agreements. NEJM, JAMA (all 11 specialty journals), AMA, NCCN, ACC, Cochrane, Wiley, and multiple specialty societies — each required individual institutional negotiation. The NEJM's naming of OpenEvidence as "best AI tool for medical information" reinforces the licensing relationship: NEJM now has a reputational stake in OpenEvidence's clinical performance. These institutions have a structural interest in OpenEvidence succeeding specifically because OpenEvidence does not train competing frontier models on their content — unlike OpenAI, Anthropic, and Google, all of which would be potential licensees with direct competitive conflicts.
What is being described as a moat but isn't
Citation-grounded RAG over medical literature is table stakes — every major competitor can demonstrate it. Hallucination reduction methods are differentiated today but will be closed by GPT-5 class models within 12–18 months. Physician brand trust is real but fragile: it holds only as long as no credibility incident occurs. The USMLE 100% benchmark is impressive but tests a fundamentally different capability from the subspecialty reasoning physicians actually need.
The Google Ventures governance question
Google Ventures led both OpenEvidence's Series B and Series C. Google's MedLM directly targets the same physician workflow. GV board access creates potential information proximity to OpenEvidence's most sensitive and valuable asset — the physician behavioral query dataset. If GV's board materials include meaningful information about how that dataset is structured, queried, or monetized, the structural risk rating shifts from moderate to high. This is the single most important unresolved structural question in any assessment of OpenEvidence's strategic position, and no public information resolves it.
The two races OpenEvidence is currently running
Overall strategic rating: Moderate Risk — Improving. The two races are the race to embed in EHR workflows beyond the pilot stage, and the race to build a second, material revenue stream before the core query product is commoditized. The position is not yet durable. It becomes durable if both races resolve favorably. It becomes high risk if the Google Ventures governance question resolves unfavorably, if the EHR race stalls at pilot stage, or if physician consultation growth plateaus before the second revenue stream is material.
How durable is OpenEvidence, really? A structural assessment
For physicians and educators evaluating whether to trust, teach, or institutionally endorse this platform, the question of sustainability is not academic. A tool embedded in clinical workflows that becomes commercially compromised, acquired, or displaced by a better-funded competitor creates real disruption — to your trainees, your programs, and your governance obligations. What follows is an honest assessment of where OpenEvidence is strong, where it is fragile, and what signals to watch.
The analysis below draws on a structured business evaluation framework used in technology investment. It matters here not because you are investors but because the platform's commercial incentive structure directly determines how it behaves in your clinical environment. A tool with a fragile business model or a compromised advertising relationship does not stay neutral. Understanding where the money comes from, and how durable it is, is part of responsible AI adoption.
Five dimensions of structural strength
A useful way to assess any AI platform's durability is to examine five structural dimensions: how much physicians trust it, how much contextual data it accumulates, how well it distributes to users, how differentiated its content is, and how it manages liability. OpenEvidence scores unevenly across these — and the gaps tell you something important about where the risks concentrate.
Trust — strong but borrowed. OpenEvidence's clinical credibility is high: 40%+ of U.S. physicians use it daily, it is HIPAA BAA-compliant, it is NPI-gated, and it is backed by Mayo Clinic as investor and partner. But the trust physicians place in its answers is largely a transfer from NEJM, JAMA, and NCCN — the sources it cites — rather than trust in OpenEvidence's own editorial judgment. This distinction matters: if a high-profile hallucination surfaces, or if the advertising model becomes publicly visible in a damaging way, the trust has no independent foundation to fall back on. It is strong today and fragile structurally.
Context — wide but shallow. The platform accumulates an extraordinary behavioral dataset — what physicians are uncertain about, at the moment of care, by specialty and query type. This is genuinely valuable. The limitation is that it does not own longitudinal patient data. The deep context moat — the actual EHR data — lives inside Epic and Cerner. OpenEvidence knows what your residents are asking. It does not know what happened to the patient afterward. That limits how far its clinical reasoning can evolve without an EHR partnership.
Distribution — the real asset. This is where OpenEvidence is structurally strongest. It became the default attention layer for U.S. physicians faster than any comparable platform in history. The free-to-physician model bypassed hospital procurement entirely, achieving 65,000+ new verified clinicians per month at zero institutional friction. Pharmaceutical CPMs of $70–$1,000+ confirm that the distribution is genuinely valued — pharma pays that premium because OpenEvidence delivers credentialed prescribers at the precise moment of clinical decision. No social media platform or general consumer health tool achieves this specificity. The vulnerability is that it was built on zero switching cost. Habit is not a contract.
Taste — curated but replicable. The platform's specialty-specific AI architecture and licensed journal curation are meaningfully better than open-internet AI tools. But taste in this context is largely an editorial architecture decision — it reflects the quality of source selection, not proprietary judgment at the point of output. A well-funded competitor with the same licensing agreements could replicate this approach. It is a real differentiator today with a closing window.
Liability — the most structurally thin dimension. OpenEvidence explicitly disclaims clinical responsibility. No licensed professional is accountable for the answers it generates. The platform's legal language is explicit: it "shall not be considered an endorsement" and outputs are "not a substitute for professional medical advice." This is the correct legal posture for the company — but it has a direct implication for clinicians. You carry the liability for any clinical decision made using this tool, regardless of what it told you. The platform's commercial incentive to disclaim liability aligns with your professional obligation to verify — but that alignment is not the same as protection.
Is it at risk of being displaced?
The direct answer is: yes, and the risk is more specific than the market appreciates. OpenEvidence is not a pure middleware play — the distribution position is real and the journal licensing creates genuine friction for replication. But the core product — AI-synthesized answers from medical literature — is exactly the capability that OpenAI Health, Google's MedPaLM successors, and Anthropic's enterprise health offerings are actively building toward. The model is not the moat. The physician habit layer and the journal licensing agreements are the moat.
The critical vulnerability is this: if a foundation model provider signs NEJM and JAMA — or simply negotiates the same content deals — and distributes through a channel physicians already trust (Epic's ambient AI, a GPT-4o health plugin, or a hospital enterprise contract), OpenEvidence's user base is one UX update away from erosion. The free model was brilliant for distribution and left switching costs at zero. That is the core tension.
Google Ventures led OpenEvidence's Series B and Series C funding rounds. Google's MedLM product directly targets the same physician workflow. GV board access means potential information proximity to OpenEvidence's most sensitive asset — the physician behavioral query dataset. This is the single most important unresolved structural question in any assessment of OpenEvidence's long-term independence. No public information resolves it. If you are at an institution considering an enterprise contract, this governance question should be on your due diligence list.
What OpenEvidence is likely to do next — and what it means for you
The following are the five most probable strategic moves OpenEvidence will make to entrench its position. Each is framed not as investment analysis but as a signal worth watching — because each move changes the nature of your relationship with the platform.
1. Persistent physician profiles. OpenEvidence is likely to build persistent physician profiles — specialty-verified query history, CME integration, peer benchmarking ("how does your prescribing pattern compare to similar oncologists?"). The stated purpose will be clinical utility: continuity, personalization, learning. The structural effect is that leaving the platform becomes costly because you lose your history. What to watch: any feature that makes your query log feel like a professional record. Once that data accumulates, it creates a switching cost that free alternatives cannot match — and it deepens the behavioral dataset OpenEvidence sells to pharma partners.
2. Subspecialty guideline lock-up. OpenEvidence already holds NEJM, JAMA, NCCN, ACC, and AMA licensing. The next tier is subspecialty society guidelines in high-liability fields — oncology, cardiology, nephrology — where the guideline is the standard of care. If these are locked as exclusive or first-look partnerships, a competitor AI cannot answer the same question with the same authority even if its model is better. What to watch: announcements from your specialty society about AI content partnerships. If your society's guidelines go exclusively to one platform, it matters for how you evaluate alternatives.
3. Closed-loop prescriber intelligence. The current model shows pharmaceutical ads. The next model tracks whether the physician who queried "second-line EGFR therapy" actually ordered it, and sells that outcomes loop to pharma as closed-loop prescriber marketing intelligence. This is not speculative — it is the same model Doximity built with its prescriber data, generating $570M in annual revenue. What to watch: any feature described as "outcomes tracking," "post-query analytics," or "prescribing pattern benchmarking." If the platform can connect your query to your prescribing behavior, the commercial value of your usage increases dramatically — and the conflict of interest deepens correspondingly.
4. Enterprise institutional embedding. The free model captured individual physicians. The next layer is institutional embedding — where a CMO mandates OpenEvidence as the CDS tool and it appears as a line item in the operating budget. Enterprise contracts create contractual switching costs, audit trails, and institutional reporting that individual habit does not. What to watch: whether your institution is approached about an enterprise contract, and if so, what data-sharing provisions are embedded. An institutional contract with query-level reporting means OpenEvidence can see aggregate clinical uncertainty patterns across your entire physician workforce — which is operationally valuable to you and commercially valuable to them.
5. Specialty-specific verticalization. A horizontal product — one interface for all physicians — is easier to displace than 30 specialty-specific surfaces, each co-branded with the relevant professional society. If "OpenEvidence Oncology" is cited in tumor boards as the reference standard, or "OpenEvidence Cardiology" carries ACC co-branding, a competitor must unseat OpenEvidence in 30 subspecialty markets simultaneously rather than once. What to watch: specialty-specific product launches and society co-branding announcements. Each one represents both a genuine clinical improvement (more curated content) and a deeper entrenchment in that specialty's workflow.
The 10x model test: what happens when AI gets dramatically better?
A useful stress test for any AI platform is to ask: what survives when the underlying model gets ten times more capable — for free — from a frontier provider? For OpenEvidence, the answer is uncomfortable and worth understanding before your institution deepens its dependence on the platform.
What survives a 10x model upgrade: The physician identity graph — 40%+ of U.S. physicians verified and habituated — survives because it is about who the platform reaches, not how good the AI is. The pharmaceutical advertising channel survives for the same reason: CPM premium is a function of audience specificity, not model quality. The journal licensing exclusivity survives if it has been locked before competitors negotiate equivalent deals.
What gets threatened: The core query product — DeepConsult, the evidence synthesis engine, the clinical reasoning layer — is exactly what a GPT-5 class model will match or exceed without requiring OpenEvidence's proprietary architecture. Any differentiation OpenEvidence built on top of raw model capability is at risk of commoditization within 18–24 months.
OpenEvidence's advertising model creates a structural tension that a 10x model upgrade sharpens rather than resolves. A better model gives cleaner, faster answers — which reduces the surface area for ad placement per query. The product's growth and the business model's health may already be in tension: the more useful the AI becomes, the less time physicians spend browsing, and the fewer impressions pharma pays for. Watch whether OpenEvidence responds to this by increasing ad density, embedding ads more deeply in the answer rather than alongside it, or shifting toward outcomes-based pricing. Any of those moves would represent a material change in how commercial incentives intersect with clinical answers.
The bottom line for clinical educators: OpenEvidence is not going away in the next 12–18 months. Its distribution position is real, its content licensing is genuinely differentiated, and its physician adoption is too deep to unwind quickly. But it is not immune to displacement, and the commercial pressures that will intensify as it pursues profitability are directly relevant to the objectivity of the answers it returns. The appropriate institutional posture is: use it deliberately, verify consistently, and watch the business model as closely as you watch the benchmarks.
Guidance for Medical Educators, Program Directors, and Trainees
- Used after independent differential formulation — as a gap check, not a first answer
- Citations traced to primary source — builds evidence appraisal skill rather than shortcutting it
- Prompt engineering taught explicitly — better queries, better outputs, better learning
- Used for common conditions where reliability is documented as high
- Queried before the trainee has formed any independent assessment — automation bias in formation
- Used for subspecialty or complex presentations without mandatory verification
- PHI entered without confirmed BAA — direct legal and reputational exposure
- AI-generated MDM rationale signed in coding without independent review
- Sequence rule: Require trainees to formulate differential first; query second — enforce this in teaching rounds
- BAA confirmation: Post the institutional BAA status at every rotation site so trainees do not guess
- Verification requirement: At least one cited source per AI-assisted plan must be read in full — not just the synthesis
For program directors and department chairs
Before endorsing institutional use or integrating OpenEvidence into a formal curriculum, department leadership should document answers to the following questions:
1. Does the advertising system display inside the EHR-embedded version of OpenEvidence at your institution? If you are at a site with an enterprise contract (e.g., if your system follows the Mount Sinai model), confirm whether pharmaceutical advertising appears in the enterprise workflow. This determines whether the conflict-of-interest analysis applies to your institutional deployment.
2. Has your institution executed a Business Associate Agreement with OpenEvidence? Without a BAA, trainees must treat the platform as a non-HIPAA-covered environment and strictly de-identify all clinical queries.
3. What is the data governance position for clinician query data? OpenEvidence's Privacy Policy permits individual query data to be used for product improvement for non-BAA users. Confirm whether your institutional BAA (if it exists) restricts this use.
4. Has your P&T committee or compliance team reviewed the advertising-content separation claim? At a minimum, document that the claim has been reviewed and the Practice Fusion precedent has been considered.
Why this matters — the technology framework
OpenEvidence is not a passive reference tool. It is a retrieval-augmented generation system that takes your query, searches a licensed corpus of 35 million publications, retrieves semantically relevant chunks, and synthesizes a response using a large language model. That architecture has a specific failure geometry: it is calibrated on the distribution of medical literature, which means it performs well on the center of that distribution — common conditions, well-studied drugs, published guidelines — and poorly at the tails, which is precisely where clinical education is most consequential. Residents encounter tail cases. Board examinations test tail cases. Complex inpatients are tail cases.
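The retrieval-then-synthesis pipeline described above can be sketched in a few lines. Everything here is an illustrative assumption — the names, the bag-of-words overlap score (a stand-in for the learned embeddings and vector indexes real systems use), and the toy corpus — not OpenEvidence's actual implementation. The point of the sketch is the failure geometry: the synthesis step produces fluent output whether retrieval returned rich or sparse evidence.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str   # e.g., the journal citation for the retrieved passage
    text: str

def overlap_score(query: str, chunk: Chunk) -> float:
    """Crude relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.text.lower().split())
    return len(q & c) / max(len(q), 1)

def retrieve(query: str, corpus: list[Chunk], k: int = 3) -> list[Chunk]:
    """Return the k chunks most similar to the query."""
    return sorted(corpus, key=lambda ch: overlap_score(query, ch), reverse=True)[:k]

def synthesize(query: str, chunks: list[Chunk]) -> str:
    # Stand-in for the LLM call. In a real system the model writes prose
    # grounded in (and citing) the retrieved chunks — and, crucially, it
    # writes equally confident prose when the retrieved chunks are marginal.
    cites = "; ".join(ch.source for ch in chunks)
    return f"Answer to {query!r} grounded in: {cites}"

corpus = [
    Chunk("Journal A 2024", "anticoagulation in atrial fibrillation stroke risk"),
    Chunk("Journal B 2023", "rate versus rhythm control in atrial fibrillation"),
    Chunk("Journal C 2021", "sepsis bundle compliance outcomes"),
]
top = retrieve("rhythm control atrial fibrillation", corpus, k=2)
print(synthesize("rhythm control atrial fibrillation", top))
```

Note that `synthesize` never inspects how strong the retrieval scores were: a tail-of-distribution query that matched only weakly relevant chunks flows through the same code path as a well-covered one. That is the architectural reason the output alone cannot tell a trainee whether the evidence base was rich or sparse.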
The deeper pedagogical issue is that the model does not know when it is at a tail. It generates confident, well-formatted, citation-supported text regardless of whether the retrieval surface was rich or sparse. A trainee who has not yet learned to distinguish a confident synthesis of strong evidence from a confident synthesis of weak evidence cannot detect this from the output alone. They must go to the source. The skill of going to the source — and of knowing when to do so — is exactly what residency training is supposed to build. OpenEvidence, if misused, shortcuts precisely that skill.
There is a second technology-driven concern specific to teaching: automation bias compounds faster in junior learners than in experienced clinicians. An attending physician who encounters an OpenEvidence answer that contradicts their clinical gestalt will push back. A PGY-1 who has not yet developed a clinical gestalt has no internal counterweight. The AI answer becomes the reference point, not the check against one. This is the deskilling mechanism — not that the tool gives wrong answers often, but that it removes the productive uncertainty through which clinical pattern recognition develops.
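The retrieval-and-synthesis loop described above, and the "silent tail" failure it hides, can be sketched in a few lines. This is an illustrative toy, not OpenEvidence's implementation: the corpus, the bag-of-words scoring, and the sparsity threshold are all invented for demonstration. The point is that a retrieval pipeline *has* an internal signal of how thin the evidence surface was — and a system that does not expose that signal generates equally confident text either way.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' -- real systems use learned vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented mini-corpus standing in for a licensed literature index.
corpus = [
    "metformin first line therapy type 2 diabetes guideline",
    "anticoagulation non valvular atrial fibrillation guideline",
    "rare ATTR cardiac amyloidosis tafamidis case series",
]

def retrieve(query, k=2, sparse_threshold=0.35):
    q = embed(query)
    scored = sorted(((cosine(q, embed(d)), d) for d in corpus), reverse=True)
    top = scored[:k]
    # The signal a RAG pipeline *could* surface but typically does not:
    # when even the best match is weak, the synthesis rests on thin evidence.
    sparse = top[0][0] < sparse_threshold
    return top, sparse

common, flag_common = retrieve("first line therapy for type 2 diabetes")
tail, flag_tail = retrieve("eculizumab dosing in pregnancy with aHUS")
print(flag_common, flag_tail)  # center-of-distribution query vs. tail query
```

The common query matches the corpus well and the tail query matches nothing — but both would produce fluent, formatted output downstream. The threshold value here is arbitrary; the instructive part is that the flag exists upstream of generation and is discarded before the clinician sees anything.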
How to build the framework into teaching
The pedagogical structure that preserves clinical reasoning while capturing the tool's genuine utility is sequential, not concurrent. Require trainees to formulate a complete differential diagnosis and initial management plan independently before querying the platform. Then use OpenEvidence to audit that plan — checking for guideline updates, rare etiologies the trainee may have omitted, or recent trial data that changes a standard recommendation. This sequence is not a workaround. It is epistemically correct: the AI functions as a fast literature check on a hypothesis already formed, not as the origin of the hypothesis.
For evidence appraisal specifically, require residents to trace at least one OpenEvidence citation per query to the primary source. They should read the methods section and ask: What was the study population? Does it include patients like mine? What was the comparison arm? Was this a pre-specified analysis or a subgroup? This is not a burdensome requirement — it takes five minutes. But it converts the platform from an answer machine into a navigation tool for the primary literature, which is what evidence-based medicine training requires.
For teaching the technology framework itself, consider introducing OpenEvidence explicitly as a RAG system in orientation. Explain that it retrieves from a corpus, synthesizes with a language model, and links citations deterministically — but that citation presence does not guarantee citation accuracy or clinical applicability. Show trainees the BiPAP case: the platform retrieved a real trial, cited it accurately, and presented its cohort-specific parameters as universal recommendations. Ask them to find the gap. That exercise teaches more about clinical AI literacy than any policy document.
Three concrete curriculum structures
Daily rounds structure: Before morning rounds, residents prepare presentations independently — differential, assessment, plan — without AI tools. After presentations, the team uses OpenEvidence together to check for guideline currency on one key question per patient. The attending frames this as "let's see what the literature says" not "let's see what the AI says." This framing matters: it keeps the tool in the role of literature retrieval, not clinical authority.
Journal club integration: Assign residents to query OpenEvidence on the journal club paper's clinical question before reading the paper. Then read the paper. Compare what the AI synthesized to what the actual trial found. This exercise reliably surfaces FM-4 (guardrail miscalibration) — the platform often synthesizes prior literature and misses the nuance of the paper being discussed, even when that paper is in its licensed corpus.
Subspecialty rotation structure: At the start of any subspecialty rotation, have the fellow or attending generate five "tail case" queries — complex, atypical, or rare presentations from their specialty's boards. Run them through OpenEvidence. Review the accuracy together. This calibrates the trainee's trust in the tool for that specialty before they use it independently in clinical decision-making.
For residents, fellows, and students
OpenEvidence is a high-speed medical librarian, not an attending physician. Every output should be treated as a starting point for verification, not a definitive answer. The platform's citation links exist specifically so you can click through to the primary source. Use them, especially for:
- Any dosing or drug interaction recommendation
- Any guideline recommendation that will change your management
- Any query about a recently approved or recently updated treatment
- Any subspecialty query involving a complex or atypical presentation
Remember: the platform scored 21–34% on complex subspecialty board questions. That means, at the subspecialty level, the answer on the screen is wrong roughly two times in three — and the platform will not tell you when that is the case.
Unless you are accessing OpenEvidence through an enterprise-contracted HIPAA-covered environment at your rotation site (confirmed, not assumed), the following rules apply:
- Do not enter any patient name, date of birth, MRN, or other identifying information into OpenEvidence queries.
- Do not copy clinical notes or problem lists into the search field without removing all identifiers.
- Use clinical descriptors only: "63-year-old male with CKD stage 3 and newly started NSAID" — not the patient's name and MRN.
The OpenEvidence Privacy Policy states that PHI submitted through non-BAA individual accounts is deemed unintentional and is the sole responsibility of the user. This is not a technicality. This is your liability as a trainee.
Query phrasing substantially affects output quality. Broad, open-ended questions give the model more room to interpolate — including into domains where evidence is sparse. Constrained, specific queries produce more reliable results.
Better: "Summarize only the 2025 ACC/AHA guideline recommendations for anticoagulation in non-valvular atrial fibrillation in a patient with CKD stage 4"
Worse: "What anticoagulant should I use in a patient with A-fib and kidney disease?"
Including specific temporal parameters (guidelines updated in the last 2 years), specific society names (NCCN, ACC, CHEST), or specific study design preferences (RCT only, systematic review only) constrains the retrieval and reduces the risk of the model synthesizing across evidence of widely varying quality.
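Why constraints help can be shown mechanically. The sketch below is illustrative only — the record schema, field names, and values are invented, not OpenEvidence's internal representation — but it shows how each added constraint (society, recency, study design) shrinks the candidate pool before synthesis, leaving the model less room to interpolate across evidence of mixed quality.

```python
# Invented record schema for demonstration -- not OpenEvidence's actual index.
records = [
    {"society": "ACC/AHA", "year": 2025, "design": "guideline",
     "title": "2025 AF anticoagulation update"},
    {"society": "ESC", "year": 2020, "design": "guideline",
     "title": "2020 ESC AF guideline"},
    {"society": None, "year": 2018, "design": "case report",
     "title": "Warfarin in dialysis: a case"},
]

def constrain(recs, society=None, min_year=None, design=None):
    """Each constraint narrows the retrieval surface; an unconstrained
    query would hand all three records -- guideline, stale guideline,
    and case report -- to the synthesis step on equal footing."""
    out = recs
    if society:
        out = [r for r in out if r["society"] == society]
    if min_year:
        out = [r for r in out if r["year"] >= min_year]
    if design:
        out = [r for r in out if r["design"] == design]
    return out

hits = constrain(records, society="ACC/AHA", min_year=2024, design="guideline")
print([r["title"] for r in hits])  # only the 2025 ACC/AHA update survives
```

The "better" query above works for exactly this reason: naming the society, the year window, and the document type acts as a filter the retrieval layer can honor, whereas the "worse" query leaves all three dimensions open.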
A practitioner who trained on UpToDate's full-text, manually curated, evidence-cited articles followed the PubMed links, read the methodology sections, came to understand why a guideline recommends what it recommends, and developed the habit of sitting with diagnostic uncertainty before resolving it. OpenEvidence compresses that process to 30 seconds.
For an experienced attending managing a common condition, this is efficiency. For a PGY-1 building clinical reasoning frameworks for the first time, this is a potential shortcut through the cognitive work that builds judgment. The Cambridge Health Alliance study found OpenEvidence primarily reinforced pre-existing physician hypotheses rather than redirecting clinical reasoning — which means it is unlikely to catch a wrong differential you've already committed to.
The question to ask yourself every time you open OpenEvidence is: Am I using this to check my thinking, or to replace it? The first is appropriate. The second is where the deskilling risk lives.
Unresolved Questions
- Does pharma advertising display inside the Epic-embedded enterprise version?
- What are the exact terms of the Mount Sinai and Sutter Health EHR agreements?
- What is the BAA scope relative to patient-context-aware queries inside EHR workflows?
- Is there an independent audit of the advertising-content separation claim?
- Does GV board access create information exposure around the behavioral dataset? (structural, not vendor-disclosable)
- Does OpenEvidence use shift prescribing behavior in aggregate? (requires longitudinal study)
- Does OE use impair evidence appraisal skill in trainees over time? (requires prospective GME study)
- Does Coding Intelligence MDM rationale meet AMA complexity criteria deterministically? (requires compliance audit)
- Request written answers to the four vendor-answerable questions before any enterprise contract
- Flag the prescribing behavior and trainee deskilling questions to your GME research committee — these are publishable studies
- Document the unresolved questions explicitly in your AI governance review — accepting known unknowns is different from ignoring them
The following questions are not answered by publicly available information as of April 2026:
| Question | Why it matters | Gap type |
|---|---|---|
| Does pharmaceutical advertising display inside the Epic-embedded version at Mount Sinai — and by extension, at any enterprise deployment? | Determines whether the ad-enterprise contradiction has been resolved in practice or just in theory | Governance |
| What are the contractual terms of the Mount Sinai and Sutter Health EHR agreements? | Revenue share, exclusivity, feature scope, and term length determine how deep the workflow ownership actually is | Commercial |
| Has Google Ventures' board access created any information exposure around the physician behavioral query dataset? | Described as the most important unresolved structural question in the analysis. If GV board materials include meaningful dataset information, the risk rating shifts from moderate to high. | Governance |
| Has the advertising system been independently audited for separation from clinical response generation? | The company's self-attestation is insufficient at $12B valuation and 760K physician users | Regulatory |
| What is the actual chunking strategy for medical literature — particularly for structured documents like NCCN algorithms? | Chunking is the most consequential RAG design decision and the most common source of retrieval failure. Not publicly described. | Technical |
| Are Coding Intelligence MDM rationale outputs validated against AMA MDM complexity criteria deterministically? | A hallucinated MDM rationale that reads like a real one will pass every citation guardrail while potentially constituting a compliance violation | Clinical / compliance |
| What are the longitudinal effects of OpenEvidence use on prescribing behavior? | 20M+ monthly consultations with query-level data on what drugs physicians ask about at the moment of prescribing consideration. This data exists inside OpenEvidence and has not been published. | Public health |
| Does OpenEvidence use affect evidence appraisal skill development in trainees longitudinally? | At 40%+ of U.S. physicians using it daily — many of whom are residents — this is a medical education infrastructure question with no published prospective data | Medical education |
RAG Architecture: A Technical Assessment
- Domain-specific training means clinical terminology is interpreted correctly (e.g., "significant" = statistical, not colloquial)
- Computer vision for figures and tables — can retrieve forest plots and treatment flowcharts, not just prose
- Deterministic citation linking prevents unsourced generation for common queries
- Graph RAG — if working as claimed — handles multi-concept clinical queries better than vector-only retrieval
- Chunking strategy for structured documents (NCCN decision trees, Cochrane GRADE tables) is not publicly described
- No published recall@k, citation precision, or source-selection quality metrics
- Graph RAG performance on MedXpertQA (34%) does not match the multi-hop capability claimed
- Both OE and competitors are closed systems — independent technical audits do not exist
- Treat architecture claims as design statements, not clinical validation
- For queries crossing multiple clinical domains (e.g., genetic marker → drug metabolism → comorbidity presentation), require primary source verification regardless of how confident the synthesis reads
- A well-cited answer is not the same as a correctly sourced and applicable answer — read the cited paper, not just the synthesis
This section applies a RAG evaluation framework to OpenEvidence — not as background reading, but as a working rubric. The core question is not whether OpenEvidence uses RAG. Every major competitor does. The question is how it is implemented across chunking strategy, embedding design, retrieval depth, and citation verification — and where the public evidence is too thin to support confident claims.
Judged on raw retrieval quality, OpenEvidence looks stronger than most alternatives in this space. Low-80s accuracy on an end-to-end medical QA benchmark is respectable. But in a clinical setting, that number alone is not sufficient grounds for unqualified confidence — especially without published retrieval metrics such as recall at k, citation precision, or source-selection quality. UpToDate is stronger as a curated editorial reference, but it is not exposing a retrieval system in the same way. These are different architectures solving different problems.
What RAG actually means in this context
Standard retrieval-augmented generation converts a user query into a high-dimensional vector embedding, searches a database for semantically similar text chunks, retrieves those chunks, injects them into a large language model prompt, and generates a synthesized response. The quality of the output depends almost entirely on three design decisions that happen before the language model sees anything: how the source documents are chunked, what embedding model converts text to vectors, and how the retrieval step selects which chunks to surface.
In medical literature, each of these decisions is non-trivial. A NEJM randomized controlled trial report mixes methods, baseline characteristics, results tables, subgroup analyses, and discussion in a structure that does not naturally align with fixed-size chunking. An NCCN treatment algorithm is a decision tree, not prose — chunking it by token count destroys the conditional logic ("if HER2+ and prior anthracycline exposure, then...") that makes the guideline clinically useful. A Cochrane systematic review has explicitly graded evidence quality (GRADE methodology) in a structured summary format that carries more clinical weight than the narrative text. Splitting any of these at arbitrary boundaries — the "torn textbook problem" in RAG literature — is the most common source of retrieval failure and subsequent hallucination.
| RAG design dimension | What OE claims | What is independently documented | Gap assessment |
|---|---|---|---|
| Chunking strategy | Graph-based retrieval with knowledge graph traversal (SystemAI); computer vision for figures and tables | Not independently described. The computer vision claim for figures/tables implies structure-aware chunking. | Material gap — chunk boundary design for NCCN decision trees, Cochrane GRADE summaries, and trial subgroup tables is undisclosed |
| Embedding model | Domain-specific models trained on licensed medical texts | Not independently verified. Claim is plausible given training corpus. | Moderate gap — domain-specific embeddings matter for terms like "significant" (statistical vs. colloquial) and "negative" (test result vs. bad outcome) |
| Retrieval depth / multi-hop | Graph traversal enables multi-hop reasoning across documents not explicitly linked in any single source | MedXpertQA performance (34% on complex subspecialty) suggests multi-hop capability has real-world limits | Claim-evidence gap — the architecture implies more than performance data supports |
| Citation verification | Deterministic citation linking — answers rejected if not properly sourced | Confirmed by multiple independent clinical evaluations. Responses consistently include inline citations. | Well documented — but citation presence ≠ citation accuracy (see FM-4) |
| Recall and precision metrics | Not published | No independent evaluation of recall@k, citation precision, or source-selection quality exists in public literature | Not evaluable — standard retrieval metrics are absent from all public reporting |
| End-to-end accuracy | 100% USMLE; low-80s on medical QA benchmarks cited by company | Independent: 34% on MedXpertQA (complex subspecialty); high-80s on simpler QA in published peer review | Benchmark-dependent — performance varies significantly by difficulty and domain specificity |
The graph RAG claim examined
The most architecturally interesting element is what OpenEvidence calls SystemAI — a graph-based retrieval layer that maps relationships between biomedical entities (diseases, phenotypes, drugs, biological pathways) and traverses those relational pathways to aggregate evidence across multiple documents. In standard vector RAG, if a physician asks about the clinical significance of a specific CYP2D6 polymorphism on a drug's metabolism and its downstream effect on a comorbidity, the system struggles unless all three concepts appear together in a single source document. Graph RAG is specifically designed to close this multi-hop gap by traversing entity relationships rather than searching for vector similarity alone.
This is a real architectural distinction with real clinical value — if implemented correctly. The concern is that MedXpertQA performance at 34% on complex subspecialty scenarios is precisely the domain where multi-hop graph traversal should provide the most benefit. Either the graph structure is not yet dense enough in the tail-case subspecialty domains tested, or the benefit exists but does not close the gap sufficiently on questions designed to require cross-document reasoning. The publicly available evidence does not resolve this question, and the company has not published the retrieval-layer architecture in sufficient detail to evaluate it independently.
What the retrieval quality evidence actually supports
The Cambridge Health Alliance prospective study — the most rigorous independent clinical evaluation in the public record — found that OpenEvidence scored well on clarity, relevance, and evidence-based support, but had low impact on altering clinical decision-making. This pattern is consistent with a retrieval system that surfaces the right literature accurately for common queries but does not meaningfully expand clinical reasoning for complex or ambiguous cases. That is not a failure; it is an accurate characterization of what the tool currently does well. The problem arises when the tool is used as if its capabilities extend to the complex-reasoning tier — which is where the architecture claims live.
For the "long tail" of medical literature — niche queries about rare presentations, recently published guideline updates, drug interactions in uncommon patient populations — OpenEvidence's access to 35 million licensed publications gives it a genuine advantage over curated editorial tools that may not have updated their monographs yet. A UpToDate author team takes weeks to months to incorporate a major new guideline. OpenEvidence can surface the guideline text the day it is published. This is real clinical value for a specific and important use case.
OpenEvidence vs. UpToDate Expert AI: Fit-for-Purpose, Not Replacement
- A major guideline was updated in the last 6 months and you need the new recommendation now
- The clinical question involves a rare presentation or long-tail literature query unlikely to be in a curated database
- Speed matters more than editorial provenance — point-of-care rapid synthesis for common clinical decisions
- Cost is a constraint — free access at LCMHS or in community settings where UpToDate is not available
- You are making a high-stakes management decision where the recommendation needs to survive attending scrutiny
- You need to know where evidence ends and expert opinion begins — UpToDate authors label this explicitly
- Your institution has a P&T or compliance concern about pharma advertising adjacent to clinical queries
- The query is about well-established standards of care for common conditions — UpToDate's editorial depth is an advantage here
- If your rotation site provides UpToDate access, use both tools deliberately — not interchangeably
- Use OE to check whether a guideline has been updated since the UpToDate monograph was last revised
- Use UpToDate Expert AI for the management plan you will defend to an attending on morning rounds
- Neither tool eliminates the need to read a primary source for complex or atypical presentations
UpToDate Expert AI is a generative conversational interface built exclusively on UpToDate's own expert-authored, peer-reviewed content repository. It is not trained on the open web, does not draw from raw journal databases, and does not expose a retrieval system in the same way OpenEvidence does. It applies a generative AI layer on top of human-curated clinical summaries that UpToDate has been building for 30 years.
These are two fundamentally different architectures solving the same surface-level problem — answering clinical questions quickly — but with different underlying assumptions about where accuracy lives.
| Dimension | OpenEvidence | UpToDate Expert AI |
|---|---|---|
| Knowledge base | 35M+ licensed peer-reviewed publications (NEJM, JAMA, Cochrane, NCCN, Wiley, specialty societies) — raw literature, not curated summaries | UpToDate's own expert-authored content library — 30+ years of physician-authored, peer-reviewed clinical summaries explicitly distinguishing evidence from expert opinion |
| AI architecture | Dynamic RAG over live literature corpus; graph-based retrieval; agentic reasoning (DeepConsult). Answers can surface literature published days ago. | Generative AI layer over a static (update-cycle-dependent) curated corpus. Answers reflect the quality of UpToDate's editorial process, not real-time literature. |
| Evidence currency | Can surface a new guideline or trial the day it is published in a licensed journal | Currency depends on UpToDate's editorial update cycle — weeks to months for major guideline revisions |
| Conflict of interest | Pharmaceutical advertising displayed alongside clinical queries. No independent audit of content-ad separation. | No advertising. Subscription model. Human authors explicitly disclose where evidence ends and expert judgment begins. |
| Cost | Free for NPI-verified U.S. clinicians | ~$530/year individual (U.S.); enterprise institutional pricing |
| Where it is better | Recent guideline synthesis; niche/rare literature searches; speed; edge cases where evidence exists in literature but not yet in curated databases; free access for under-resourced settings | Standard of care for common conditions; deep clinical reasoning with editorial provenance; institutional governance; no advertising conflict; explicit "expert opinion" labeling; 30 years of physician trust |
| Hallucination risk | Lower than general-purpose LLMs due to citation grounding; but citation presence does not guarantee clinical accuracy (see FM-4) | Claims elimination of hallucination by confining answers to curated content — plausible but still dependent on whether the curated content covers the query |
| Who should use it | Any clinician needing fast, cited synthesis at the point of care — especially for recent guidelines or long-tail literature queries. Requires verification for subspecialty or complex presentations. | Attending physicians making high-stakes management decisions; institutions needing editorial accountability; settings where the cost of error is highest |
Will dynamic RAG systems replace curated editorial trust?
The question is not which system will win. It is which system is reliable enough to trust for a specific clinical task — point-of-care synthesis, rapid literature exploration, or high-stakes decisions where perfectly governed editorial data matters more than speed.
The honest answer, based on available evidence, is that these tools occupy different reliability zones rather than competing for the same clinical moment. For a hospitalist who needs to know what the most recent ACC guidance says about anticoagulation after left atrial appendage closure in a patient with CKD stage 4, OpenEvidence can surface the answer in seconds with citations. For an attending deciding whether to escalate immunosuppression in a patient with complex inflammatory bowel disease and concurrent infection, UpToDate's expert-authored synthesis — which explicitly separates evidence from editorial judgment — provides a different kind of assurance that RAG over raw literature does not yet replicate.
Very high accuracy in this space always costs something. For UpToDate, the cost is money and editorial lag time. For OpenEvidence, the cost is the pharmaceutical advertising model and the conflict-of-interest architecture that comes with it. Systems that are very good at synthesis will still have edge-case failure modes; the real question is where each one is reliable enough to trust and where human verification remains non-negotiable.
If your program or rotation site provides UpToDate access, use both tools deliberately. OpenEvidence is better when you need fast synthesis of recent literature or want to know if a new guideline has been published since the UpToDate monograph was last updated. UpToDate Expert AI is better when you need the editorial provenance of a recommendation — especially when making a management decision that will require you to defend it to an attending. For high-stakes or complex presentations, verify OpenEvidence outputs against the primary source before acting. This is not a criticism of the tool; it is the appropriate epistemics for any AI-assisted reference at the current state of the technology.
The Business Model: The Doximity Playbook Explained
- Free access for all NPI-verified clinicians — including those in under-resourced community settings
- No institutional procurement barrier — a physician in a rural Louisiana practice gets the same tool as a Mount Sinai attending
- Ad revenue cross-subsidizes features (DeepConsult, Visits, Doctor Dialer) that physicians benefit from directly
- The Veeva/Open Vista move monetizes behavioral data in a direction (pharma commercial) that is somewhat aligned with clinician interests (better trial matching, drug discovery)
- Pharma advertising at $70–$1,000+ CPM displayed to verified prescribers at point-of-care = structural conflict of interest by design
- The Practice Fusion precedent means institutional compliance teams will flag this model
- Free-tier and enterprise-tier models are structurally incompatible at scale — one requires pharma ads, one requires their absence
- Reading OE's announcements chronologically shows each product launch serving both clinical value and monetization — these are not separable
- You are the product in the classic sense — your verified prescriber attention is what is being sold
- This does not make the answers wrong, but it means you should notice what drug ads appear alongside which queries
- Ask your department: has anyone documented what advertising we see when querying the platform during rounds?
- If your program is considering an enterprise contract, ask explicitly whether the enterprise tier removes all pharmaceutical advertising
OpenEvidence follows a strategy that analysts call the "Doximity Playbook." Understanding this model is not optional for clinicians at academic medical centers — it directly affects how you should interpret the platform's incentives, whose interests are being served when you use it, and what the long-term trajectory of the service looks like.
How the model works
The playbook has three moves:
Move 1: Build the audience. Create a genuinely useful free tool for a hard-to-reach, high-value audience — in this case, NPI-verified U.S. prescribers. Distribute it directly to physicians, bypassing the 18-month hospital IT procurement cycle entirely. The tool must be good enough that physicians choose to use it voluntarily, not because their institution told them to. OpenEvidence achieved this — 40%+ of U.S. physicians use it daily because it makes their lives easier, not because of a contract.
Move 2: Monetize the attention. Once you have a verified, credentialed audience at the exact moment of clinical decision-making, pharmaceutical companies will pay extraordinary prices for access. OpenEvidence's CPMs of $70–$1,000+ compare to $5–15 for consumer social media because the context is unique: a verified prescriber is asking a clinical question that may directly inform a prescribing decision within the next 60 seconds. This is not display advertising in the traditional sense — it is advertising at the highest-intent moment in medicine.
Move 3: Use the free tier as a wedge to enterprise. Once physicians love the free tool, hospital Chief Financial Officers and IT administrators are willing to pay for enterprise contracts that embed the tool system-wide with HIPAA-covered enterprise governance. The Mount Sinai deployment is this move — the free tool became the proof of concept; the enterprise contract is the business.
| Model component | OpenEvidence | Doximity (original playbook) | UpToDate (traditional model) |
|---|---|---|---|
| Access model | Free for NPI-verified U.S. clinicians | Free for NPI-verified U.S. physicians | ~$530/year individual; institutional enterprise |
| Primary revenue | Pharma/device advertising at $70–$1,000+ CPM | Pharma advertising at $228 ARPU; $570M TTM revenue | Per-seat subscriptions; $595M revenue |
| Physician verification | NPI verification — 760K+ registered | NPI verification — 2M+ registered (80%+ of U.S. physicians) | Institutional subscription — user identity less granular |
| Revenue per user (ARPU) | ~$124 | ~$228 | ~$198 (estimated from $595M / ~3M users) |
| Enterprise upsell | Health system EHR contracts (Mount Sinai model) | Pharma marketing solutions, telehealth | Core product is enterprise — no upsell required |
| Advertising conflict | High — pharma ads displayed alongside clinical decision queries | Moderate — pharma ads in professional network context | None — subscription model, no advertising |
Why this matters for LSU clinicians and trainees
The Doximity Playbook creates a structural reality that is worth stating plainly for trainees: OpenEvidence's revenue depends on pharmaceutical companies paying to reach you at the precise moment you are making clinical decisions. This does not mean the clinical answers are wrong. It does not mean the content is sponsored. The company states the content and advertising systems are separate. What it means is that the business model requires this structural proximity to exist — and that proximity creates a conflict-of-interest architecture that no amount of technical separation fully eliminates.
For an academic medical center clinician, the relevant question is not whether OpenEvidence's individual answers are biased. It is whether the systematic exposure to pharmaceutical advertising at clinical decision moments — repeated hundreds of times per month across 760,000 physicians — shifts prescribing behavior in aggregate, even slightly, even subconsciously. This is not a hypothetical that behavioral economics can easily dismiss.
In January 2020, Practice Fusion — an EHR vendor offering clinical decision support — paid a $145 million DOJ settlement after it was found to have accepted payments from an opioid manufacturer in exchange for building clinical decision support alerts that recommended extended-release opioids during patient encounters. The alerts were not labeled as sponsored. Physicians did not know their CDS was influencing them toward a specific manufacturer's product. OpenEvidence is not accused of anything comparable. But the Practice Fusion case established the legal and reputational framework within which any academic medical center's compliance and legal teams will review a pharma-advertising-adjacent clinical AI tool. This is not theoretical risk management — it is the direct precedent that institutional lawyers cite.
The announcements page as a business model readout
Reading OpenEvidence's official announcements chronologically from 2024 to April 2026 reveals the business model evolution in real time.
Key Risks: Expanded Analysis
- Accuracy risk: predictable and manageable with the scenario-type table in this section
- PHI risk: entirely within your control — de-identify all queries on non-BAA accounts
- Coding compliance risk: review AI-generated MDM rationale before approving any E&M code
- Deskilling risk: sequence your use (differential first, OE second) — this is a behavioral habit, not a technology fix
- Advertising-trust incident risk: requires P&T review and documented institutional position
- Epic competitive pressure: requires monitoring Epic's Cosmos AI and Art agent roadmap quarterly
- GV governance risk: requires vendor disclosure before any enterprise contract
- FDA reclassification: requires legal review if your institution integrates agentic features into clinical workflows
- Trainees: Confirm your BAA status at each rotation site — do not assume enterprise coverage
- Residents/fellows: Test the tool on a known-difficult case from your specialty boards; document how it fails
- Program directors: Add OpenEvidence to your next AI governance committee agenda with the advertising-content separation question as the primary item
This section consolidates the key risks identified across this report, organizing them into five categories relevant to different audiences: clinical faculty evaluating the tool, program directors designing curriculum around it, hospital administrators considering enterprise deployment, trainees using it daily, and anyone tracking the platform's long-term sustainability.
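The PHI rule above — de-identify all queries on non-BAA accounts — can be enforced mechanically before anything is pasted into a free-tier tool. A minimal sketch follows; the patterns and function names are illustrative, not a complete de-identification standard, and a crude regex screen like this catches only obvious identifiers.

```python
import re

# Crude pre-submission PHI screen for free-tier AI queries.
# Illustrative only: flags obvious identifiers (SSNs, MRN-like numbers,
# dates, phone numbers). It is NOT a HIPAA Safe Harbor de-identification
# method and will miss names, addresses, and free-text identifiers.

PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN-like number": re.compile(r"\b(?:MRN[:#]?\s*)?\d{7,10}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def phi_flags(query: str) -> list[str]:
    """Return the names of identifier patterns found in a draft query."""
    return [name for name, pat in PHI_PATTERNS.items() if pat.search(query)]

draft = "68M, MRN 10442871, DOB 03/14/1957, admitted with HFrEF exacerbation"
flags = phi_flags(draft)
if flags:
    print("DO NOT SUBMIT — possible PHI:", ", ".join(flags))
```

The habit this sketch encodes is the point: treat every query as publishable text unless your institution has an active BAA covering the account you are logged into.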
Risk 1 — Accuracy and patient safety
The accuracy risk is not uniform. It follows a specific pattern: the platform is reliable for common clinical queries in well-represented domains, and unreliable in specific, predictable ways at the tails. The failure is not random noise — it is systematic overconfidence in domains with thin training signal.
| Scenario type | Reliability | Why | Verification required? |
|---|---|---|---|
| Common chronic disease management (HTN, T2DM, CAD) | High | Well-represented in training; abundant published evidence; standard guidelines well-indexed | Spot-check citations; acceptable for workflow use |
| Standard guideline lookup (ACC/AHA, CHEST, NCCN) | High | Licensed source content; guideline structure well-suited to retrieval | Check guideline publication date — verify it is current edition |
| Recent literature (published within 6–12 months) | High relative advantage | Licensed corpus updates faster than curated editorial tools; genuine advantage over UpToDate here | Verify source is final published version, not preprint |
| Drug interactions, common dosing | Moderate | Well-documented interactions are reliable; rare or poorly-documented interactions may be missed or misrepresented | Always cross-reference a dedicated drug interaction database (Lexicomp, Micromedex) for high-risk combinations |
| Complex subspecialty presentations (board-level) | Low (34% on MedXpertQA) | Thin training signal at distribution tails; multi-hop reasoning limited; model does not identify when it is uncertain | Mandatory specialist consultation or primary literature review |
| Rare disease / orphan condition | Variable | NORD partnership (March 2026) improves rare disease coverage; evidence base inherently sparse | Treat as exploratory — verify with specialist or disease registry |
| Pediatric dosing, obstetric management | Moderate-low | Pediatric and obstetric populations are routinely excluded from the RCTs that dominate the training corpus | Always verify against pediatric-specific or obstetric-specific references |
| Coding (E&M level, CPT) | Moderate — tail risk is high-stakes | Common codes reliable; rare codes and high-complexity MDM assignments are vulnerable to FM-1 and FM-2 | High-confidence coding outputs in rare code territory should route to human review before claim submission |
Risk 2 — Competitive pressure and platform durability
Three simultaneous competitive threats could degrade OpenEvidence's position within 18–36 months:
Epic's native AI (Art agent). Epic released AI Charting in February 2026 and FMOL Health signed an enterprise license within weeks. Art provides ambient documentation, note drafting, and order suggestions natively inside Epic Hyperspace — without requiring a separate application. If Epic extends Art's capabilities to include evidence synthesis drawing on its Cosmos dataset (260M+ patient records, 8B+ encounters), OpenEvidence's EHR-embedded value proposition is directly threatened. Epic is simultaneously the primary distribution channel OpenEvidence needs and its most credible long-term competitor.
General-purpose frontier models with healthcare deployments. ChatGPT Health (OpenAI) and Claude for Healthcare (Anthropic) are HIPAA-compliant and targeting physician workflows. They run on public data (PubMed) rather than OpenEvidence's licensed NEJM/JAMA corpus, which is the structural buffer today. The buffer narrows if frontier labs negotiate their own journal licensing agreements — a possibility the source documents flag as a monitored risk but not yet an observed event.
UpToDate Expert AI. UpToDate has 3 million users, deep EHR integration, 30 years of physician trust, no advertising conflict, and has now deployed a generative AI interface on its own curated corpus. For physicians at institutions with UpToDate enterprise licenses, the marginal utility of OpenEvidence narrows — particularly for high-stakes clinical decisions where editorial provenance matters.
Risk 3 — Monetization, physician trust, and the advertising conflict
This is the risk that most institutional compliance teams will flag first and that OpenEvidence's investor narrative addresses least directly. Three sub-risks are worth separating:
3a. Trust erosion from a single incident. The company's $12 billion valuation is priced on physician trust staying intact. If a credible investigative report, regulatory inquiry, or peer-reviewed publication demonstrates a statistically significant association between pharmaceutical advertising exposure on OpenEvidence and prescribing behavior — even a small, directional effect — the trust premium collapses. This is not a theoretical scenario; it is exactly the question that the longitudinal prescribing-behavior data inside OpenEvidence could answer — data the company has not published.
3b. The Outcome Health structural parallel. Outcome Health, a pharma-ad-supported clinical decision support company, saw its founders face criminal charges for fraudulent ad metrics. OpenEvidence is not accused of comparable conduct. But the structural architecture — pharmaceutical companies paying to reach physicians at clinical decision moments inside a tool physicians trust to be unbiased — is identical. Institutional compliance teams will note this parallel during any contract review.
3c. The advertising-enterprise contradiction. OpenEvidence cannot simultaneously be the pharma-advertising-funded free tool that 40% of physicians use voluntarily AND the enterprise-grade clinical AI that academic medical centers deploy system-wide under governance frameworks. Health system compliance teams will not approve pharma advertising in clinical decision workflows. The company must resolve this structural fork — likely through a tiered architecture with an explicitly ad-free enterprise version — before institutional deployment can scale to the level the $12B valuation implies.
Risk 4 — Diagnostic deskilling in medical education
This is the risk most visible to your GME program directors and least visible to hospital administrators. The academic concern is not that the tool gives wrong answers. It is that providing the right answer too quickly — before a trainee has engaged in the cognitive work of formulating a differential and constructing a management plan — removes the productive difficulty through which clinical reasoning develops.
There is no published prospective longitudinal study of OpenEvidence's effect on evidence appraisal skill development in trainees. Given that 40%+ of U.S. physicians use it daily — many of whom are residents — this gap is a material oversight in the medical education research agenda. The Katz framing is worth repeating: the manual process of following PubMed links, reading methodology sections, understanding evidence hierarchies, and sitting with diagnostic uncertainty before resolving it is inefficient. It is also how clinical judgment forms. OpenEvidence compresses that process. Whether compression aids or impairs the formation of clinical expertise over time is an empirical question that the field has not answered.
Risk 5 — Regulatory trajectory
OpenEvidence currently sits outside the FDA premarket notification pathway by positioning itself as a "support" tool that enables clinicians to independently review recommendations. As the platform expands into agentic reasoning (DeepConsult), automated differential diagnosis, Coding Intelligence MDM rationale written into permanent clinical notes, and prior authorization generation — functions that increasingly "drive" clinical and billing decisions rather than "inform" them — the gap between the regulatory positioning and the actual clinical function narrows.
The FDA in January 2026 issued updated Clinical Decision Support guidance requiring that AI tools be designed so clinicians can evaluate and question AI recommendations rather than accept them automatically. This guidance was a direct response to documented automation bias concerns. Whether OpenEvidence's current design — where confident, well-cited answers are presented without confidence intervals, uncertainty estimates, or domain-specific caveats about reliability — meets this standard has not been tested in an enforcement context.
| Risk | Probability (12–24 mo) | Impact if realized | What to watch |
|---|---|---|---|
| Epic native AI builds evidence synthesis before OE achieves deep embed | Medium-High | High | Epic Art agent feature roadmap; Cosmos AI guideline integration announcements |
| Advertising-trust incident (study linking ad exposure to prescribing behavior) | Low | Existential | FDA/FTC regulatory activity; investigative journalism; peer-reviewed prescribing behavior research |
| GV information access to behavioral dataset | Medium | Critical | Google Health/MedLM clinical AI announcements that suggest behavioral data access |
| Institutional bans at academic medical centers due to COI concerns | Medium | High | P&T committee and compliance team decisions at major AMCs |
| FDA reclassification requiring premarket notification for agentic features | Low-Medium | High | FDA CDS guidance updates; enforcement actions against comparable tools |
| Physician consultation growth plateaus before enterprise revenue is material | Low (currently) | Medium | Monthly consultation metrics; enterprise contract announcements |
| Frontier model labs negotiate parallel journal licensing agreements | Low-Medium | High | NEJM, JAMA, Cochrane licensing announcements with OpenAI, Anthropic, or Google |
How to Evaluate Any Clinical AI Tool: A Decision Framework for Clinicians
You are a third-year internal medicine resident at OLOLRMC. Your program director announces that a company is offering your residency program a free enterprise license for "MedSynth AI" — a new clinical decision support tool that answers point-of-care questions and suggests ICD-10 codes from your clinical notes. You are asked to evaluate it before the program commits. Here is how to do that correctly.
Before evaluating any tool, complete this sentence: "[Tool] reduces [specific metric] caused by [concrete problem] in [precise context]."
A good completion reads: "MedSynth AI reduces after-hours coding time caused by manual E&M level selection in resident continuity clinic notes." Specific metric. Concrete context. Plausible mechanism. This is a real pain point.
A bad completion reads: "MedSynth AI will make our documentation more efficient." This is aspiration, not a pain point. What is broken right now? What number is suffering? Rephrase until you have a specific sentence.
Once you have the sentence, drill into three numbers before proceeding:
- Baseline today: What is the actual current number? (e.g., 25 min/session × 4 sessions/week = 100 min/week on coding)
- Target: What improvement justifies adopting this tool? Require at least 20–30% (e.g., reduce to 70 min/week)
- 14-day measurement plan: How will you know by day 14 if it worked? (Track actual coding time with a timer for two pre-pilot weeks and two pilot weeks)
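The three numbers above reduce to simple arithmetic. A minimal sketch, using the hypothetical coding-time figures from the example (the 20% adoption bar and all sample values are assumptions, not vendor data):

```python
# Baseline, target, and 14-day measurement check for a tool pilot.
# All numbers are hypothetical, matching the coding-time example above.

MIN_IMPROVEMENT = 0.20  # require at least a 20% reduction to justify adoption

def baseline_minutes_per_week(minutes_per_session: float, sessions_per_week: int) -> float:
    """Current pain, measured: time spent on the task per week."""
    return minutes_per_session * sessions_per_week

def pilot_verdict(baseline: float, measured_pilot: float) -> str:
    """Compare the pre-pilot weekly average against the pilot weekly average."""
    reduction = (baseline - measured_pilot) / baseline
    if reduction >= MIN_IMPROVEMENT:
        return f"ADOPT? maybe — {reduction:.0%} reduction meets the {MIN_IMPROVEMENT:.0%} bar"
    return f"WALK AWAY — {reduction:.0%} reduction is below the {MIN_IMPROVEMENT:.0%} bar"

baseline = baseline_minutes_per_week(25, 4)   # 100 min/week on coding
print(pilot_verdict(baseline, 70))            # a 30% reduction clears the bar
print(pilot_verdict(baseline, 92))            # an 8% reduction does not
```

The discipline is in measuring both windows the same way — a timer during two pre-pilot weeks and two pilot weeks — so the comparison is numbers against numbers, not impressions against marketing.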
- "It could improve efficiency" — could is potential, not reality. What does it do today for a real user?
- Multiple problems listed — pick one. Which single pain point does this solve?
- "We will figure out metrics later" — define the metric now or walk away. No metric = no accountability.
- No baseline number available — if you cannot measure the pain today, you cannot prove the tool fixed it in 14 days.
Assume the tool solves the pain. Now ask: can your environment actually run this in practice?
What behavior change does this require of you? MedSynth AI requires opening a tab, reviewing suggested codes, and accepting or modifying them before finalizing your note. That is a workflow interruption. Will you genuinely do this at 11pm post-call? Be honest before recommending it to others.
What happens when it breaks? It is 2am Saturday. The tool is down. Notes are due in three hours. Do you code manually? Does the tool save drafts somewhere? Who do you call? Walk through every step. If you cannot answer this, your program is not ready to depend on the tool.
- "We will need custom code to connect it to our EHR" — how much? Who maintains that code after the vendor rep moves on?
- "It requires manual checking a few times a day" — if more than twice a week, the maintenance load is too high for a residency environment.
- "We have not decided who owns it yet" — no owner means dead in three months. Name a person today or do not start.
- "It touches Epic, our coding software, and our billing system" — three integration points = three failure surfaces. Is the value worth that complexity?
Do not say "security incident" or "downtime." Name the concrete worst case — what actually breaks, what the damage is, and who gets hurt.
Vague (wrong): "It might give us wrong information."
Specific (right): "MedSynth AI suggests a 99215 E&M code with a plausible-sounding MDM rationale for a case that is genuinely a 99213. I am post-call, I review it quickly, and I accept it without re-reading the MDM rationale. The claim is submitted. It is audited three months later. I face a billing compliance finding. My attending is named in the review. The hospital pays a retroactive recoupment."
That is concrete. Plausible. Real consequences for real people. Once you name it, probe three defenses before recommending deployment:
- Architectural safeguard: What prevents this from happening? (e.g., the tool flags high-complexity E&M suggestions for mandatory review; the MDM rationale cannot be auto-accepted without a physician read)
- Monitoring: What catches it quickly if the safeguard fails? (e.g., coding audit log reviewed weekly; denial rate tracked monthly with anomaly alerting)
- Survivability: If all safeguards fail simultaneously, can your program survive this? (A single note compliance finding is survivable with documentation; systematic overcoding across 60 residents for six months is not)
- "That probably will not happen" — probably is not an acceptable risk framework for billing compliance or patient safety.
- "We will deal with it if it comes up" — design around it now, or do not deploy.
- The worst case involves a HIPAA violation, fraudulent billing, or patient harm AND you have no architectural safeguard — do not deploy. Full stop.
- No monitoring that catches silent errors within 72 hours — AI tools that fail silently are the most dangerous class in clinical settings.
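The monitoring defense described above — a denial rate tracked monthly with anomaly alerting — can be as simple as comparing the current month against a rolling baseline. A sketch under stated assumptions: the 2-sigma threshold and the sample denial rates are hypothetical, and a real implementation would pull these figures from the billing system.

```python
from statistics import mean, stdev

# Simple anomaly check on a monthly claim-denial rate, one concrete form
# of the "monitoring" defense above. Threshold and data are hypothetical.

def denial_rate_alert(history: list[float], current: float, sigmas: float = 2.0) -> bool:
    """Alert when this month's denial rate exceeds baseline mean + N sigma."""
    baseline, spread = mean(history), stdev(history)
    return current > baseline + sigmas * spread

# Twelve pre-pilot months of denial rates (fractions of submitted claims)
history = [0.041, 0.038, 0.044, 0.040, 0.043, 0.039,
           0.042, 0.037, 0.045, 0.041, 0.040, 0.043]

print(denial_rate_alert(history, 0.044))  # within normal variation
print(denial_rate_alert(history, 0.061))  # post-pilot spike: investigate
```

A check like this catches the systematic failure mode (AI-assisted overcoding drifting the denial rate upward) faster than an annual audit would, which is the 72-hour silent-error standard the red flag above demands.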
Apply this framework directly to what you now know about OpenEvidence:
| Worst case scenario | OE architectural safeguard | Monitoring | Survivable? |
|---|---|---|---|
| Hallucinated drug dose cited confidently (BiPAP case) | Citation grounding only — does not verify dose against FDA label maximum | None published; depends on physician catching it | Depends on vigilance |
| Wrong E&M level + AI-written MDM rationale → audit triggered | CCI rules engine for code compatibility — not MDM accuracy | Standard billing audit; not AI-specific | Survivable but costly |
| Prior auth letter misses payer's step therapy requirement; patient waits weeks | None described for payer-specific denial criteria | Payer denial — delayed days to weeks | Patient harm risk from delay |
| Trainee enters patient name + MRN on free account without BAA | Privacy policy disclaims liability for non-BAA PHI input | None — OE cannot detect this in real time | HIPAA violation — not survivable cleanly |
The evaluation scorecard
Before recommending any AI tool to your program director, complete this scorecard. If any field is blank, the evaluation is not complete. Present this scorecard, not a paragraph of impressions, to your leadership.
Questions to ask the vendor before signing
| Question | Good answer | Red flag |
|---|---|---|
| "Show me observability from a real production deployment — not a demo." | Actual uptime logs, P95 latency, incident history from a comparable health system | "We can set up a demo" or "our platform is very reliable" |
| "Show me an actual export file of all data my institution generates." | A real export in a documented, portable format with field definitions | "We can discuss data access in our enterprise agreement" |
| "What broke for your last three customers who churned?" | Specific, honest post-mortems — what failed and what was done about it | "We have not had any churns" or deflection to a reference call |
| "What are your false positive and negative rates for [specific clinical function]?" | Published rates with confidence intervals, tuned to a similar clinical population | "Our accuracy is very high" without a number attached to it |
| "What is the cheapest way to get 80% of this value without your product?" | An honest, specific answer — a vendor confident in their differentiation can answer this directly | Offense, deflection, or "nothing else can do what we do" |
| "How is your advertising model governed relative to clinical content?" (OE-specific) | "The systems are architecturally separated, confirmed by this independent audit." | "We have a strict internal policy" without an external audit |
Recommend adoption only when you can honestly say yes to all of these. If you hesitate on any one, the default answer is no. Make the tool earn its way in.
If you apply this framework to OpenEvidence at your current rotation site, here is the honest scorecard:
- Pain point: real and measurable.
- Ownership: unresolved at most Louisiana sites — no named institutional owner, no confirmed BAA at any of the four systems covered in this report.
- Worst-case scenarios: documented in Section 17.
- Guardrails: partial and unaudited.
- 14-day metric: trackable.
- Pricing: zero for individual access.
- Lock-in risk: low for search; meaningfully higher if your institution embeds Coding Intelligence into its billing workflow.
Use this as your baseline. Update it as your rotation site's AI governance framework matures.
DISCLAIMER & INSTITUTIONAL STANDING
This assessment is provided strictly for educational and informational purposes. The analysis, failure taxonomy, and strategic evaluations contained herein represent the professional observations of the author and do not constitute an official report, mandate, or clinical directive from any institution.
Usage of any AI tool should follow individual health system policies. This document does not establish institutional policy for Ochsner, LCMC, FMOL, or Lake Charles Memorial.
This report synthesizes 18 internal analytical documents, publicly available press releases and news reporting, peer-reviewed studies available via PubMed and medRxiv, and current web-searched information on Louisiana health system AI deployments. Financial figures are from public reporting and have not been independently verified. OpenEvidence was contacted for comment; no response was received before publication. This report does not represent the institutional position of any Louisiana health system or GME program. It is prepared for educational purposes and does not constitute clinical, legal, investment, or regulatory advice. The analyst has no financial relationship with OpenEvidence or any competing platform.
Key data sources: OpenEvidence press releases (2024–2026); Sacra equity research; MobiHealthNews; Healthcare IT News; Fierce Healthcare; Becker's Hospital Review; HealthLeaders Media; Verite News; Nabla Technologies press releases; FMOL Health / Ochsner Health / LCMC Health public communications; Cambridge Health Alliance NCT07199231; medRxiv preprint (MedXpertQA evaluation); 2025 Physicians AI Report; 2026 Hospitalist Survey (JMIR); Epic Systems User Group Meeting announcements 2025–2026.