Published in npj Digital Medicine (2026)

AgentClinic: a multimodal benchmark for tool-using clinical AI agents

1 Dept. of ECE, Johns Hopkins University; 2 Dept. of CS, Johns Hopkins University; 3 Dept. of BME, Johns Hopkins University; 4 Dept. of CS, Stanford University; 5 Dept. of Radiology, Stanford University; 6 Hospital Israelita Albert Einstein; 7 Dept. of Surgery, Johns Hopkins Hospital; 8 Dept. of Biosystems Science and Engineering, ETH Zürich

Figure 1. AgentClinic simulates clinical encounters with four interacting agents (doctor, patient, measurement, moderator) and equips the doctor agent with clinical tools. The right panel shows an example simulation where the doctor agent diagnoses cavernous sinus thrombosis through history-taking, scan requests, and multimodal data interpretation.

Abstract

Evaluating large language models (LLMs) in clinical scenarios is crucial to assessing their potential clinical utility. Existing benchmarks rely heavily on static question-answering, which does not accurately depict the complex, sequential nature of clinical decision-making. Here, we introduce AgentClinic, a multimodal agent benchmark for evaluating LLMs in simulated clinical environments that include patient interactions, multimodal data collection under incomplete information, and the use of various tools, resulting in an in-depth evaluation across nine medical specialties and seven languages. We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy. Overall, we observe that agents built on Claude-3.5 outperform other LLM backbones in most settings. Nevertheless, we see stark differences in the LLMs' ability to make use of tools such as experiential learning, adaptive retrieval, and reflection cycles. Strikingly, Llama-3 shows up to 92% relative improvement with the notebook tool, which allows for writing and editing notes that persist across cases. To further scrutinize our clinical simulations, we leverage real-world electronic health records, perform a clinical reader study, perturb agents with biases, and explore patient-centric metrics that this interactive environment enables.

LLM Benchmarking

AgentClinic enables comprehensive benchmarking of LLMs as clinical agents. We evaluate a suite of state-of-the-art models across three settings: AgentClinic-MedQA with varying doctor LLMs, AgentClinic-MedQA with varying patient LLMs, and AgentClinic-MIMIC-IV, which is built on real-world electronic health records. Claude-3.5 Sonnet achieves the highest diagnostic accuracy in most settings, while human physicians perform comparably. The choice of patient agent LLM is a significant factor: doctor performance varies substantially depending on which model simulates the patient.

Figure 2. Diagnostic accuracy across AgentClinic-MedQA (varying doctor and patient LLMs) and AgentClinic-MIMIC-IV (real-world EHR). Claude-3.5 Sonnet leads in most settings; the patient agent LLM significantly impacts doctor performance.

Static QA vs. Sequential Decision-Making

A core finding of AgentClinic is that static medical QA benchmarks dramatically overestimate clinical competence. When the same MedQA problems are presented in AgentClinic's sequential decision-making format, diagnostic accuracies drop substantially across all models, in some cases to below a tenth of the original accuracy. This gap highlights the importance of evaluating clinical AI in interactive, agent-based settings rather than relying on multiple-choice benchmarks alone.

Figure 3. Comparison of static MedQA accuracy (dashed) vs. AgentClinic-MedQA agentic accuracy (solid) across 11 LLMs. The sequential decision-making format reveals substantial performance drops for all models.
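
The difference between the two formats can be sketched in code. The following is a minimal illustration, not the benchmark's implementation: the `Case` fields, the action names (`ask`, `test`, `diagnose`), the string-matching `moderator`, and the scripted doctor policies are all hypothetical stand-ins for LLM-driven agents. It shows why the sequential format is harder: the doctor only sees information it explicitly asks for, and a premature guess is scored as a miss.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str, str]  # (action, argument, reply)
Policy = Callable[[List[Turn]], Tuple[str, str]]


@dataclass
class Case:
    """Hypothetical case record; the field names are illustrative only."""
    history: Dict[str, str]  # answers the patient agent can give, keyed by topic
    tests: Dict[str, str]    # results the measurement agent can return, keyed by test
    diagnosis: str           # ground truth, visible only to the moderator


def moderator(answer: str, truth: str) -> bool:
    """Stand-in for the moderator agent: a case-insensitive string match."""
    return answer.strip().lower() == truth.strip().lower()


def run_encounter(case: Case, doctor: Policy, max_turns: int = 20) -> bool:
    """Run one simulated encounter and return whether the diagnosis was correct."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        action, arg = doctor(transcript)
        if action == "diagnose":
            return moderator(arg, case.diagnosis)
        if action == "ask":      # routed to the patient agent
            reply = case.history.get(arg, "I'm not sure.")
        elif action == "test":   # routed to the measurement agent
            reply = case.tests.get(arg, "No result available.")
        else:
            raise ValueError(f"unknown action: {action!r}")
        transcript.append((action, arg, reply))
    return False  # turn budget exhausted without a diagnosis


# Toy case with made-up findings; only the diagnosis echoes the Figure 1 example.
case = Case(
    history={"headache": "It started a week ago and keeps getting worse."},
    tests={"MRI": "Filling defect in the cavernous sinus."},
    diagnosis="Cavernous sinus thrombosis",
)


def thorough_doctor(transcript: List[Turn]) -> Tuple[str, str]:
    """Scripted policy: take a history, order a scan, then diagnose."""
    script = [("ask", "headache"), ("test", "MRI"),
              ("diagnose", "cavernous sinus thrombosis")]
    return script[len(transcript)]


def hasty_doctor(transcript: List[Turn]) -> Tuple[str, str]:
    """Scripted policy: guess immediately without gathering information."""
    return ("diagnose", "migraine")


print(run_encounter(case, thorough_doctor))  # True
print(run_encounter(case, hasty_doctor))     # False
```

In a static QA setting the full vignette would be handed to the model at once; here the same facts exist in `case`, but the doctor policy has to decide what to request and when to stop.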

Cognitive and Implicit Bias Perturbation

AgentClinic embeds 24 cognitive and implicit biases in both doctor and patient agents to study realistic clinical interactions. Introducing biases leads to large reductions in diagnostic accuracy for doctor agents and, in patient agents, to lower compliance, lower confidence, and less willingness to return for follow-up consultations. The radar charts show that implicit biases (e.g., racial, gender, socioeconomic) and cognitive biases (e.g., confirmation, recency, frequency) affect GPT-4 and Mixtral-8x7B differently.

Figure 4. Impact of cognitive and implicit biases on AgentClinic-MedQA. Top: radar charts of normalized accuracy under doctor biases (implicit and cognitive) and patient biases. Bottom: patient-centric metrics (confidence, compliance, consultation willingness) under each bias type.
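
One way to picture the perturbation is as an extra instruction appended to an agent's system prompt, with accuracy then reported relative to the unbiased run. The sketch below rests on assumptions: the bias wording in `BIAS_PROMPTS` is hypothetical, `build_system_prompt` is an illustrative helper, and "normalized accuracy" is assumed here to mean biased accuracy divided by unbiased accuracy, which may not match the exact definition used for the figure.

```python
from typing import Dict, Optional, Sequence

# Hypothetical bias instructions; the benchmark's actual prompt wording may differ.
BIAS_PROMPTS: Dict[str, str] = {
    "recency": "You recently saw a patient with similar symptoms, "
               "and that case colors your judgment.",
    "confirmation": "You formed a first impression early and tend to "
                    "discount evidence against it.",
}


def build_system_prompt(base_prompt: str, bias: Optional[str] = None) -> str:
    """Return the agent's system prompt, optionally perturbed with one bias."""
    if bias is None:
        return base_prompt
    return f"{base_prompt}\n\n{BIAS_PROMPTS[bias]}"


def accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of cases diagnosed correctly."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def normalized_accuracy(biased: Sequence[bool], unbiased: Sequence[bool]) -> float:
    """Assumed normalization: biased accuracy relative to the unbiased baseline."""
    baseline = accuracy(unbiased)
    if baseline == 0:
        raise ValueError("baseline accuracy is zero; ratio is undefined")
    return accuracy(biased) / baseline


# Made-up per-case outcomes, for illustration only (not results from the paper).
unbiased_run = [True, True, True, False]
biased_run = [True, False, True, False]

print(build_system_prompt("You are a doctor.", "recency"))
print(f"{normalized_accuracy(biased_run, unbiased_run):.2f}")  # 0.67
```

A value below 1.0 under this assumed definition means the bias reduced accuracy relative to the unperturbed agent.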

Multilingual, Multi-Specialty, and Tool-Augmented Evaluation

AgentClinic provides evaluation across seven languages (AgentClinic-Lang), nine medical specialties (AgentClinic-Spec), and five agent tools (including experiential learning, adaptive RAG, reflection cycles, and a persistent notebook). Claude-3.5 leads in most languages and specialties, while tool effectiveness varies dramatically by model. Strikingly, Llama-3 shows up to 92% relative improvement with the notebook tool that allows writing and editing notes that persist across cases.

Figure 5. Diagnostic accuracy across languages (left), medical specialties (center), and agent tools (right) for six LLMs. Claude-3.5 dominates most axes; tool augmentation reveals stark model-dependent differences.
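
To make the notebook tool and the "relative improvement" figure concrete, here is a minimal sketch under stated assumptions. The `Notebook` class, its `write`/`edit`/`render` methods, and the placeholder notes are hypothetical and are not the benchmark's interface; the accuracies passed to `relative_improvement` are made-up numbers chosen only to show what a 92% relative gain means arithmetically.

```python
from typing import List


class Notebook:
    """Hypothetical persistent notebook: notes survive from one case to the next."""

    def __init__(self) -> None:
        self._notes: List[str] = []

    def write(self, text: str) -> int:
        """Append a note and return its index."""
        self._notes.append(text)
        return len(self._notes) - 1

    def edit(self, index: int, text: str) -> None:
        """Overwrite an existing note in place."""
        self._notes[index] = text

    def render(self) -> str:
        """Format all notes for inclusion in the agent's prompt."""
        return "\n".join(f"[{i}] {note}" for i, note in enumerate(self._notes))


def relative_improvement(baseline: float, with_tool: float) -> float:
    """Gain expressed as a fraction of the baseline accuracy."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (with_tool - baseline) / baseline


notebook = Notebook()
seen_context: List[str] = []
for case_id in ("case-1", "case-2"):
    # Notes written during earlier cases are visible at the start of this one.
    seen_context.append(notebook.render())
    # ... the doctor agent would work through the case here ...
    notebook.write(f"Lesson from {case_id}")  # placeholder note

notebook.edit(0, "Lesson from case-1 (revised)")
print(notebook.render())

# Illustrative accuracies only (not results from the paper): 25% -> 48%.
print(f"{relative_improvement(25.0, 48.0):.0%}")  # 92%
```

Note that a relative improvement is measured against the baseline, so a large percentage can correspond to a modest absolute gain when the baseline accuracy is low.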

Multimodal Evaluation: AgentClinic-NEJM

AgentClinic-NEJM evaluates LLMs on multimodal clinical cases from the New England Journal of Medicine case challenges, requiring agents to interpret medical images (radiology, pathology, dermatology) alongside dialogue. We compare performance at two stages: at initial presentation and after measurement (imaging) data is provided. Claude-3.5 Sonnet again leads, but all models show notably lower accuracy on these challenging multimodal cases than in the text-only settings.

Figure 6. AgentClinic-NEJM multimodal benchmark. Left: the doctor agent receives medical images (radiology, pathology) alongside patient context. Right: accuracy at initial presentation vs. after measurement data, comparing Claude-3.5, GPT-4, GPT-4o, and GPT-4o-mini.

BibTeX

@article{schmidgall2026agentclinic,
  title={AgentClinic: a multimodal benchmark for tool-using clinical AI agents},
  author={Schmidgall, Samuel and Ziaei, Rojin and Harris, Carl and Kim, Ji Woong and Reis, Eduardo Pontes and Jopling, Jeffrey and Moor, Michael},
  journal={npj Digital Medicine},
  year={2026},
  publisher={Nature Publishing Group},
  doi={10.1038/s41746-026-02674-7},
  url={https://www.nature.com/articles/s41746-026-02674-7}
}