AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

1Stanford University 2Johns Hopkins University 3Hospital Israelita Albert Einstein
Main Figure AgentClinic

AgentClinic turns static medical QA problems into interactive agent-based clinical environments, presenting a more clinically relevant challenge for medical language models.

Abstract

We present the first open-source benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information---such as which tests to perform---and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes over-rely on static medical question-answering benchmarks and fall short of the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image-and-dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. Agents in AgentClinic-MedQA are grounded in cases from the US Medical Licensing Exam (USMLE), while agents in AgentClinic-NEJM are grounded in multimodal New England Journal of Medicine (NEJM) case challenges. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of doctor agents, as well as reduced compliance, confidence, and willingness for follow-up consultation in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel on benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used for the patient agent is an important factor for performance in the AgentClinic benchmark. We also show that both too few and too many interactions reduce the diagnostic accuracy of doctor agents.

Incorporating Bias

AgentClinic allows for the perturbation of language agents via a set of 24 biases, which can be introduced to both the doctor and patient agents. The source code also makes it simple to integrate custom biases!
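As a rough illustration of how such a bias perturbation can work, the sketch below models each bias as an extra instruction injected into an agent's system prompt before the dialogue begins. The function and dictionary names (`apply_bias`, `BIAS_PROMPTS`) and the example bias texts are assumptions for illustration, not AgentClinic's actual API.

```python
# Hypothetical sketch: a "bias" as an extra instruction appended to
# an agent's system prompt. Names and prompt texts are illustrative
# assumptions, not AgentClinic's real implementation.

BIAS_PROMPTS = {
    "recency": (
        "You recently heard about a misdiagnosis in your area, "
        "which makes you distrust your doctor's judgment."
    ),
    "self_diagnosis": (
        "You have convinced yourself of a specific diagnosis from "
        "your own internet research and resist other explanations."
    ),
}

def apply_bias(system_prompt: str, bias_name: str) -> str:
    """Return the system prompt with the named bias injected,
    or unchanged when no bias is requested."""
    if bias_name is None or bias_name == "none":
        return system_prompt
    if bias_name not in BIAS_PROMPTS:
        raise ValueError(f"Unknown bias: {bias_name!r}")
    return system_prompt + "\n" + BIAS_PROMPTS[bias_name]
```

Registering a custom bias then amounts to adding one more entry to the prompt table, which is why integrating new biases stays simple.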

Bias Figure AgentClinic

LLM Benchmarking

AgentClinic allows for easy benchmarking of new language models. Currently, any model on HuggingFace, OpenAI, or Replicate can be tested simply by passing its model string for any of the agents in AgentClinic (doctor, patient, moderator, measurement)! Models hosted elsewhere are also supported through custom wrappers.
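One plausible way such custom wrappers can plug in alongside hosted backends is a small registry that maps a model string to a prompt-to-completion callable, with hosted providers as the fallback route. The names below (`register_wrapper`, `query_model`, `CUSTOM_WRAPPERS`) are assumptions for illustration, not AgentClinic's actual interface.

```python
# Hypothetical sketch of custom-wrapper support: registered wrappers
# take priority; anything else would be routed to a hosted backend
# (OpenAI / HuggingFace / Replicate), omitted here to stay offline.
from typing import Callable, Dict

# Maps a model string to a callable (prompt -> completion).
CUSTOM_WRAPPERS: Dict[str, Callable[[str], str]] = {}

def register_wrapper(model_str: str, fn: Callable[[str], str]) -> None:
    """Register a local/custom model behind the same string interface."""
    CUSTOM_WRAPPERS[model_str] = fn

def query_model(model_str: str, prompt: str) -> str:
    """Route a prompt to the wrapper registered for this model string."""
    if model_str in CUSTOM_WRAPPERS:
        return CUSTOM_WRAPPERS[model_str](prompt)
    # A real system would dispatch to a hosted API client here.
    raise NotImplementedError(
        f"Hosted backend call for {model_str!r} omitted in this sketch."
    )

# Usage: any agent (doctor, patient, moderator, measurement) can then
# be configured with "my-local-llm" like any hosted model string.
register_wrapper("my-local-llm", lambda prompt: "ECHO: " + prompt)
```

Keeping every agent parameterized by a plain model string is what makes mixing backends per agent (e.g. a hosted doctor with a local patient) trivial.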


BibTeX

@misc{schmidgall2024agentclinic,
      title={AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments}, 
      author={Samuel Schmidgall and Rojin Ziaei and Carl Harris and Eduardo Reis and Jeffrey Jopling and Michael Moor},
      year={2024},
      eprint={2405.07960},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}