Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study.

Ethan Goh, Robert Gallo, Jason Hom, Eric Strong, Yingjie Weng, Hannah Kerman, Josephine Cool, Zahir Kanjee, Andrew S Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew P J Olson, Adam Rodman, Jonathan H Chen

medRxiv 2024 March 15

IMPORTANCE: Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning.

OBJECTIVE: To assess the impact of the GPT-4 LLM on physicians' diagnostic reasoning compared to conventional resources.

DESIGN: Multi-center, randomized clinical vignette study.

SETTING: The study was conducted using remote video conferencing with physicians across the country and in-person participation across multiple academic medical institutions.

PARTICIPANTS: Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine.

INTERVENTIONS: Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to just conventional resources. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams.

MAIN OUTCOMES AND MEASURES: The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis.

RESULTS: 50 physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group.

CONCLUSIONS AND RELEVANCE: In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.

Full text links

We have located links that may give you full text access.

Show additional links to paperHide additional links to paper

PubMed

Add to Saved Papers

Get 1-tap access

Related Resources

For the best experience, use the Read mobile app

Get seemless 1-tap access through your institution/university

For the best experience, use the Read mobile app

All material on this website is protected by copyright, Copyright © 1994-2024 by WebMD LLC.
This website also contains material copyrighted by 3rd parties.

By using this service, you agree to our terms of use and privacy policy.

Your Privacy Choices

You can now claim free CME credits for this literature searchClaim now

Get seemless 1-tap access through your institution/university

For the best experience, use the Read mobile app