Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model.

Douglas Johnson, Rachel Goodman, J Patrinely, Cosby Stone, Eli Zimmerman, Rebecca Donald, Sam Chang, Sean Berkowitz, Avni Finn, Eiman Jahangir, Elizabeth Scoville, Tyler Reese, Debra Friedman, Julie Bastarache, Yuri van der Heijden, Jordan Wright, Nicholas Carter, Matthew Alexander, Jennifer Choe, Cody Chastain, John Zic, Sara Horst, Isik Turker, Rajiv Agarwal, Evan Osmundson, Kamran Idrees, Colleen Kiernan, Chandrasekhar Padmanabhan, Christina Bailey, Cameron Schlegel, Lola Chambless, Mike Gibson, Travis Osterman, Lee Wheless

Research Square 2023 February 28

BACKGROUND: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT's answers to medical queries are not known.

METHODS: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; range 1 - completely incorrect to 6 - completely correct) and completeness (3-point Likert scale; range 1 - incomplete to 3 - complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing.
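The two-group comparison named in the methods can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names `midranks` and `mann_whitney_u` are hypothetical, the scores in the usage lines are invented for illustration and are not study data, and the p-value uses a tie-corrected normal approximation without continuity correction, which is one common choice for ordinal Likert scores with many ties.

```python
import math
from statistics import mean, median


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p) where U is the smaller of the two U statistics.
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    # Tie correction: sum of (t^3 - t) over each group of t tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    sigma = math.sqrt(max(0.0, variance))
    if sigma == 0:
        return min(u1, u2), 1.0  # all values identical: no evidence of a shift
    z = (u1 - n1 * n2 / 2) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return min(u1, u2), p


# Invented 6-point accuracy scores, for illustration only (NOT study data).
binary_scores = [6, 6, 5, 6, 4, 6, 2, 6]
descriptive_scores = [5, 5, 6, 4, 5, 3, 6, 5]

print("binary: median", median(binary_scores), "mean", mean(binary_scores))
print("descriptive: median", median(descriptive_scores), "mean", mean(descriptive_scores))
print("U, p =", mann_whitney_u(binary_scores, descriptive_scores))
```

Reporting both median and mean, as the abstract does, is useful for scores like these because a few very low grades pull the mean down while leaving the median near the top of the scale.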

RESULTS: Across all questions (n=284), median accuracy score was 5.5 (between almost completely and completely correct) with mean score of 4.8 (between mostly and almost completely correct). Median completeness score was 3 (complete and comprehensive) with mean score of 2.5. For questions rated easy, medium, and hard, median accuracy scores were 6, 5.5, and 5 (mean 5.0, 4.7, and 4.6; p=0.05). Accuracy scores for binary and descriptive questions were similar (median 6 vs. 5; mean 4.9 vs. 4.7; p=0.07). Of 36 questions with scores of 1-2, 34 were re-queried/re-graded 8-17 days later with substantial improvement (median 2 vs. 4; p<0.01).

CONCLUSIONS: ChatGPT generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, although with important limitations. Further research and model development are needed to correct inaccuracies and for validation.


