Performance of ChatGPT on free-response, clinical reasoning exams. medRxiv: the preprint server for health sciences. Strong, E., DiGiammarino, A., Weng, Y., Basaviah, P., Hosamani, P., Kumar, A., Nevins, A., Kugler, J., Hom, J., Chen, J. H. 2023.

Abstract

Background: Studies show that ChatGPT, a general-purpose large language model chatbot, can pass the multiple-choice United States Medical Licensing Examinations, but the model's performance on open-ended clinical reasoning is unknown.
Objective: To determine whether ChatGPT can consistently meet the passing threshold on free-response, case-based clinical reasoning assessments.
Design: Fourteen multi-part cases were selected from clinical reasoning exams administered to pre-clerkship medical students between 2019 and 2022. For each case, the questions were run through ChatGPT twice and the responses were recorded. Two clinician educators independently graded each run according to a standardized grading rubric. To further assess the variation in ChatGPT's performance, the analysis was repeated on a single high-complexity case 20 times.
Setting: A single US medical school.
Exposure: ChatGPT.
Main Outcomes and Measures: Passing rate of ChatGPT's scored responses and the range of model performance across multiple run-throughs of a single case.
Results: Twelve of the 28 ChatGPT exam responses achieved a passing score (43%), with a mean score of 69% (95% CI: 65% to 73%) against the established passing threshold of 70%. When given the same case 20 separate times, ChatGPT's performance varied, with scores ranging from 56% to 81%.
Conclusions: ChatGPT's ability to achieve a passing performance in nearly half of the cases analyzed demonstrates the need to revise clinical reasoning assessments and to incorporate artificial intelligence (AI)-related topics into medical curricula and practice.

DOI: 10.1101/2023.03.24.23287731

PubMed ID: 37034742

PubMed Central ID: PMC10081420