Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability.
Cao, J. J., Kwon, D. H., Ghaziani, T. T., Kwo, P., Tse, G., Kesselman, A., Kamaya, A., Tse, J. R. Abdominal radiology (New York), 2024.
Abstract
To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true, but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. Responses were considered as a whole accurate if mean score was > 0 and reliable if mean score was > 0 across all responses for the single question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.

Of the twenty questions, ChatGPT answered nine (45%), Gemini answered 12 (60%), and Bing answered six (30%) questions accurately; however, only six (30%), eight (40%), and three (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between any chatbot. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p …
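The per-question scoring rule described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function `classify_question`, the label strings, and the sample ratings are hypothetical. Because the abstract states the same > 0 threshold for both criteria, the sketch assumes that "accurate" means the mean over all scores for a question is > 0, while "reliable" means that each of the three repeated responses, averaged over its raters, also has a mean > 0.

```python
from statistics import mean, stdev

# Score values from the Methods: accurate = 1, inadequate = 0, inaccurate = -1.
SCORES = {"accurate": 1, "inadequate": 0, "inaccurate": -1}


def classify_question(ratings_per_run: list[list[str]]) -> dict:
    """Hypothetical helper: ratings_per_run[i] holds the rater labels for the
    i-th repeated response to one question.

    Assumption: 'accurate' = mean of all scores > 0;
                'reliable' = every repeated response's mean score > 0.
    """
    run_scores = [[SCORES[label] for label in run] for run in ratings_per_run]
    all_scores = [score for run in run_scores for score in run]
    overall = mean(all_scores)
    return {
        "mean": overall,
        "sd": stdev(all_scores),
        "accurate": overall > 0,
        "reliable": all(mean(run) > 0 for run in run_scores),
    }


# Hypothetical ratings: three repeated responses, six raters each.
example = [
    ["accurate"] * 6,
    ["accurate"] * 4 + ["inadequate"] * 2,
    ["accurate"] * 2 + ["inaccurate"] * 3 + ["inadequate"],
]
result = classify_question(example)
```

With these made-up ratings the overall mean is 0.5, so the question counts as accurate, but the third response averages -1/6, so under this reading it is not reliable. This mirrors how a question can be accurate without being both accurate and reliable.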
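Both readability metrics named in the Methods are closed-form formulas over word, sentence, and syllable counts. The sketch below uses the standard published coefficients. Counting syllables is the error-prone step in practice, so these hypothetical functions take the counts as inputs rather than parsing text, and the band labels follow the conventional Flesch interpretation table, not anything specified in the abstract.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Flesch Reading Ease: higher scores mean easier text (roughly 0 to 100).
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Flesch-Kincaid Grade Level: approximate U.S. school grade needed.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fre_band(score: float) -> str:
    # Conventional Flesch Reading Ease bands, checked from the easiest down;
    # anything below 30 falls into the hardest band.
    bands = [
        (90, "5th grade"),
        (80, "6th grade"),
        (70, "7th grade"),
        (60, "8th-9th grade"),
        (50, "10th-12th grade"),
        (30, "college"),
    ]
    for lower, label in bands:
        if score >= lower:
            return label
    return "college graduate"


# Hypothetical counts for one chatbot response.
fre = flesch_reading_ease(words=100, sentences=5, syllables=170)
fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=170)
```

With these made-up counts the Reading Ease score is about 42.7 (the "college" band) and the grade level is about 12.3. Under this banding, the reported means of 29, 30, and 40 map to "college graduate", "college", and "college", matching the labels in the abstract.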
DOI: 10.1007/s00261-024-04501-7
PubMed ID: 39088019
PubMed Central ID: 10366809