July 14, 2023

Benchmarking AI's ability to answer medical questions

A benchmark for assessing how well large language models (LLMs) can answer medical questions is presented in a paper published in Nature. The study, from Google Research, also introduces Med-PaLM, an LLM specialized for the medical domain. The authors note, however, that many limitations must be overcome before LLMs can become viable for clinical applications.

Artificial intelligence (AI) models have potential uses in medicine, including knowledge retrieval and clinical decision support. However, existing models may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities, so their clinical knowledge needs to be assessed. Such assessments typically rely on automated evaluations on limited benchmarks, such as scores on individual medical tests, which may not translate to real-world reliability or value.

To evaluate how well LLMs encode clinical knowledge, Karan Singhal, Shekoofeh Azizi, Tao Tu, Alan Karthikesalingam, Vivek Natarajan and colleagues considered the ability of these models to answer medical questions.

The authors present a benchmark called MultiMedQA, which combines six existing question-answering datasets spanning professional medicine, research and consumer queries with HealthSearchQA, a new dataset of 3,173 medical questions commonly searched online.

The authors then evaluated the performance of PaLM (a 540-billion-parameter LLM) and its variant, Flan-PaLM. They found that Flan-PaLM achieved state-of-the-art performance on several of the datasets. On the MedQA dataset, which comprises US Medical Licensing Exam-style questions, Flan-PaLM exceeded previous state-of-the-art LLMs by more than 17%. However, while Flan-PaLM performed well on multiple-choice questions, human evaluation revealed gaps in its long-form answers to consumer medical questions.

To address these gaps, the authors used a technique called instruction prompt tuning to further adapt Flan-PaLM to the medical domain. They introduce instruction prompt tuning as an efficient approach for aligning generalist LLMs to new specialist domains.

Their resulting model, Med-PaLM, performed encouragingly in the pilot evaluation. For example, a panel of clinicians judged only 61.9% of Flan-PaLM long-form answers to be aligned with the scientific consensus, compared with 92.6% for Med-PaLM answers, on par with clinician-generated answers (92.9%). Similarly, 29.7% of Flan-PaLM answers were rated as potentially leading to harmful outcomes, in contrast to 5.8% for Med-PaLM, comparable with clinician-generated answers (6.5%).

The authors note that while their results are promising, further evaluations are necessary.

More information: Karan Singhal et al, Large language models encode clinical knowledge, Nature (2023). DOI: 10.1038/s41586-023-06291-2

Journal information: Nature

Provided by Nature Publishing Group

Citation: Benchmarking AI's ability to answer medical questions (2023, July 14) retrieved 28 April 2024 from https://medicalxpress.com/news/2023-07-benchmarking-ai-ability-medical.html

