Study finds leading AI models struggle to identify genetic conditions from patients' written descriptions

The following is translated from the original:

Researchers at the National Institutes of Health (NIH) have found that while artificial intelligence (AI) tools can make accurate diagnoses based on textbook-like descriptions of genetic diseases, they are significantly less accurate when analyzing summaries written by patients about their own health.

The findings, reported in the American Journal of Human Genetics, suggest that these artificial intelligence tools need to be improved before they can be used in healthcare settings to help diagnose and answer patient questions.

The researchers studied a type of artificial intelligence called a large language model, which is trained on large amounts of text-based data. These models have the potential to be very helpful in medicine because they can analyze and answer questions, and their interfaces are generally user-friendly.

“We may not always think so, but a lot of medicine is text-based,” said Ben Solomon, M.D., senior author of the study and clinical director of the National Human Genome Research Institute (NHGRI) at the National Institutes of Health.

“For example, electronic health records and conversations between doctors and patients are made up of words. Large language models are a huge leap forward for artificial intelligence, and being able to analyze words in clinically useful ways could be incredibly transformative.”

The researchers tested 10 different large language models, including the two latest versions of ChatGPT. They designed questions about 63 different genetic conditions based on medical textbooks and other references. These included some well-known diseases, such as sickle cell anemia, cystic fibrosis, and Marfan syndrome, as well as many rare genetic conditions.

These conditions can present in many different ways across patients, and the researchers aimed to capture some of the most common possible symptoms.

They chose three to five symptoms for each condition and posed the question in a standard format: “I have X, Y, and Z symptoms. What is the most likely genetic condition?”
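The standardized question format described above can be sketched as a simple prompt template. This is an illustrative example only; the function name and the symptom list below are made up, not taken from the study.

```python
# Hypothetical sketch of the study's standardized prompt format.
# build_prompt and the example symptoms are illustrative assumptions.
def build_prompt(symptoms):
    """Join a list of symptoms into the standard question format."""
    if len(symptoms) == 1:
        listed = symptoms[0]
    elif len(symptoms) == 2:
        listed = " and ".join(symptoms)
    else:
        # Oxford-comma list for three or more symptoms
        listed = ", ".join(symptoms[:-1]) + ", and " + symptoms[-1]
    return f"I have {listed} symptoms. What is the most likely genetic condition?"

print(build_prompt(["chest pain", "tall stature", "lens dislocation"]))
# -> "I have chest pain, tall stature, and lens dislocation symptoms.
#     What is the most likely genetic condition?"
```

In the study, a prompt like this was sent to each of the 10 models and the response was scored for whether it named the correct condition.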

When asked these questions, the large language models varied widely in their ability to identify the correct genetic diagnosis, with initial accuracy ranging from 21% to 90%. The best-performing model was GPT-4, one of the latest versions of ChatGPT.

A model's success generally corresponded to its size, which reflects the amount of data used to train it. The smallest models had a few billion parameters, while the largest had more than a trillion.

For many of the lower-performing models, the researchers were able to improve accuracy in follow-up experiments, and overall these models still provided more accurate responses than non-AI technologies, including standard Google searches.

The researchers optimized and tested the models in a variety of ways, including replacing medical terms with more everyday language. For example, instead of saying a child has “macrocephaly,” a question might say the child has “a large head,” which more closely reflects how a patient or caregiver would describe the symptom to a doctor.
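The term substitution described here can be sketched as a small lookup-and-replace step applied to a question before it is sent to a model. The term map below is a made-up illustration, not the researchers' actual word list.

```python
# Illustrative sketch of swapping clinical terms for lay language.
# LAY_TERMS is an assumed example mapping, not the study's list.
LAY_TERMS = {
    "macrocephaly": "a large head",
    "hypotonia": "weak muscle tone",
    "polydactyly": "extra fingers or toes",
}

def to_lay_language(question):
    """Replace known medical terms with everyday phrasing."""
    for term, lay in LAY_TERMS.items():
        question = question.replace(term, lay)
    return question

print(to_lay_language("I have macrocephaly and hypotonia."))
# -> "I have a large head and weak muscle tone."
```

Running both versions of each question lets the two accuracy figures be compared directly, which is how the study measured the cost of removing medical terminology.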

Overall, accuracy declined when the medical terminology was removed. However, with the lay-language questions, seven of the ten models were still more accurate than Google searches.

“It’s important that people without medical knowledge can use these tools,” said Kendall Flaharty, a post-bachelor fellow at NHGRI who led the study.

“There are not many clinical geneticists in the world, and in some states and countries, people don’t have access to these experts. Artificial intelligence tools can help people get answers to some questions without having to wait years for appointments.”

To test how the large language models performed with information from real patients, the researchers asked patients at the NIH Clinical Center to provide short descriptions of their own genetic conditions and symptoms. These descriptions ranged from one sentence to several paragraphs and were more varied in style and content than the textbook-based questions.

When given these real patient descriptions, the best-performing model made an accurate diagnosis only 21% of the time. Many models performed far worse, with accuracy as low as 1%.

The researchers expected the patient-written summaries to be more challenging because patients at the NIH Clinical Center often have extremely rare conditions, so the models may not have had enough information about these conditions to diagnose them.

However, accuracy improved when the researchers wrote standardized questions for the same extremely rare conditions found in the NIH patients. This suggests the models struggle to interpret the variable wording and format of patient-written descriptions, possibly because they are trained on textbooks and other reference materials, which tend to be more concise and standardized.

“For these models to be clinically useful in the future, we need more data, and this data needs to reflect patient diversity,” Dr. Solomon said.

“We need to represent not only all known medical conditions, but also variation in age, race, gender, cultural background, and so on, so that the data captures the diversity of patient experiences. These models can then learn how different people might talk about their conditions.”

In addition to showing areas for improvement, the study also highlights the limitations of current large language models and the continuing need for human supervision when AI is applied to health care.

“These techniques are already being promoted in clinical settings,” Dr. Solomon added. “The biggest question is no longer whether clinicians will use AI, but where and how clinicians should use AI, and where we should not use AI to provide the best care for patients.”

If you want to learn more, you can click the link below the video.
Thank you for watching this video. If you enjoyed it, please like and subscribe.

Original text: https://medicalxpress.com/news/2024-08-ai-struggle-genetic-conditions-patient.html

More information: Evaluating Large Language Models on Medical, Lay Language, and Self-Reported Descriptions of Genetic Conditions, The American Journal of Human Genetics (2024). DOI: 10.1016/j.ajhg.2024.07.011. www.cell.com/ajhg/fulltext/S0002-9297(24)00255-6

Journal information: American Journal of Human Genetics

YouTube:
