Vision-based ChatGPT shows flaws in interpreting radiographic images

Researchers evaluating the performance of ChatGPT-4 Vision found that the model performed well on text-based radiology exam questions, but had difficulty answering image-related questions accurately. The study results were published in the journal Radiology.

ChatGPT-4 Vision is the first version of the large language model that can interpret both text and images.

“ChatGPT-4 has shown promise in assisting radiologists with tasks such as streamlining patient-facing radiology reports and identifying appropriate protocols for imaging examinations,” said Chad Klochko, M.D., a musculoskeletal radiologist and artificial intelligence (AI) researcher at Henry Ford Health in Detroit, Michigan. “With its image-processing capabilities, GPT-4 Vision opens up new potential applications in radiology.”

For this study, Dr. Klochko’s team used retired questions from the American College of Radiology’s Diagnostic Radiology In-Training Examination, a series of tests used to benchmark the progress of radiology residents. After excluding duplicates, the researchers used 377 questions across 13 domains, including 195 text-only questions and 182 questions containing an image.

GPT-4 Vision correctly answered 246 of the 377 questions, for an overall score of 65.3%. The model correctly answered 81.5% (159 of 195) of the text-only questions and 47.8% (87 of 182) of the questions containing an image.
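
For reference, the reported accuracy figures follow directly from the raw counts. A minimal sketch in Python, using only the counts given above:

    # Accuracy figures recomputed from the counts reported in the study.
    results = {
        "overall": (246, 377),
        "text_only": (159, 195),
        "with_image": (87, 182),
    }
    for name, (correct, total) in results.items():
        print(f"{name}: {correct}/{total} = {correct / total:.1%}")
    # overall: 246/377 = 65.3%
    # text_only: 159/195 = 81.5%
    # with_image: 87/182 = 47.8%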

“The 81.5% accuracy on text-only questions mirrors the performance of the model’s predecessor,” he said. “This consistency on text-based questions may suggest that the model has a degree of textual understanding in radiology.”

Genitourinary radiology was the only subspecialty in which GPT-4 Vision performed better on questions with images (67%, or 10 of 15) than on text-only questions (57%, or 4 of 7). The model performed better on text-only questions in all other subspecialties.

The model performed best on image-based questions in the thoracic and genitourinary subspecialties, correctly answering 69% and 67% of the image-containing questions, respectively. It performed worst on image-containing questions in nuclear medicine, answering only 2 of 10 correctly.

The study also evaluated the impact of different prompts on GPT-4 Vision’s performance:

Original: You are taking a radiology exam. An image for the question will be uploaded. Choose the correct answer for each question.
Basic: Choose the single best answer to the following retired radiology exam question.
Short instruction: This is a retired radiology exam question meant to gauge your medical knowledge. Choose the single best answer letter and do not provide any reasoning for your answer.
Long instruction: You are a board-certified diagnostic radiologist taking an examination. Evaluate each question carefully, and if the question also contains an image, evaluate the image carefully in order to answer the question. Your response must include a single best answer choice. Failure to provide an answer choice will be counted as incorrect.
Chain of thought: You are taking a retired board exam for research purposes. Given the provided image, think step by step about the question provided.
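
As an illustration only, an evaluation over these prompt styles could be scripted against a vision-capable chat API. The sketch below uses the OpenAI Python client; the model identifier, the helper name, and the exact prompt wording are assumptions made for illustration, not details taken from the study.

    # Sketch: pose one exam question under a chosen prompt style.
    # Prompt wording is paraphrased from the list above; the model name and
    # client usage are assumptions, not the study's own code.
    import base64
    from openai import OpenAI

    PROMPTS = {
        "original": "You are taking a radiology exam. An image for the question "
                    "will be uploaded. Choose the correct answer for each question.",
        "basic": "Choose the single best answer to the following retired "
                 "radiology exam question.",
        "short_instruction": "This is a retired radiology exam question to gauge "
                             "your medical knowledge. Choose the single best answer "
                             "letter and do not provide any reasoning.",
        "long_instruction": "You are a board-certified diagnostic radiologist taking "
                            "an examination. Evaluate each question carefully, and if "
                            "it contains an image, evaluate the image carefully. Your "
                            "response must include a single best answer choice.",
        "chain_of_thought": "You are taking a retired board exam for research "
                            "purposes. Given the provided image, think step by step "
                            "about the question.",
    }

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(question_text: str, image_path: str | None, style: str) -> str:
        """Send one question (optionally with an image) using the given prompt style."""
        content = [{"type": "text", "text": PROMPTS[style] + "\n\n" + question_text}]
        if image_path:
            with open(image_path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
            content.append({"type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{b64}"}})
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",  # assumed model identifier
            messages=[{"role": "user", "content": content}],
        )
        return resp.choices[0].message.content

A full evaluation in this spirit would loop such a call over all 377 questions for each prompt style and score the returned answer letters against the answer key.
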
Although the model correctly answered 183 of the 265 questions it was given with the basic prompt, it declined to answer 120 questions, most of which contained an image.

“The refusal to answer questions was something we had never seen when we first explored the model,” Dr. Klochko said.

The short-instruction prompt produced the lowest accuracy (62.6%).

On text-based questions, chain-of-thought prompting outperformed the long-instruction prompt by 6.1%, the basic prompt by 6.8%, and the original prompting style by 8.9%. There was no evidence that any two prompt styles performed differently on image-based questions.

“Our study showed evidence of hallucinatory responses when the model interpreted image findings,” Dr. Klochko said. “We noted a concerning tendency for the model to provide correct diagnoses based on incorrect image interpretations, which could have significant clinical implications.”

Dr. Klochko said his findings highlight the need for more specialized and rigorous evaluation methods to assess the performance of large language models on radiology tasks.

“Given the current challenges in accurately interpreting key radiologic images and the tendency toward hallucinatory responses, the applicability of GPT-4 Vision in information-critical fields such as radiology is limited in its current state,” he said.

For more content, click the link below the video.
Thank you for watching. If you enjoyed this video, please like and subscribe.

Original text: https://medicalxpress.com/news/2024-09-vision-based-chatgpt-deficits-radiologic.html
More information: Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions, Radiology (2024).
Journal Information: Radiology
