Dangerous Diagnoses? GPT-4V’s Role in Medical Image Interpretation

Authors:

(1) Senthujan Senkaiahliyan M. Mgt, is with the Institute for Health Policy Management and Evaluation, Faculty of Public Health, University of Toronto and Peter Munk Cardiac Centre, University Health Network, Toronto ON, Canada;

(2) Augustin Toma MD, is with the Department of Medical Biophysics, Faculty of Medicine, University of Toronto, Toronto, ON, Canada;

(3) Jun Ma PhD, is with Peter Munk Cardiac Centre, University Health Network; Department of Laboratory Medicine and Pathobiology, University of Toronto; Vector Institute, Toronto, ON Canada;

(4) An-Wen Chan MD, is with the Institute for Health Policy Management and Evaluation, Faculty of Public Health and with the Division of Dermatology, Department of Medicine, University of Toronto, Toronto, ON, Canada;

(5) Andrew Ha MD, is with Peter Munk Cardiac Centre, University Health Network and the Division of Cardiology, Department of Medicine, University of Toronto, Toronto, ON, Canada;

(6) Kevin R. An MD, is with the Division of Cardiac Surgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;

(7) Hrishikesh Suresh MD, is with the Division of Neurosurgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;

(8) Barry Rubin MD, is with Peter Munk Cardiac Centre, University Health Network and the Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;

(9) Bo Wang PhD (Corresponding Author) is with Peter Munk Cardiac Centre, University Health Network; Department of Laboratory Medicine and Pathobiology and Department of Computer Science, University of Toronto; Vector Institute, Toronto, Canada. E-mail: [email protected].

Abstract and 1. Introduction: GPT-4V(ision)

2. Data Collection

3. Experimental Setup

4. Results

5. Discussion and Limitations, and References

Supplementary Notes

4. RESULTS

4.1 Performance on Multimodal Images

For multimodal images (Table 2), a total of 69 images were assessed. Several images were accompanied by multiple prompts, each of which underwent a separate assessment. The correct diagnostic label for each image was provided to the clinician evaluator to ensure accuracy in assessment. Clinician evaluators were asked to judge whether GPT-4V interpreted each image correctly and whether the interpretation given was safe for patient care. The clinicians' average comfort level with letting medical students learn from these interpretations was 1.8 ± 1.4 on a scale of 1-5. Of the 69 images, only 15 were interpreted correctly and accompanied by correct advice, while in a concerning 30 of 69 cases dangerous advice was provided. The images spanned various modalities (Table 1), including CT scans of various body parts, ECG, MRI, CXR, and others.

TABLE 2. Multimodal Images Summary of Results.
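For readers who wish to reproduce this style of aggregation, the minimal sketch below computes the same kind of summary statistics reported above (mean ± SD comfort rating, number of correct interpretations, and number of dangerous-advice cases) from per-image evaluator ratings. The record layout and field names are illustrative assumptions, not the study's actual data format.

```python
import statistics

# Hypothetical per-image evaluator records: comfort rating (1-5),
# whether the interpretation was correct, and whether the advice
# was judged dangerous. Values shown are illustrative only.
ratings = [
    {"comfort": 2, "correct": True,  "dangerous": False},
    {"comfort": 1, "correct": False, "dangerous": True},
    {"comfort": 4, "correct": True,  "dangerous": False},
    # ... one record per assessed image
]

comfort_scores = [r["comfort"] for r in ratings]
mean_comfort = statistics.mean(comfort_scores)
sd_comfort = statistics.stdev(comfort_scores)   # sample standard deviation
n_correct = sum(r["correct"] for r in ratings)
n_dangerous = sum(r["dangerous"] for r in ratings)

print(f"Comfort: {mean_comfort:.1f} ± {sd_comfort:.1f} (scale 1-5)")
print(f"Correct interpretations: {n_correct}/{len(ratings)}")
print(f"Dangerous advice: {n_dangerous}/{len(ratings)}")
```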

4.2 Performance on Electrocardiograms (Cardiology)

For ECG images (Table 3), 24 images were examined. The overall interpretation of these images had an average rating of 2.25 ± 1.07 out of 5. Notably, none of these interpretations matched the competence of standard automated ECG reads as determined by the cardiac electrophysiologist. Out of the 24, only 3 responses were considered helpful for medical student learning, and in 9 cases, dangerous advice for patient care was given.

TABLE 3. ECG Summary of Results.

4.3 Performance on Clinical Photos (Dermatology)

For the 49 dermatology images (Table 4), the average quality rating of the layman's description of the rash was 3 ± 1.55 out of 5. The medical descriptions and differential diagnoses of the rash averaged 2.5 ± 1.49 and 2 ± 1.46 out of 5, respectively. The comfort level with using GPT-4V as an education tool for medical students averaged 2 ± 1.4 out of 5. In addition, the dermatologist described the differential diagnoses as lacking depth and containing inaccuracies or irrelevant conditions.

TABLE 4. Clinical Photos Summary of Results.

Figure 2 presents direct examples of GPT-4V's responses to images used in the evaluation, along with clinician comments. In both cases shown, the clinician comments indicate that GPT-4V provided inaccurate advice that could impact patient care.

Fig. 2. Evaluation of GPT-4V's Interpretations on Medical Images with Expert Feedback.

5. DISCUSSION AND LIMITATIONS

While GPT-4V demonstrates moderate proficiency in processing diverse medical imaging modalities and identifying specific features, the model occasionally fails to recognize overt findings. In addition, the public-facing version of GPT-4V is aligned to avoid providing explicit directives, which may have impacted its performance on certain medical tasks.

Nevertheless, this evaluation of GPT-4V is not without limitations. First, our use of publicly available images, which may have been part of the model's training data, should in theory have augmented its performance; yet GPT-4V performed poorly even on these images, raising concerns about the depth and diversity of its training dataset. Second, because we provided GPT-4V with standalone images devoid of broader clinical context, we expected clinicians to account for this when evaluating the model's efficacy. Diagnoses are not formed solely from a single image, and in the absence of patient history, GPT-4V's output should be evaluated with this consideration in mind.

The most glaring concern lies in the model's accuracy, particularly with ECG interpretations. Instances where GPT-4V misinterprets severe conditions as benign pose a significant risk to patient care. Without insight into the training datasets, a comprehensive evaluation will need to be conducted to uncover any harms from misrepresentation or potential bias. Based on our evaluation of GPT-4V's performance, proprietary LLM developers should strongly consider aligning with open-source principles. This is particularly crucial as many healthcare institutions are exploring collaborations with these developers for deployment in clinical and operational environments [5]. The Department of Health and Human Services in the United States is spearheading initiatives in this area, emphasizing the necessity of diverse and representative training data to ensure the ethical application of AI [6].

While LLMs have demonstrated the capability to tailor their responses based on user input and changing contexts, it is noteworthy that our assessment was conducted during GPT-4V's initial selective release. Since then, guardrails appear to have been implemented to keep responses related to medical images generalized and descriptive rather than prescriptive.

Newer LLMs are being designed to address specific challenges within the medical field. One example is Clinical Camel, a model fine-tuned on medical datasets that significantly outperforms its pre-trained base model on clinical inquiries [7]. With these developments, there is untapped potential for such models to become multimodal, offering a chance to develop comprehensive tools that support healthcare professionals, provided they undergo thorough evaluation and validation in real-world clinical settings.

Considering the enthusiasm around large language models (LLMs) and the suggestion that they will revolutionize the medical sphere, in our view GPT-4V's current performance does not support those claims. Our human evaluation reinforces the advice of healthcare regulatory bodies and of OpenAI itself not to use the model as a substitute for clinician-based decision making [3]. While GPT-4V's functionality as a multimodal foundation model, capable of processing both text and image inputs, is noteworthy, significant concerns remain in its current form regarding its diagnostic accuracy and its ability to interpret various medical image modalities.

REFERENCES

[1] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, “Large language models in medicine,” Nature Medicine, vol. 29, no. 8, pp. 1930–1940, 2023.

[2] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, 2023.

[3] OpenAI, "GPT-4V(ision) system card," 2023.

[4] J. N. Acosta, G. J. Falcone, P. Rajpurkar, and E. J. Topol, "Multimodal biomedical AI," Nature Medicine, vol. 28, no. 9, pp. 1773–1784, 2022.

[5] A. J. Nashwan, A. A. AbuJaber, and A. AbuJaber, "Harnessing the power of large language models (LLMs) for electronic health records (EHRs) optimization," Cureus, vol. 15, no. 7, 2023.

[6] B. Meskó and E. J. Topol, "The imperative for regulatory oversight of large language models (or generative AI) in healthcare," npj Digital Medicine, vol. 6, no. 1, p. 120, 2023.

[7] A. Toma, P. R. Lawler, J. Ba, R. G. Krishnan, B. B. Rubin, and B. Wang, "Clinical Camel: An open-source expert-level medical language model with dialogue-based knowledge encoding," arXiv preprint arXiv:2305.12031, 2023.

SUPPLEMENTARY NOTES

Below are additional case studies from the evaluation highlighting examples of GPT-4V’s output and comments from the evaluators.

Fig. 3. Case Study 1 - MRI.

Fig. 4. Case Study 2 - CT.

Fig. 5. Case Study 3 - ECG.
