Leveraging LLMs like ChatGPT for robust quality checks and medical text agreement rationale: enhancing adjudication quality and alignment in BICR for oncology clinical trials

Manish Sharma; Samira Farough; Andre Burkett; Jerome Prasanth; Nabil El-Shafeey; Dominic Zygadlo; Chera Dunn; Ron Korn

doi:10.1117/12.3009153

Published 2 April 2024

Proceedings Volume 12931, Medical Imaging 2024: Imaging Informatics for Healthcare, Research, and Applications; 1293103 (2024) https://doi.org/10.1117/12.3009153
Event: SPIE Medical Imaging, 2024, San Diego, California, United States

Abstract

Blinded Independent Central Review (BICR) is recommended by the US FDA for registration of oncology trials because it avoids image assessment bias and removes the chance of unblinding patient data. Double read with adjudication is the method used to reduce endpoint assessment variability. When the two readers disagree, a third reader, called an adjudicator, reviews both radiologists’ assessments and decides which is more accurate. Adjudication Rate (AR) and Adjudicator Agreement Rate (AAR) are the two indicators used to evaluate reviewer performance and overall trial variability and quality. Sentiment Analysis (SA) is based on natural language processing and can tag text as ‘positive’, ‘negative’, or ‘neutral’, although current technologies can provide a more complex analysis of emotions in written text. Medical SA can analyze patients’ and doctors’ opinions, sentiments, attitudes, and emotions in a clinical context. Python, the most frequently used programming language for deep learning worldwide, and ChatGPT, an AI-based chatbot, can be used to assess adjudicator comment quality through sentiment analysis. If successful, this analysis could open a novel application for Large Language Models (LLMs) such as ChatGPT in clinical research and medical imaging. This prospective study involved the review of cases for 100 subjects by board-certified radiologists using the Response Evaluation Criteria in Solid Tumors (RECIST) 1.1. The study employed a double read with adjudication paradigm in a central imaging review setup. The agreement of adjudication was assessed and compared with the overall response, agreed reader, and medical text. The medical text entered by the adjudicator is usually a free-text field that lacks standardization and control over its content, which may affect its correlation with the reviewer selected for agreement.
Although uncommon, adjudicator errors can occur due to ambiguous text, mis-clicks, or application delays. To analyze the adjudicator’s comments, sentiment analysis was conducted using a Python plug-in with ChatGPT as the large language model. Based on this analysis, each subject was categorized as either “Potential Error” or “No Error”. The ChatGPT-supported algorithm was evaluated against a Gold Standard determined by a board-certified radiologist with over 20 years of experience in the BICR process. The comparison, made to assess accuracy and reproducibility, showed that only four of the 100 subjects had different outcomes. Sensitivity was 0.857, specificity was 1.0, and accuracy was 0.96. The Natural Language Processing (NLP) capabilities of ChatGPT are evident in its ability to classify sentiment as positive, negative, or neutral from the free-text adjudicator comments provided during the review process. This classification enables a comparison with the actual assessment, adjudicator agreement, and overall patient outcome, highlighting the strong performance of ChatGPT on this task.
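
The comparison against the Gold Standard can be sketched in Python. This is a minimal illustration, not the authors' implementation: the abstract does not describe the prompt, the plug-in, or the flagging rule, so `flag_case`, its `favored_reader` input (the reader an LLM judges the comment to support), and the per-class counts are all assumptions. The counts 24/4/0/72 are not reported in the abstract; they are one split of 100 subjects with four discordant cases that reproduces the reported sensitivity, specificity, and accuracy when “Potential Error” is treated as the positive class.

```python
POSITIVE = "Potential Error"
NEGATIVE = "No Error"


def flag_case(selected_reader, favored_reader):
    """Hypothetical flagging rule (an assumption, not from the abstract).

    selected_reader: reader the adjudicator picked in the form, e.g. "R1".
    favored_reader:  reader the free-text comment appears to support,
                     as judged by a sentiment step (LLM call not shown).
    """
    return NEGATIVE if favored_reader == selected_reader else POSITIVE


def confusion_counts(predicted, gold, positive=POSITIVE):
    """Count (tp, fp, fn, tn) for algorithm labels against gold-standard labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = fp = fn = tn = 0
    for p, g in zip(predicted, gold):
        if g == positive and p == positive:
            tp += 1
        elif g != positive and p == positive:
            fp += 1
        elif g == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy; assumes both classes occur in gold."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


if __name__ == "__main__":
    # Inferred, illustrative labels: 28 gold-standard errors, 4 of them missed.
    gold = [POSITIVE] * 28 + [NEGATIVE] * 72
    predicted = [POSITIVE] * 24 + [NEGATIVE] * 4 + [NEGATIVE] * 72
    counts = confusion_counts(predicted, gold)
    for name, value in metrics(*counts).items():
        print(f"{name}: {value:.3f}")
```

With these inferred counts, sensitivity is 24/28, specificity is 72/72, and accuracy is 96/100, which round to the reported 0.857, 1.0, and 0.96. Under this reading, all four discordant subjects would be gold-standard errors that the algorithm did not flag.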


(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.

Citation

Manish Sharma, Samira Farough, Andre Burkett, Jerome Prasanth, Nabil El-Shafeey, Dominic Zygadlo, Chera Dunn, and Ron Korn "Leveraging LLMs like ChatGPT for robust quality checks and medical text agreement rationale: enhancing adjudication quality and alignment in BICR for oncology clinical trials", Proc. SPIE 12931, Medical Imaging 2024: Imaging Informatics for Healthcare, Research, and Applications, 1293103 (2 April 2024); https://doi.org/10.1117/12.3009153
