Blinded Independent Central Review (BICR) is recommended by the US FDA for oncology registration trials because it reduces image assessment bias and the risk of unblinding patient data. Double read with adjudication is the method used to reduce endpoint assessment variability. When the two readers disagree, a third reader, called an adjudicator, reviews both radiologists’ assessments and decides which is more accurate. Adjudication Rate (AR) and Adjudicator Agreement Rate (AAR) are the two indicators used to evaluate reviewer performance and overall trial variability and quality. Sentiment Analysis (SA) is based on natural language processing and can tag text as ‘positive’, ‘negative’, or ‘neutral’, although current technologies can provide a more complex analysis of the emotions in written text. Medical SA can analyze patients’ and doctors’ opinions, sentiments, attitudes, and emotions in a clinical context. Python, a programming language widely used for deep learning, and ChatGPT, an AI-based chatbot, can be used to assess adjudicator comment quality through sentiment analysis. If successful, this analysis could open a novel application for Large Language Models (LLMs) such as ChatGPT in clinical research and medical imaging.
This prospective study involved the review of cases for 100 subjects by board-certified radiologists using the Response Evaluation Criteria in Solid Tumors (RECIST) version 1.1. The study employed a double-read-with-adjudication paradigm in a central imaging review setup. Adjudicator agreement was assessed and compared with the overall response, the agreed reader, and the medical text. The medical text entered by the adjudicator is usually a free-text field that lacks standardization and control over its content, which may affect its correlation with the reader selected for agreement.
Although uncommon, adjudicator errors can occur due to ambiguous text, mis-clicks, or application delay errors. To analyze the adjudicator’s comments, sentiment analysis was conducted using a Python plug-in with ChatGPT as the large language model. Based on this analysis, subjects were categorized as either “Potential Error” or “No Error”. The ChatGPT-supported algorithm was evaluated against a gold standard determined by a board-certified radiologist with over 20 years of experience in the BICR process. A comparison was made to assess accuracy and reproducibility; only four of the 100 subjects had discordant outcomes. The sensitivity was 0.857, the specificity 1.0, and the accuracy 0.96. ChatGPT’s Natural Language Processing (NLP) capabilities are evident in its ability to classify sentiment as positive, negative, or neutral from the free-text adjudicator comments entered during the review process. This classification enables a comparison with the actual assessment, adjudicator agreement, and overall patient outcome, and ChatGPT performed strongly in this regard.
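The reported sensitivity, specificity, and accuracy follow from a 2×2 confusion table of the algorithm's labels against the gold standard. The abstract does not report the individual cell counts. The counts in the sketch below are an inference, not study data: they are one split of 100 subjects consistent with four discordant cases, a specificity of 1.0, and a sensitivity of 0.857, assuming “Potential Error” is treated as the positive class. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 table with "Potential Error" assumed to be the positive class."""
    tp: int  # algorithm and gold standard both say "Potential Error"
    fn: int  # gold standard says "Potential Error", algorithm says "No Error"
    fp: int  # algorithm says "Potential Error", gold standard says "No Error"
    tn: int  # algorithm and gold standard both say "No Error"


def sensitivity(c: ConfusionCounts) -> float:
    # Share of gold-standard "Potential Error" cases the algorithm caught.
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Share of gold-standard "No Error" cases the algorithm left unflagged.
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    # Share of all subjects on which algorithm and gold standard agree.
    return (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)


# Inferred counts (assumption): 0 false positives gives specificity 1.0,
# so all four discordant subjects are false negatives; 24 / (24 + 4) = 0.857.
counts = ConfusionCounts(tp=24, fn=4, fp=0, tn=72)

print(f"sensitivity = {sensitivity(counts):.3f}")
print(f"specificity = {specificity(counts):.3f}")
print(f"accuracy    = {accuracy(counts):.3f}")
```

Under these assumed counts, every disagreement is a missed “Potential Error” rather than a false alarm, which is why specificity stays at 1.0 while sensitivity falls below it.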