Proper assessment of the ability of an artificial intelligence (AI)-enabled medical device to generalize to new patient populations is necessary to determine the safety and effectiveness of the device. Assessing AI generalizability relies on performance assessment on a data set that represents the device’s intended population, and such a data set can be challenging to obtain. An understanding of the AI model’s decision space can indicate how the device is likely to perform on patients not represented in the available data. Our tool for decision region analysis for generalizability (DRAGen) assessment estimates the composition of the region of the decision space surrounding the available data. This provides an indication of how the model is likely to perform on samples that are similar to, but not represented by, the available finite data set. DRAGen can be applied to any binary classification model and requires no knowledge of the model’s training process. In a case study, we demonstrated DRAGen on a COVID classification model and showed that the decision region composition can identify differences in correct classification rates between the positive and negative classes, even when performance on the original test set is comparable. Performance evaluation using a data set that was represented neither during model development nor within the original test set showed a disparity in performance between COVID-positive and COVID-negative patients, as DRAGen indicated. By releasing this tool, we encourage AI developers to use it to improve their understanding of generalizability.
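The idea of a decision region composition can be illustrated with a small, hedged sketch. The text above does not specify how DRAGen samples the decision space, so the code below assumes one plausible approach: "virtual" samples are formed as random convex combinations of three real samples from the same class, and the composition is the fraction of those virtual samples that the model assigns to each class. The function `region_composition`, the classifier `toy_predict`, and the toy data are hypothetical names introduced only for illustration; they are not the DRAGen API and may differ from the released tool.

```python
import numpy as np


def region_composition(predict, X, y, n_virtual=1000, seed=0):
    """Estimate, per true class, how nearby 'virtual' samples are classified.

    Assumption (not taken from the source): virtual samples are random
    convex combinations of three real samples of the same class.
    `predict` is treated as a black box that maps an (n, d) array to 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    composition = {}
    for label in (0, 1):
        members = X[y == label]
        # Draw random triplets of real samples from this class.
        idx = rng.integers(0, len(members), size=(n_virtual, 3))
        # Random convex weights (non-negative, summing to 1) per triplet.
        weights = rng.dirichlet(np.ones(3), size=n_virtual)
        # Weighted sum over each triplet gives one virtual sample per row.
        virtual = np.einsum("nk,nkd->nd", weights, members[idx])
        frac_pos = float(np.mean(predict(virtual) == 1))
        composition[label] = {"negative": 1.0 - frac_pos, "positive": frac_pos}
    return composition


def toy_predict(Z):
    """Hypothetical black-box model: positive when |first feature| > 1."""
    return (np.abs(Z[:, 0]) > 1.0).astype(int)


# Toy data: both classes are classified perfectly on the available samples.
X_neg = np.array([[-0.5, 0.0], [0.0, 1.0], [0.5, -1.0], [0.2, 0.3]])
X_pos = np.array([[-2.0, 0.0], [-2.5, 1.0], [2.0, 0.5], [2.5, -1.0]])
X = np.vstack([X_neg, X_pos])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

test_accuracy = float(np.mean(toy_predict(X) == y))
composition = region_composition(toy_predict, X, y)

print("accuracy on available data:", test_accuracy)
print("region around negatives:", composition[0])
print("region around positives:", composition[1])
```

In this toy setting, both classes are classified correctly on the available samples, yet the region around the positive samples contains a mix of positive and negative predictions while the region around the negative samples does not. This mirrors the kind of class-wise difference that the decision region composition is meant to surface.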