Neural network-based video segmentation has proven effective in producing temporally coherent segmentation and motion tracking of heart substructures in echocardiography. However, prior methods confine analysis to half-heartbeat systolic clips from end-diastole (ED) to end-systole (ES), requiring these frames to be specified in the video and limiting clinical applicability. Here we introduce CLAS-FV, a fully automated framework that builds on this prior work to provide joint semantic segmentation and motion tracking in multi-beat echocardiograms. Our framework employs a modified R2+1D ResNet stem, which efficiently encodes spatiotemporal features, and leverages sliding windows for both training and test-time augmentation to accommodate the full cardiac cycle. First, through 10-fold cross-validation on the half-beat CAMUS dataset, we show that the R2+1D-based stem outperforms the prior 3D U-Net both in Dice overlap for all substructures and in derived clinical indices of ED and ES ventricular volumes and ejection fraction (EF). Next, we use the large clinical EchoNet-Dynamic dataset to extend our framework to full multi-beat video segmentation. We obtain mean Dice overlaps of 0.94/0.91 on the left ventricle endocardium in the ED/ES phases, and accurately infer EF (mean absolute error 5.3%) across 1269 test patients. The presented multi-heartbeat video segmentation framework promises fast and coherent segmentation and motion tracking for rich phenotypic analysis of echocardiography.
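The sliding-window scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window length, stride, and the `model` callable are hypothetical placeholders, and overlapping per-window predictions are simply averaged to produce a single prediction per frame of the multi-beat clip.

```python
import numpy as np

def sliding_window_predict(video, model, win=32, stride=16):
    """Sketch of sliding-window inference over a multi-beat echo clip.

    video  : array of shape (T, H, W) holding T frames.
    model  : callable mapping a (win, H, W) window to per-frame
             probabilities of the same shape (a stand-in for the
             segmentation network; illustrative only).
    Overlapping window predictions are averaged, mimicking test-time
    augmentation over the full cardiac cycle.
    """
    T = video.shape[0]
    acc = np.zeros(video.shape, dtype=np.float64)  # summed probabilities
    cnt = np.zeros(T, dtype=np.float64)            # windows covering each frame
    starts = list(range(0, max(T - win, 0) + 1, stride))
    if T > win and starts[-1] != T - win:
        starts.append(T - win)  # extra window so the clip tail is covered
    for s in starts:
        e = min(s + win, T)
        acc[s:e] += model(video[s:e])
        cnt[s:e] += 1
    # average over however many windows saw each frame
    return acc / cnt[:, None, None]
```

Averaging overlapping windows smooths predictions at window boundaries, which is one simple way to keep segmentations temporally coherent across an arbitrary number of heartbeats.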