Vital signs can be inferred from facial videos for remote health monitoring, since facial videos are easily obtained through phone cameras, webcams, or surveillance systems. In this study, we propose a hybrid deep learning model to estimate heart rate (HR) and blood oxygen saturation (SpO2) from facial videos. The hybrid model has a mixed network architecture consisting of a convolutional neural network (CNN), a convolutional long short-term memory (ConvLSTM) network, and a video vision transformer (ViViT). Temporal resolution is emphasized in feature extraction because both HR and SpO2 vary over time. A video clip consists of a set of frame images within a time segment. The CNN is applied to each frame individually (i.e., in a time-distributed manner), while the ConvLSTM and ViViT are configured to process a sequence of frames. These high-resolution temporal features, which are expected to capture the signal variations, are combined to predict HR and SpO2. Our vital-sign video dataset is fairly large, including 891 subjects of different races and ages. Face detection and data normalization are performed in preprocessing. Our experiments show that the proposed hybrid model predicts HR and SpO2 accurately. In addition, the model can be extended to infer HR fluctuations, respiratory rate, and blood pressure variations from facial videos.
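The time-distributed feature extraction described above, where the same per-frame operation is applied independently to every frame of a clip so the temporal axis is preserved, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the CNN is replaced by a trivial channel-mean extractor, and the clip shape (75 frames of 64x64 RGB face crops, roughly 3 s at 25 fps) is an assumption chosen for the example.

```python
import numpy as np

def frame_features(frame):
    # Stand-in for a per-frame CNN (assumption): reduce each frame
    # to channel-wise means over the cropped face region.
    return frame.mean(axis=(0, 1))  # shape: (channels,)

def time_distributed(clip, fn):
    # Apply the same extractor to every frame, keeping the time axis
    # intact -- the "time-distributed" CNN branch of the hybrid model.
    return np.stack([fn(frame) for frame in clip])  # (T, feat_dim)

# Hypothetical clip: 75 frames of 64x64 RGB face crops.
clip = np.random.rand(75, 64, 64, 3)
feats = time_distributed(clip, frame_features)
print(feats.shape)  # (75, 3)
```

The resulting (T, feat_dim) sequence retains one feature vector per frame, which is what allows downstream sequence models such as a ConvLSTM or ViViT to model the temporal variation of HR and SpO2.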