Lipreading feature extraction is essentially feature extraction from continuous video frame sequences. A lipreading model based on a two-way convolutional neural network with fused features is proposed to obtain more reasonable visual spatial-temporal features. Unlike other deep-learning-based lipreading methods, the proposed method uses rank pooling to transform a lip video into a standard RGB image that can be input directly into the convolutional neural network, which effectively reduces the input dimension. In addition, to compensate for the resulting lack of spatial information, the apparent shape and depth features are fused, and a joint cost function then guides the learning of the network model toward more discriminative features. The method was evaluated on the public GRID and OuluVS2 databases. The results show that its accuracy can exceed 93%, which validates the effectiveness of the method.
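The rank pooling step described above, which collapses a lip video into a single RGB image, can be illustrated with a minimal sketch. The abstract does not say which rank pooling variant is used, so this sketch assumes the simplified approximate form in which frame t of T receives the weight 2t - T - 1, so that later frames count positively and earlier frames negatively. The function names and the nested-list frame layout are hypothetical and chosen only for illustration; they are not taken from the article.

```python
def rank_pooling_coefficients(num_frames):
    """Weights alpha_t = 2t - T - 1 for t = 1..T.

    Assumption: simplified approximate rank pooling; the article may use
    a different variant.
    """
    return [2 * t - num_frames - 1 for t in range(1, num_frames + 1)]


def dynamic_image(frames):
    """Collapse a video into one RGB image.

    `frames` is a list of T frames, each a nested list of shape
    height x width x 3 holding numeric pixel values (hypothetical layout).
    Returns a height x width x 3 nested list of integers in 0..255.
    """
    if not frames:
        raise ValueError("at least one frame is required")

    alphas = rank_pooling_coefficients(len(frames))
    height, width = len(frames[0]), len(frames[0][0])

    # Weighted sum over time, computed independently per pixel and channel.
    pooled = [
        [
            [
                sum(a * frame[y][x][c] for a, frame in zip(alphas, frames))
                for c in range(3)
            ]
            for x in range(width)
        ]
        for y in range(height)
    ]

    # Min-max rescale so the result can be treated as an ordinary 8-bit image.
    values = [v for row in pooled for pixel in row for v in pixel]
    lo, hi = min(values), max(values)
    if hi == lo:
        # No temporal change anywhere: return a flat mid-gray image.
        return [[[128, 128, 128] for _ in range(width)] for _ in range(height)]
    return [
        [[round(255 * (v - lo) / (hi - lo)) for v in pixel] for pixel in row]
        for row in pooled
    ]
```

Because the whole sequence becomes one image of height x width x 3, the input to the network no longer grows with the number of frames, which is the dimensionality reduction the abstract refers to. A real pipeline would use an array library rather than nested lists; plain Python is used here only to keep the sketch dependency-free.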
CITATIONS
Cited by 1 scholarly publication.
KEYWORDS
Laser induced plasma spectroscopy
Video
Convolutional neural networks
Feature extraction
Databases
RGB color model
Information visualization