Traditional methods for analyzing video of human action rely on extracting the spatial content of individual image frames and the temporal variation across the frames of the video. Frames typically contain redundant spatial content and temporally invariant regions: background data and stationary objects, for instance, contribute little information yet increase storage requirements and computation time. This paper focuses on the analysis of keypoint data obtained by capturing body movement, hand gestures, and facial expressions in video-based sign language recognition. The keypoint data is obtained from OpenPose, which provides real-time two-dimensional human pose estimates for multiple persons. K-means clustering is applied to the keypoint data to select key frames, with the number of key frames determined by the number of centroids formed from the keypoint data. The method described in this paper generates the data required for deep learning applications.
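The key-frame selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each frame is represented as a flat vector of OpenPose keypoint coordinates, runs a plain Lloyd's K-means with a farthest-point initialization (a simplifying choice made here for determinism), and returns the frame closest to each centroid as a key frame.

```python
import numpy as np

def _init_centroids(X, k):
    # Farthest-point heuristic: start from frame 0, then repeatedly
    # add the frame farthest from all centroids chosen so far.
    idx = [0]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(X[:, None] - X[idx][None], axis=2), axis=1)
        idx.append(int(d.argmax()))
    return X[idx].copy()

def kmeans_key_frames(X, k=3, iters=100):
    """Cluster per-frame keypoint vectors with Lloyd's K-means and
    return the index of the frame nearest each centroid, i.e. one
    key frame per cluster."""
    centroids = _init_centroids(X, k)
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids as cluster means (keep old one if empty).
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return sorted(int(dists[:, c].argmin()) for c in range(k))

# Synthetic stand-in for OpenPose output: 60 frames, each a vector of
# 25 joints x 2 coordinates, forming three distinct pose phases.
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(loc, 0.05, size=(20, 50))
                    for loc in (0.0, 5.0, 10.0)])
keys = kmeans_key_frames(frames, k=3)
print(keys)  # one representative frame index per pose phase
```

In practice the input matrix would be built from the per-frame JSON keypoints that OpenPose emits, and the selected key frames would be passed on as training data for the downstream deep learning model.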