Joint spatio-temporal feature learning is the key to video-based action recognition. Off-the-shelf techniques mostly apply two-stream networks that either simply fuse the classification scores or integrate only the high-level features; as a result, they cannot model the inter-modality relationship well. We propose a joint attentive (JA) adaptive feature fusion (AFF) network, a three-stream network that improves inter-modality fusion by exploiting the complementary and interactive information of two modalities, RGB and optical flow. Specifically, we design an AFF block that performs layer-wise fusion across both modality channels and feature levels, so that spatio-temporal representations from different modalities and levels can be fused effectively. To capture the three-dimensional interaction of spatio-temporal features, we devise a JA module that incorporates the inter-dependencies learned by a spatial-channel attention mechanism and combines multi-scale attention to refine fine-grained features. Extensive experiments on three public action recognition benchmark datasets demonstrate that our method achieves competitive results.
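The abstract does not give implementation details, but the two components it names, layer-wise adaptive fusion of two modality streams and joint spatial-channel attention, follow well-known patterns. The PyTorch-style sketch below illustrates one plausible reading: `AFFBlock` gates per-layer RGB and optical-flow feature maps with learned per-channel weights, and `JAModule` refines the fused map with channel and spatial attention. The class names, the gating design, and the `reduction` factor are all illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class AFFBlock(nn.Module):
    """Hypothetical adaptive feature fusion block: fuses the RGB and
    optical-flow feature maps of one layer via learned channel gates."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Bottleneck that predicts per-channel fusion weights from the
        # concatenated modalities (an illustrative choice, not from the paper).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([rgb, flow], dim=1))  # (N, C, 1, 1) in [0, 1]
        return w * rgb + (1.0 - w) * flow             # convex combination


class JAModule(nn.Module):
    """Hypothetical joint attentive module: channel attention followed by
    spatial attention to refine a fused spatio-temporal feature map."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Squeeze-and-excitation style channel attention.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Single-channel spatial attention over mean/max pooled maps.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_att(x)
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * self.spatial_att(pooled)


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 28, 28)   # per-layer RGB stream features
    flow = torch.randn(2, 64, 28, 28)  # per-layer optical-flow features
    fused = JAModule(64)(AFFBlock(64)(rgb, flow))
    print(fused.shape)  # torch.Size([2, 64, 28, 28])
```

In this reading, one such fusion block would be inserted at each layer of the three-stream network, so that low-level and high-level features are fused rather than only the final classification scores; the convex-combination gate is one simple way to keep the fused map on the same scale as either input.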
Keywords: Video, RGB color model, Convolution, 3D modeling, Optical flow, Feature extraction, Network architectures