Joint attentive adaptive feature fusion network for action recognition
Fan Xia, Min Jiang, Jun Kong, Danfeng Zhuang
Abstract

Joint spatio-temporal feature learning is the key to video-based action recognition. Most off-the-shelf techniques apply two-stream networks, which either simply fuse the classification scores or integrate only the high-level features; as a result, they cannot learn inter-modality relationships well. We propose a joint attentive (JA) adaptive feature fusion (AFF) network, a three-stream network that improves inter-modality fusion by exploiting the complementary and interactive information of two modalities, RGB and optical flow. Specifically, we design an AFF block that performs layer-wise fusion across both modality channels and feature levels, so that spatio-temporal representations of different modalities and levels can be fused effectively. To capture the three-dimensional interactions of spatio-temporal features, we devise a JA module that incorporates the inter-dependencies learned by a spatial-channel attention mechanism and combines multi-scale attention to refine fine-grained features. Extensive experiments on three public action recognition benchmark datasets demonstrate that our method achieves competitive results.
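
Since the abstract only outlines the architecture, the following PyTorch sketch illustrates one plausible reading of the two components: an AFF block that blends same-shape RGB and optical-flow feature maps with learned, input-dependent channel gates, and a JA module that applies channel attention followed by spatial attention. This is not the authors' implementation; the gating scheme, layer sizes, and the single-scale spatial attention (a simplification of the multi-scale attention described above) are illustrative assumptions.

# Minimal, illustrative PyTorch sketch of the two components named in the
# abstract. NOT the authors' code: the gating scheme, layer sizes, and the
# exact attention formulation are assumptions made for illustration only.
import torch
import torch.nn as nn


class AFFBlock(nn.Module):
    """Hypothetical adaptive feature fusion: blends same-shape RGB and
    optical-flow feature maps with learned, input-dependent channel gates."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global context per channel
            nn.Conv2d(2 * channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel weight in [0, 1]
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([rgb, flow], dim=1))       # (N, C, 1, 1)
        return w * rgb + (1.0 - w) * flow                  # adaptive modality blend


class JAModule(nn.Module):
    """Hypothetical joint attention: channel attention followed by spatial
    attention, refining the fused spatio-temporal features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention over pooled channel statistics; the 7x7 kernel
        # and single scale are illustrative guesses.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_att(x)                        # reweight channels
        stats = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )                                                  # (N, 2, H, W)
        return x * self.spatial_att(stats)                 # reweight locations


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 28, 28)    # RGB-stream feature map
    flow = torch.randn(2, 64, 28, 28)   # optical-flow-stream feature map
    fused = AFFBlock(64)(rgb, flow)
    refined = JAModule(64)(fused)
    print(refined.shape)                # torch.Size([2, 64, 28, 28])

In the full network, an AFF block of this kind would presumably be inserted at several feature levels of the two modality streams, with the JA module refining the fused representation before classification.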

© 2022 SPIE and IS&T 1017-9909/2022/$28.00
Fan Xia, Min Jiang, Jun Kong, and Danfeng Zhuang "Joint attentive adaptive feature fusion network for action recognition," Journal of Electronic Imaging 31(1), 013019 (1 February 2022). https://doi.org/10.1117/1.JEI.31.1.013019
Received: 27 August 2021; Accepted: 13 January 2022; Published: 1 February 2022
KEYWORDS: Video, RGB color model, Convolution, 3D modeling, Optical flow, Feature extraction, Network architectures