VioNets: efficient multi-modal fusion method based on bidirectional gate recurrent unit and cross-attention graph convolutional network for video violence detection
7 April 2023
Wuyan Liang, Xiaolong Xu, Xiao Fu
Abstract

Recent research on video violence detection has made great progress with the development of multi-modal fusion techniques. However, existing approaches still struggle with real-time violence detection because their fused features are fixed and cannot be fine-tuned further. We address this challenge by exploring a multi-modal fusion method that analyzes violence cues from multiple modalities, i.e., audio, optical flow, and RGB images. We propose a unified network called VioNets, which contains both a cross-attention graph convolutional network (GCN) module and a bidirectional gate recurrent unit (Bi-GRU) module for fusing information from different modalities. First, the cross-attention GCN module extracts the cross-modal spatial–temporal features. The Bi-GRU module is then applied to capture both past and future context features at each time step of the single-modal features. As a result, the model retains important single-modal information in the extracted features while using the cross-modal features to improve detection accuracy. Experiments on the XD-Violence dataset show that the proposed method achieves an average precision of 80.59% with an inference time of 0.16 s and 1.82 M parameters.
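
The two modules named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no layer sizes, weights, or fusion rule, so the function names (`cross_attention_gcn`, `bi_gru`), the dimensions, the random weights, the use of an attention map as the graph adjacency, and the final concatenation with a sigmoid score are all assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attention_gcn(query_feats, context_feats, weight):
    """One graph-convolution-style layer whose adjacency is a cross-attention map.

    query_feats, context_feats: (T, d) per-time-step features of two modalities.
    weight: (d, d_out) projection (hypothetical, randomly initialised below).
    """
    d = query_feats.shape[1]
    # Row t holds attention weights from time step t of the query modality
    # to every time step of the context modality.
    adjacency = softmax(query_feats @ context_feats.T / np.sqrt(d), axis=-1)
    return np.maximum(adjacency @ context_feats @ weight, 0.0)  # ReLU

def init_gru(d_in, d_hid):
    # Index 0: update gate z, 1: reset gate r, 2: candidate state n.
    return {
        "W": rng.normal(0, 0.1, (3, d_in, d_hid)),
        "U": rng.normal(0, 0.1, (3, d_hid, d_hid)),
        "b": np.zeros((3, d_hid)),
    }

def gru_pass(x, p):
    """Run a standard GRU over x of shape (T, d_in); returns (T, d_hid)."""
    h = np.zeros(p["b"].shape[1])
    out = np.zeros((x.shape[0], h.shape[0]))
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ p["W"][0] + h @ p["U"][0] + p["b"][0])
        r = sigmoid(x[t] @ p["W"][1] + h @ p["U"][1] + p["b"][1])
        n = np.tanh(x[t] @ p["W"][2] + (r * h) @ p["U"][2] + p["b"][2])
        h = (1.0 - z) * n + z * h
        out[t] = h
    return out

def bi_gru(x, p_fwd, p_bwd):
    """Concatenate a forward pass (past context) and a backward pass (future context)."""
    fwd = gru_pass(x, p_fwd)
    bwd = gru_pass(x[::-1], p_bwd)[::-1]  # re-align backward outputs to time order
    return np.concatenate([fwd, bwd], axis=1)

# Toy inputs: 8 time steps, 16-dim features per modality (assumed sizes).
T, d, d_hid = 8, 16, 4
rgb = rng.normal(size=(T, d))
audio = rng.normal(size=(T, d))

cross = cross_attention_gcn(rgb, audio, rng.normal(0, 0.1, (d, d_hid)))
single = bi_gru(rgb, init_gru(d, d_hid), init_gru(d, d_hid))

# Assumed fusion: concatenate single-modal and cross-modal features,
# then score each time step with a linear layer and a sigmoid.
fused = np.concatenate([single, cross], axis=1)
scores = sigmoid(fused @ rng.normal(0, 0.1, 3 * d_hid))
```

In this sketch, `single` keeps the per-modality temporal context while `cross` carries information attended from the other modality, which mirrors the division of labor the abstract describes; how VioNets actually combines the two is not stated there.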

© 2023 SPIE and IS&T
Wuyan Liang, Xiaolong Xu, and Xiao Fu "VioNets: efficient multi-modal fusion method based on bidirectional gate recurrent unit and cross-attention graph convolutional network for video violence detection," Journal of Electronic Imaging 32(2), 023031 (7 April 2023). https://doi.org/10.1117/1.JEI.32.2.023031
Received: 28 November 2022; Accepted: 21 March 2023; Published: 7 April 2023
KEYWORDS
Video
RGB color model
Video surveillance
Feature extraction
Optical flow
Visualization
Feature fusion
