Open Access Paper
28 December 2022

Research on video anomaly detection with variational auto-encoder based on multi-level memory enhancement
Hongmin Zhang, Xiaobing Fang, Xu Zhuang
Proceedings Volume 12506, Third International Conference on Computer Science and Communication Technology (ICCSCT 2022); 125063L (2022) https://doi.org/10.1117/12.2662577
Event: International Conference on Computer Science and Communication Technology (ICCSCT 2022), 2022, Beijing, China
Abstract
Using autoencoders to reconstruct current video frames or to predict future frames is a popular approach to video anomalous behavior detection based on weakly supervised learning. Recent studies have shown that introducing a single memory module into an autoencoder captures only a limited number of normal patterns and cannot cope with new scenarios in the test set. This paper therefore proposes a two-stream conditional variational autoencoder model with multi-level memory enhancement (TS-MemCVAE), which takes RGB images and optical flow images as dual-stream input and adds memory modules at the bottleneck. The memory modules store normal-pattern features of different sizes. At the same time, with the aid of optical flow information, the model can sensitively identify abnormal behaviors, which produce large reconstruction errors. The model consists of two parts: a multi-level memory-enhanced autoencoder, which reconstructs the input video, and a conditional variational autoencoder, which captures the strong correlation between the reconstructed video and the optical flow images and makes further predictions of future frames. The model is validated on two benchmark datasets, UCSD Ped2 and CUHK Avenue, achieving AUCs of 95.83% and 84.16%, respectively; this performance demonstrates the effectiveness of the model.

1. INTRODUCTION

Video anomalous behavior detection refers to detecting behavior or appearance patterns in a scene that differ from normal patterns1, 2. Among the many autoencoder-based video anomalous behavior detection methods, reconstruction- and prediction-based methods are currently among the most widely studied. They assume that a model trained only on features from normal videos cannot reconstruct or predict abnormal events that were not seen during training. However, these methods do not account for a shortcoming of the autoencoder, namely its strong generalization ability: an autoencoder trained only on normal-event features may still clearly produce video frames containing abnormal events. To constrain the generalization ability of the autoencoder as far as possible, one common solution is to use a two-stream autoencoder to reconstruct the video image and the corresponding optical flow image, and to use the gap between the two reconstructions to identify whether a video frame contains abnormal behavior3-7. Another solution is to add a memory module inside the autoencoder to enhance the weights of normal patterns in the extracted features and suppress the expression of abnormal ones, thereby constraining the generalization of the autoencoder8, 9. Both solutions show that the performance of autoencoder-based abnormal behavior detection models largely depends on whether the autoencoder can be effectively restrained from over-generalizing.

Inspired by the above research, this paper combines a two-stream autoencoder with memory modules and proposes a multi-level memory-enhanced two-stream conditional variational autoencoder model for abnormal behavior detection (TS-MemCVAE). The model performs reconstruction and then prediction on the input images to obtain an output image; by processing the image twice, it widens the gap between the input and the output for abnormal frames and uses this gap to detect anomalous behavior.

2. A TWO-STREAM CONDITIONAL VARIATIONAL AUTOENCODER MODEL WITH MULTI-LEVEL MEMORY AUGMENTATION (TS-MEMCVAE)

As shown in Figure 1, the multi-level memory-enhanced two-stream conditional variational autoencoder model (TS-MemCVAE) is a cascaded weakly supervised video anomalous behavior detection framework composed of two parts: a Multi-level Memory Augmented Autoencoder (Multi-MemAE) for video frame reconstruction and a Two-Stream Conditional Variational Autoencoder (TS-CVAE) for video frame prediction. At training time, the TS-MemCVAE model is learned from video frames containing only normal events; at test time, the reconstruction error of Multi-MemAE and the prediction error of TS-CVAE are used together for video anomalous behavior detection.

Figure 1. TS-MemCVAE structure diagram.

The following sections introduce the multi-level memory-enhanced autoencoder (Multi-MemAE) and the two-stream conditional variational autoencoder (TS-CVAE), and then describe how the model performs video anomalous behavior detection.

2.1 Multi-level memory augmented autoencoder (Multi-MemAE)

The encoder and the decoder of the multi-level memory-enhanced autoencoder (Multi-MemAE) each have four layers. Each encoder layer contains a basic feature extraction module followed by a downsampling layer, and the basic feature processing module used in each decoder layer is the same as in the encoder. The downsampled features of each encoder layer are fused with the upsampled feature maps from the corresponding decoder layer and then passed in turn through the basic feature processing module of each decoder layer. The basic feature extraction module consists of a convolutional layer, a batch normalization (BN) layer, and a ReLU activation layer. Downsampling in the Multi-MemAE model is achieved by convolutional pooling, and upsampling by deconvolution. The structure of the Multi-MemAE model is shown in Figure 2.

Figure 2. Multi-MemAE structure diagram.
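For concreteness, the following PyTorch-style sketch illustrates the layer pattern described above: conv-BN-ReLU basic blocks, convolutional downsampling, deconvolutional upsampling, skip connections between corresponding encoder and decoder layers, and a MemAE-style memory module at the bottleneck. The channel widths, the number of memory items, and the cosine-similarity memory addressing are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def basic_block(in_ch, out_ch):
    # Basic feature extraction module: Conv -> BN -> ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class MemoryModule(nn.Module):
    """MemAE-style memory: learnable normal-pattern items addressed by
    softmax-normalized cosine similarity (internals assumed)."""
    def __init__(self, num_items, feat_dim):
        super().__init__()
        self.items = nn.Parameter(torch.randn(num_items, feat_dim))

    def forward(self, feat):                            # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        q = feat.permute(0, 2, 3, 1).reshape(-1, c)     # one query per spatial position
        attn = torch.softmax(
            F.normalize(q, dim=1) @ F.normalize(self.items, dim=1).t(), dim=1)
        out = attn @ self.items                         # re-express queries with normal patterns
        return out.view(b, h, w, c).permute(0, 3, 1, 2)

class MultiMemAE(nn.Module):
    """Four-level encoder/decoder with a memory module at the bottleneck;
    further memory modules could be placed at the decoder levels."""
    def __init__(self, in_ch=3, widths=(32, 64, 128, 256), mem_items=100):
        super().__init__()
        self.enc_blocks, self.downs = nn.ModuleList(), nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.enc_blocks.append(basic_block(prev, w))
            self.downs.append(nn.Conv2d(w, w, kernel_size=4, stride=2, padding=1))  # conv downsampling
            prev = w
        self.mem = MemoryModule(mem_items, widths[-1])   # bottleneck memory
        self.ups, self.dec_blocks = nn.ModuleList(), nn.ModuleList()
        for w in reversed(widths):
            self.ups.append(nn.ConvTranspose2d(prev, w, kernel_size=4, stride=2, padding=1))  # deconv upsampling
            self.dec_blocks.append(basic_block(w * 2, w))  # *2: skip-connection concatenation
            prev = w
        self.out = nn.Conv2d(prev, in_ch, kernel_size=3, padding=1)

    def forward(self, x):
        skips = []
        for block, down in zip(self.enc_blocks, self.downs):
            x = block(x)
            skips.append(x)          # encoder feature kept for fusion with the decoder
            x = down(x)
        x = self.mem(x)              # constrain the bottleneck feature with normal patterns
        for up, block, skip in zip(self.ups, self.dec_blocks, reversed(skips)):
            x = up(x)
            x = block(torch.cat([x, skip], dim=1))  # fuse encoder and decoder features
        return self.out(x)
```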

The memory enhancement module is constructed as in Hu et al.9. When training the Multi-MemAE model, following the idea of weak supervision, the video frame or its optical flow image is used as input and is then reconstructed. The L2 norm is used as the loss function of the model to penalize the distance between the reconstructed image and the input image.

[Equations (1) and (2): Multi-MemAE training loss, combining the L2 reconstruction error with the memory-module feature-separation and feature-diversity terms.]

λc and λd represent the weight coefficients of the feature separation and feature diversity terms in the memory module, respectively.
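Since equations (1) and (2) are not reproduced above, the following is a LaTeX sketch of a loss of this general form, assuming MemAE-style feature-separation and feature-diversity terms; the exact terms used in the paper may differ.

```latex
% Sketch (assumed form): L2 reconstruction error plus weighted
% memory-module feature-separation and feature-diversity terms
L_{MemAE} = \left\lVert I_t - \hat{I}_t \right\rVert_2^2
          + \lambda_c \, L_{separate} + \lambda_d \, L_{diverse}
```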

2.2 Two-stream conditional variational autoencoder (TS-CVAE)

The probability distribution modeled by a convolutional neural network for future frame prediction can be written as p(It+1 | I1:t), where It+1 denotes the future frame generated by the model and I1:t denotes the given t consecutive video frames from which It+1 is generated. The proposed model takes the RGB frames together with the optical flow images, used as auxiliary information for generating the image, as input, so its probability distribution can be written as p(It+1 | I1:t, H1:t), where H1:t denotes the optical flow images corresponding to the t consecutive video frames. The generation of the predicted future frame It+1 is thus jointly determined by the video frames I1:t and the corresponding optical flow images H1:t, so the model can be regarded as a convolutional neural network that directly maps I1:t and H1:t to It+1. In practice the value of t is usually small, generally between 4 and 6, so the content of t consecutive frames and the corresponding optical flow images is very similar and the covered video duration is short. A conditional variational autoencoder is therefore adopted as the generative model for p(It+1 | I1:t, H1:t): the consecutive video frames are fed into the conditional variational autoencoder to extract a latent variable Z describing their internal distribution, the predicted future frame It+1 is then generated from Z, and H1:t is used as a supplementary condition on Z. To obtain the optimal distribution, the evidence lower bound (ELBO) of the variational autoencoder is derived, which maximizes the expected log-likelihood of It+1 given Z while minimizing the KL divergence between the posterior q(Z | I1:t, H1:t) and the prior p(Z | H1:t). Its mathematical expression is:

[Equation (3): evidence lower bound (ELBO) of the conditional variational autoencoder.]
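Since equation (3) is not reproduced here, the following is a sketch of the standard conditional-VAE evidence lower bound, consistent with the distributions named in the text; the paper's exact formulation may differ.

```latex
% Sketch of the standard conditional-VAE evidence lower bound (assumed form)
\log p(I_{t+1} \mid I_{1:t}, H_{1:t}) \ge
  \mathbb{E}_{q(Z \mid I_{1:t}, H_{1:t})}\!\left[\log p(I_{t+1} \mid Z, H_{1:t})\right]
  - D_{KL}\!\left(q(Z \mid I_{1:t}, H_{1:t}) \,\|\, p(Z \mid H_{1:t})\right)
```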

Based on equation (3) above, the two-stream conditional variational autoencoder (TS-CVAE) shown in Figure 3 is proposed. The model consists of two encoders, ERF and EFlow, and a decoder DFFP. The encoder EFlow encodes the optical flow images H1:t to obtain the encoded feature EFlow(H1:t) and from it the prior distribution p(Z|H1:t); the encoder ERF encodes the video frames I1:t together with the optical flow images H1:t to obtain ERF(I1:t, H1:t) and the posterior distribution q(Z|I1:t, H1:t). During training, the TS-CVAE model samples the latent variable Z from the posterior distribution q(Z|I1:t, H1:t), combines Z with EFlow(H1:t), and feeds the result into the decoder to generate the future frame.

Figure 3. TS-CVAE structure diagram.
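A minimal PyTorch-style sketch of this two-encoder, one-decoder arrangement is given below. The encoder and decoder bodies, the latent dimensionality, the channel counts (frames_ch=12 and flow_ch=8 assume t=4 stacked RGB frames and two-channel flows), and the way Z is combined with EFlow(H1:t) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TSCVAE(nn.Module):
    """Two-stream conditional VAE sketch: E_Flow gives the prior p(Z|H_{1:t}),
    E_RF gives the posterior q(Z|I_{1:t}, H_{1:t}), D_FFP decodes the future frame."""
    def __init__(self, frames_ch=12, flow_ch=8, out_ch=3, feat_dim=256, z_dim=64):
        super().__init__()
        self.feat_dim = feat_dim
        self.enc_flow = nn.Sequential(                        # E_Flow
            nn.Conv2d(flow_ch, feat_dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.enc_rf = nn.Sequential(                          # E_RF (frames + flows)
            nn.Conv2d(frames_ch + flow_ch, feat_dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.prior_head = nn.Linear(feat_dim, 2 * z_dim)      # mu, logvar of p(Z|H)
        self.posterior_head = nn.Linear(feat_dim, 2 * z_dim)  # mu, logvar of q(Z|I,H)
        self.dec_fc = nn.Linear(z_dim + feat_dim, feat_dim * 7 * 7)  # D_FFP input
        self.dec_conv = nn.Sequential(
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_dim, out_ch, kernel_size=3, padding=1), nn.Tanh())

    @staticmethod
    def reparameterize(mu, logvar):
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # Z = mu + sigma * eps

    def forward(self, frames, flows):
        # frames: I_{1:t} stacked along channels; flows: H_{1:t} stacked along channels
        e_flow = self.enc_flow(flows)
        e_rf = self.enc_rf(torch.cat([frames, flows], dim=1))
        p_mu, p_logvar = self.prior_head(e_flow).chunk(2, dim=1)
        q_mu, q_logvar = self.posterior_head(e_rf).chunk(2, dim=1)
        z = self.reparameterize(q_mu, q_logvar)               # sample Z from the posterior
        h = self.dec_fc(torch.cat([z, e_flow], dim=1)).view(-1, self.feat_dim, 7, 7)
        pred = self.dec_conv(h)                               # predicted future frame I_{t+1}
        return pred, (q_mu, q_logvar), (p_mu, p_logvar)
```

At test time, when the real future frame is unavailable, Z would presumably be drawn from the prior p(Z|H1:t) produced by EFlow, as is standard for conditional VAEs.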

Following existing research showing that adding skip connections to future frame prediction does not yield a significant performance improvement, no skip connections are used between the encoder ERF and the decoder DFFP. A variational autoencoder typically assumes that the distributions p(It+1|Z, H1:t), q(Z|I1:t, H1:t), and p(Z|H1:t) in equation (3) are Gaussian. The loss function of the conditional variational autoencoder consists of two parts, the KL divergence term and the predicted-frame loss, and is calculated as follows:

[Equation (4): conditional variational autoencoder loss LCVAE, combining the KL divergence term and the predicted-frame loss.]
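A sketch of a loss of this form under the Gaussian assumptions above; the exact weighting in equation (4) is not reproduced, and placing the hyperparameter β (listed in Section 3.1) on the KL term is an assumption.

```latex
% Sketch (assumed form): KL term plus predicted-frame loss;
% weighting the KL term with beta is an assumption
L_{CVAE} = \beta \, D_{KL}\!\left(q(Z \mid I_{1:t}, H_{1:t}) \,\|\, p(Z \mid H_{1:t})\right)
         + \left\lVert \hat{I}_{t+1} - I_{t+1} \right\rVert_2^2
```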

In the equation, It+1 is the real video frame corresponding to the predicted future frame. To make the predicted future frames sharper, a gradient difference loss is also used to penalize the predicted image directly; its specific form is as follows:

[Equation (5): gradient difference loss Lgdl between the predicted frame and the ground-truth frame.]
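A sketch of the standard gradient difference loss over adjacent pixels (assumed form; the notation may differ from the paper's equation (5)):

```latex
% Sketch of the gradient difference loss over adjacent pixels (i, j)
L_{gdl}(\hat{I}, I) = \sum_{i,j}
    \left| \,|I_{i,j} - I_{i-1,j}| - |\hat{I}_{i,j} - \hat{I}_{i-1,j}|\, \right|^{\alpha}
  + \left| \,|I_{i,j} - I_{i,j-1}| - |\hat{I}_{i,j} - \hat{I}_{i,j-1}|\, \right|^{\alpha}
```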

where α is set to 1 in this experiment. The gradient difference loss only computes simple differences between adjacent pixels, which keeps the model training time low. Combining LCVAE and Lgdl, the final loss function of the TS-CVAE model is as follows:

[Equation (6): total TS-CVAE loss, combining LCVAE and Lgdl.]

Combining the loss functions of the Multi-MemAE part and the TS-CVAE part, the total loss function of the TS-MemCVAE model is as follows:

[Equation (7): total TS-MemCVAE loss, a weighted combination of the normalized Multi-MemAE and TS-CVAE losses.]

In the equation, G(·) denotes max-min normalization, and λm and λt represent the weight coefficients of the Multi-MemAE and TS-CVAE parts, respectively.
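A sketch consistent with the descriptions of equations (6) and (7) above; the unweighted sum in the first line and the exact arguments of G(·) are assumptions.

```latex
% Sketch (assumed forms) of the combined losses
L_{TS\text{-}CVAE} = L_{CVAE} + L_{gdl}
L_{TS\text{-}MemCVAE} = \lambda_m \, G\!\left(L_{MemAE}\right)
                      + \lambda_t \, G\!\left(L_{TS\text{-}CVAE}\right)
```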

2.3 Video frame anomaly score settings

The frame anomaly score of the TS-MemCVAE model at test time consists of two parts, the video frame reconstruction error Sr and the future frame prediction error Sp, and the final frame anomaly score is obtained by fusing these two error terms10, 11. Sr itself consists of two parts: the PSNR between the reconstructed image and the input image, and a term contributed by the three memory modules. Sp is the L2 distance between the predicted image and the corresponding real video frame.

[Equations (8)–(10): the reconstruction error Sr, the prediction error Sp, and the fused frame anomaly score S.]

𝜗r and 𝜗p in equation (10) represent the weight coefficients of the reconstruction error Sr and the prediction error Sp, and γ in equation (8) represents the equalization coefficient of the reconstruction error. The larger the anomaly score S, the more abnormal the current frame.
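The following is a minimal sketch of how such a frame score might be computed, assuming Sr combines PSNR with a memory-distance term weighted by γ, Sp uses the L2 prediction error, and both are fused with weights 𝜗r and 𝜗p. The sign convention, the normalization, and the memory-distance term are assumptions, not the paper's equations (8)–(10).

```python
import numpy as np

def psnr(x, y, data_range=2.0):
    # Peak signal-to-noise ratio; data_range=2.0 for pixels scaled to (-1, 1)
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(data_range ** 2 / (mse + 1e-8))

def frame_anomaly_score(frame, recon, pred, target, mem_distances,
                        gamma=0.01, w_r=0.5, w_p=0.5):
    """Assumed scoring: higher score = more abnormal.
    mem_distances: distances between encoder features and the matched
    normal-pattern items of the three memory modules."""
    # Reconstruction part: low PSNR and large memory distance -> large S_r
    s_r = -psnr(frame, recon) + gamma * float(np.sum(mem_distances))
    # Prediction part: L2 distance between predicted and real future frame
    s_p = float(np.linalg.norm(pred - target))
    return w_r * s_r + w_p * s_p
```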

3. EXPERIMENTAL RESULTS AND THEIR ANALYSIS

3.1 Performance evaluation criteria and experimental details settings

The TS-MemCVAE model is evaluated on two benchmark datasets (UCSD Ped26, CUHK Avenue6), and its performance is measured by AUC. The model is implemented in PyTorch 1.4 with the Adam optimizer and a learning rate of 2e-4, and runs on an NVIDIA GeForce RTX 2060 (6 GB). All images are resized to 224×224, and pixel values are rescaled from (0, 255) to (-1, 1). The hyperparameters β, γ, 𝜗r, 𝜗p, λc, λd, λm, λt are set to 0.9, 0.01, 0.5, 0.5, 0.15, 0.15, 0.4, and 0.6, respectively.
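A sketch of this experimental configuration in PyTorch; only the settings listed above are taken from the paper, and the model object is a stand-in placeholder rather than the actual TS-MemCVAE implementation.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Preprocessing described above: resize to 224x224, rescale pixels to (-1, 1)
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                                 # (0, 255) -> (0, 1)
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),   # (0, 1) -> (-1, 1)
])

# Hyperparameters as listed in Section 3.1
hparams = dict(beta=0.9, gamma=0.01, theta_r=0.5, theta_p=0.5,
               lambda_c=0.15, lambda_d=0.15, lambda_m=0.4, lambda_t=0.6)

# Optimizer settings as listed in Section 3.1 (placeholder module stands in for TS-MemCVAE)
model = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```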

3.2 Performance comparison analysis with other models

As can be seen from Table 1, the R-STAE model detects anomalies with a residual spatiotemporal autoencoder composed of 3D convolutions and LSTMs. The DSTN model builds a GAN-structured network that takes the optical flow map as the reconstruction target and the video frame as auxiliary information, and detects anomalies from the difference between the generated optical flow and the input optical flow; in contrast, the TS-MemCVAE model takes the video frame as the prediction target and the optical flow images as the supplementary condition. The MGFC-AAE model uses a parallel two-stream variational encoder as the backbone and detects anomalies by comparing the distribution of the extracted latent variable Z with a prior Gaussian distribution, without using reconstructed images. The GBA-AE model uses a single-encoder, dual-decoder network: it derives a gradient-based attention feature map from the reconstruction loss between the reconstructed image and the input image, fuses it with the features of the prediction decoder to generate future frames, and detects anomalies by comparing them with the ground-truth frames, thereby constraining the generalization ability of the autoencoder through the reconstructed image. The TS-MemCVAE model instead uses multiple memory modules to constrain the reconstructed image and further constrains the predicted image with optical flow information in the prediction task. Compared with R-STAE, DSTN, and MGFC-AAE, the TS-MemCVAE model adds a memory constraint on the generalization of the autoencoder; however, it does not perform as well on the CUHK Avenue dataset and cannot extract the intrinsic feature distribution from larger datasets well.

Table 1. AUC comparison of different models on two datasets.

Methods          UCSD Ped2    CUHK Avenue
R-STAE [12]      83%          82%
DSTN [13]        95.5%        87.9%
MGFC-AAE [14]    91.6%        84.2%
GBA-AE [15]      95.8%        87.4%
TS-MemCVAE       95.83%       84.16%

3.3 Experimental results of the model on anomalous datasets

The ROC curves of the TS-MemCVAE model on the UCSD Ped2 dataset and the CUHK Avenue dataset are shown in Figure 4. Figures 5 and 6 show some abnormal behaviors in the two datasets (Figures 5a, 5b, and 5c show the original image, the predicted image, and the difference image, respectively). As can be seen from Figure 6, the model can predict abnormal behaviors such as fast running, and the difference image shows the motion outline of the abnormal behavior relatively clearly. As can be seen from Figure 5, on the UCSD Ped2 dataset the gap between abnormal and normal behavior in the predicted images is small, which appears to contradict the model's ROC curve and prediction results. From the composition of the model's anomaly score, however, it can be seen that when the difference between the video frame and the predicted image is small or absent, the difference between the normal-pattern features stored in the memory modules and the extracted features becomes the decisive factor in judging whether a frame is abnormal. The multiple memory modules in the model therefore not only effectively constrain the generalization of the autoencoder, but also play an important role in quantifying the degree of abnormality of video frames.

Figure 4. ROC curves of the model on both datasets.

Figure 5. Abnormal behavior detection results of the model on the UCSD Ped2 dataset.

Figure 6. Abnormal behavior detection results of the model on the CUHK Avenue dataset.

3.4 Ablation experiment

To verify the influence of the multi-level memory modules in the TS-MemCVAE model and of the video frame and optical flow streams in the dual-stream input, ablation experiments were performed on each module. The results are shown in Table 2, where RGB denotes the input video frames and Flow denotes the corresponding optical flow images. As Table 2 shows, memory modules can be inserted at the bottleneck and in the decoding part of the autoencoder, and changing the position of a single memory module has little effect on performance, whereas inserting multiple memory modules at different positions of the autoencoder imposes a considerable constraint on its generalization. In addition, adding optical flow images allows the model to enhance the acquired motion-information features, thereby improving performance.

Table 2. Effects of different modules in the model on performance.

RGB / Flow configuration    AUC (%)
X X X                       93.23
X X X                       93.72
X X X                       94.68
X X                         95.83

4. CONCLUSION

In this paper, we propose a cascaded multi-level memory-augmented two-stream conditional variational autoencoder model (TS-MemCVAE) that combines video frame reconstruction and future frame prediction. The model consists of two parts: the Multi-level Memory Augmented Autoencoder (Multi-MemAE) and the Two-Stream Conditional Variational Autoencoder (TS-CVAE). In the first stage, Multi-MemAE inserts memory modules at the bottleneck of the autoencoder and in each layer of the decoder, imposing layer-by-layer constraints on the generalization of the autoencoder, and adds skip connections between the encoder and decoder to reduce feature loss. In the second stage, the TS-CVAE network captures the appearance and motion information in the video frames. The results show that the pre-reconstruction of video frames by the TS-MemCVAE model can effectively improve the prediction of future frames. However, because the model adopts a cascade structure composed of two autoencoders, its structure is relatively simple but it has many hyperparameters, making the model difficult to optimize. The hyperparameter tuning strategy of the model will therefore be further improved in future work.

ACKNOWLEDGMENTS

This work is sponsored by Natural Science Foundation of Chongqing, China (cstc2021jcyj-msxmX0525).

REFERENCES

[1] Wang, Z. and Zhang, Y., "Anomaly detection in surveillance videos: A survey," Journal of Tsinghua University (Science and Technology), 60, 518–529 (2020).

[2] Ji, G., Xu, Z., Li, X., et al., "Progress on abnormal event detection technology in video surveillance," Journal of Nanjing University of Aeronautics & Astronautics, 52, 685–694 (2020).

[3] Prawiro, H., Peng, J. W., Pan, T. Y., et al., "Abnormal event detection in surveillance videos using two-stream decoder," 2020 ICMEW, 1–6 (2020).

[4] Mehmood, A., "Abnormal behavior detection in uncrowded videos with two-stream 3D convolutional neural networks," Applied Sciences, 11, 3523–3548 (2021). https://doi.org/10.3390/app11083523

[5] Bouindour, S., Hu, R. and Snoussi, H., "Enhanced convolutional neural network for abnormal event detection in video streams," 2019 AIKE, 172–178 (2019).

[6] Nguyen, T. N. and Meunier, J., "Anomaly detection in video sequence with appearance-motion correspondence," 2019 ICCV, 1273–1283 (2019).

[7] Zhao, B., Zhao, B. and Li, P., "Video anomaly detection based on frame prediction of generative adversarial network," 2021 ICESIT, 387–391 (2021).

[8] Gong, D., Liu, L., Le, V., et al., "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," 2019 ICCV, 1705–1714 (2019).

[9] Hu, C., Wu, F., Wu, W., et al., "Normal learning in videos with attention prototype network," arXiv preprint arXiv:2108.11055 (2021).

[10] Pang, G., Yan, C., Shen, C., et al., "Self-trained deep ordinal regression for end-to-end video anomaly detection," 2020 CVPR, 12170–12179 (2020).

[11] Li, N., Chang, F. and Liu, C., "Spatial-temporal cascade autoencoder for video anomaly detection in crowded scenes," IEEE Transactions on Multimedia, 23, 203–215 (2020). https://doi.org/10.1109/TMM.2020.2984093

[12] Deepak, K., Chandrakala, S. and Mohan, C. K., "Residual spatiotemporal autoencoder for unsupervised video anomaly detection," Signal, Image and Video Processing, 215–222 (2020).

[13] Ganokratanaa, T., Aramvith, S. and Sebe, N., "Unsupervised anomaly detection and localization based on deep spatiotemporal translation network," IEEE Access, 8, 50312–50329 (2020). https://doi.org/10.1109/Access.6287639

[14] Fan, Y., Wen, G., Li, D., et al., "Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder," Computer Vision and Image Understanding, 195, 102920 (2020). https://doi.org/10.1016/j.cviu.2020.102920

[15] Lai, Y., Liu, R. and Han, Y., "Video anomaly detection via predictive autoencoder with gradient-based attention," 2020 ICME, 1–6 (2020).
KEYWORDS: Video, RGB color model, Image restoration, Optical flow, Data modeling, Performance modeling, Computer programming
