1. INTRODUCTION

In the field of computer vision, the person re-identification task is usually given a surveillance image of a specific pedestrian and asked to retrieve, from a database, images of the same pedestrian captured by other cameras[1]. Because cameras differ in position and viewing angle, and because of pedestrian posture, occlusion, illumination changes, and other factors, images of the same pedestrian can vary considerably, which has made person re-identification a hot research topic. Early work addressed the task with traditional hand-crafted methods. With the development of convolutional neural networks, deep-learning-based methods have been applied to person re-identification. At present, mainstream deep-learning methods mostly adopt average pooling, max pooling, or a combination of the two. Combining the strengths of average pooling and max pooling, we propose a person re-identification method based on multi-scale residual pooling. According to the information contained in features at different scales in the network, a new residual pooling module combines the advantages of average pooling and max pooling to extract more comprehensive and more discriminative residual pooling features to represent pedestrians, improving the performance of the re-identification network.

2. RELATED WORK

Conventional person re-identification methods mainly comprise feature-representation and distance-metric approaches. Feature-representation methods mainly extract features such as color, LBP[2], and SIFT[3]. Because a single feature has limited power to represent a pedestrian target, researchers have proposed many other methods: Reference [4] uses a cumulative color histogram to represent global features and then extracts local features; Reference [5] introduced LOMO.
Distance-metric methods hold that the distance between images of the same pedestrian should be smaller than the distance between different pedestrians; KISSME[6] and the LMNN[7] algorithm are used to learn the best similarity measure. Traditional methods have limited ability to extract features and are not effective for recognition in real-world scenes, so numerous scholars now apply deep learning to the person re-identification task. At present, most deep-learning research is devoted to extracting global and local features to obtain discriminative pedestrian feature representations. GLAD[8] uses global and local refinement to extract features; Reference [9] proposed the evenly partitioned PCB model: the feature map is divided into equal parts, the image stripes are aligned through the RPP network, and the local features of each stripe are then extracted. When constructing a convolutional neural network, the usual practice is to insert a pooling layer after a convolutional layer; its function is to reduce the dimensionality of the convolutional output, suppress noise, and prevent overfitting. Average pooling can transfer feature information completely but is easily affected by background noise; max pooling extracts more discriminative features but attends mainly to local information. Table 1 shows the pooling methods of mainstream networks. Most CNN-based person re-identification methods use only average pooling or max pooling, or simply fuse the output features of the two. Table 1. Pooling methods of mainstream networks.
3. PERSON RE-IDENTIFICATION METHOD BASED ON MULTI-SCALE RESIDUAL POOLING

The person re-identification network comprises two parts: multi-scale feature extraction and residual feature acquisition. Figure 1 shows the network structure. Figure 1. Person re-identification network structure based on multi-scale residual pooling feature fusion.

3.1 Multi-scale feature extraction

Different levels of a convolutional neural network produce feature maps of different spatial resolutions, and feature maps obtained from different convolutional layers contain different information. High-level features focus more on semantic information and less on image detail, while low-level features may contain more detail along with cluttered background information. Therefore, most researchers combine features from multiple scales so that features at different levels complement one another in this simple and effective way. Our network removes the last fully connected layer of ResNet-50 and appends average pooling and max pooling layers, as shown in Figure 1. Avg(m) and Max(m) denote average pooling and max pooling, respectively, producing a feature map of width and height m. The output features of layer3 of ResNet-50 are globally average pooled and globally max pooled, yielding feature maps Pavg1 and Pmax1 of dimension 1×1×1024. Similarly, the output features of layer4 are globally average pooled and globally max pooled, yielding Pavg2 and Pmax2 of dimension 1×1×2048. To reduce the information loss caused by pooling, the strides of the average and max pooling are adjusted to obtain richer feature information, yielding feature maps Pavg3 and Pmax3 of dimension 2×2×2048.
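The multi-scale pooling described above can be illustrated with a small numpy sketch. Random arrays stand in for the ResNet-50 layer3/layer4 outputs (the toy spatial sizes and all function names are ours); only the channel counts and output dimensions follow the text.

```python
import numpy as np

def global_avg_pool(fmap):
    # (C, H, W) -> (C,): mean over the spatial dimensions
    return fmap.mean(axis=(1, 2))

def global_max_pool(fmap):
    # (C, H, W) -> (C,): max over the spatial dimensions
    return fmap.max(axis=(1, 2))

def grid_pool(fmap, out, mode):
    # Pool (C, H, W) down to (C, out, out): one value per grid cell,
    # as when the pooling stride is enlarged to leave a 2x2 output.
    c, h, w = fmap.shape
    hs, ws = h // out, w // out
    cells = fmap[:, :out * hs, :out * ws].reshape(c, out, hs, out, ws)
    return cells.mean(axis=(2, 4)) if mode == "avg" else cells.max(axis=(2, 4))

rng = np.random.default_rng(0)
layer3 = rng.standard_normal((1024, 18, 9))   # stand-in for the ResNet-50 layer3 output
layer4 = rng.standard_normal((2048, 9, 5))    # stand-in for the ResNet-50 layer4 output

p_avg1, p_max1 = global_avg_pool(layer3), global_max_pool(layer3)          # 1x1x1024
p_avg2, p_max2 = global_avg_pool(layer4), global_max_pool(layer4)          # 1x1x2048
p_avg3, p_max3 = grid_pool(layer4, 2, "avg"), grid_pool(layer4, 2, "max")  # 2x2x2048
```

The six outputs have exactly the dimensions named in the text (Pavg1/Pmax1: 1024-d; Pavg2/Pmax2: 2048-d; Pavg3/Pmax3: 2×2×2048); in the actual network the pooling runs on learned feature maps rather than random data.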
The multi-scale features Pavg1, Pmax1, Pavg2, Pmax2, Pavg3, and Pmax3 of a pedestrian image are fed into the residual pooling module to obtain the corresponding residual features Pcont1, Pcont2, and Pcont3, which are transformed into a unified dimension for fusion; the fused features are sent to the classifier, and finally the re-identification result is obtained.

3.2 Residual pooling module

Figure 2 shows the average pooling and max pooling commonly used in convolutional neural networks. Average pooling averages the features in a neighborhood, and max pooling takes the maximum of the features in a neighborhood. Although average-pooled features transfer the global information of the image more completely, their computation is easily affected by background clutter and occlusion and cannot highlight the difference between pedestrian and background. Compared with average pooling, max pooling reduces the impact of background clutter, but it attends mainly to the locally salient features of pedestrian images and the contour information of pedestrians, so the pooled features do not completely cover the whole-body information of pedestrians. In a real recognition environment, owing to changes in camera angle and external illumination, it is necessary to remove the influence of background clutter while preserving whole-body information and highlighting the difference between pedestrian and background. On this basis, this paper proposes a residual pooling module, shown in Figure 3.
By combining the advantages of max pooling and average pooling, this module compensates for the shortcomings of each: while preserving whole-body information, it highlights the pedestrian contour and focuses on the difference between pedestrian and background, making the final feature representation of pedestrian images more comprehensive and discriminative and improving re-identification accuracy. As shown in Figure 3, the residual pooling module subtracts Pmax from the multi-scale feature Pavg extracted by ResNet-50 and applies a 1×1 convolution to obtain the difference between Pavg and Pmax. The Pmax feature obtained by max pooling also passes through a 1×1 convolution and is added to the difference feature to give the residual feature Pcont. The residual pooling module integrates the merits of max and average pooling: the resulting Pcont not only covers the whole body of the pedestrian and sharpens the pedestrian contour but also reduces the impact of background clutter and attends to the difference between pedestrian and background. The residual feature Pcont is calculated as in equation (1):

Pcont = δ1×1(Pmax) + δ1×1(Pavg − Pmax)    (1)

where Pavg and Pmax are the features obtained after average pooling and max pooling, respectively, and δ1×1(·) denotes a 1×1 convolution.

3.3 Loss function

We train the model by jointly optimizing a triplet margin loss and a cross-entropy loss. The loss function is shown in equation (2):

Ltotal_loss = Ltriplet + Lcross-entropy    (2)

where Ltotal_loss is the total loss, Ltriplet is the triplet margin loss, and Lcross-entropy is the cross-entropy loss.

4. EXPERIMENT

For an objective comparative analysis, the experiments use the following four training strategies: 1) the learning rate follows a warmup schedule in the training stage; 2) random erasing is applied to the training data with probability 0.5; 3) label smoothing is used to improve the generalization of the model; 4) BNNeck is used to normalize features.
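The residual pooling computation of Section 3.2, equation (1), can be sketched numerically. This is a minimal sketch, not the trained module: the two 1×1 convolutions are reduced to channel-mixing matrices, and we use identity weights so the result can be checked by hand; all names and the toy channel count are ours.

```python
import numpy as np

def conv1x1(x, weight):
    # A 1x1 convolution on a (C, H, W) map is a linear mix of channels at every pixel.
    c, h, w = x.shape
    return (weight @ x.reshape(c, -1)).reshape(weight.shape[0], h, w)

def residual_pool(p_avg, p_max, w_max, w_diff):
    # Equation (1): P_cont = conv1x1(P_max) + conv1x1(P_avg - P_max)
    return conv1x1(p_max, w_max) + conv1x1(p_avg - p_max, w_diff)

rng = np.random.default_rng(1)
c = 8                                                    # toy channel count (1024/2048 in the paper)
p_avg = rng.standard_normal((c, 2, 2))                   # average-pooled feature, e.g. Pavg3
p_max = p_avg + np.abs(rng.standard_normal((c, 2, 2)))   # max >= avg, by construction
identity = np.eye(c)                                     # identity weights for a hand-checkable result
p_cont = residual_pool(p_avg, p_max, identity, identity)
```

With identity weights the two branches cancel exactly to Pavg; in the network, the learned 1×1 weights let the module trade off the whole-body (average) information against the contour (max) information instead of collapsing to either one.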
In addition, we report the mean of three repeated runs as the experimental result to reduce randomness and ensure reliability.

4.1 Experimental datasets

The comparative experiments are based on the Market1501[16] and DukeMTMC-reID[17] datasets. Market1501[16] includes 1501 pedestrians captured by 6 cameras. Training set: 12936 images of 751 pedestrians. Test set: 19732 images of another 750 pedestrians, divided into a query set and a gallery set. The DukeMTMC-reID[17] dataset includes 36411 images of 1404 pedestrians captured by 8 cameras. Training set: 16522 images of 702 pedestrians. Test set: 17661 images of another 702 pedestrians, again divided into a query set and a gallery set.

4.2 Experimental setup

The algorithm is implemented in the PyTorch framework. Experiments are run on a platform with an NVIDIA GTX 1080Ti GPU and an Intel® Core™ i7-7700K CPU @ 4.20 GHz with 32 GB of memory. During training, the input pedestrian images are resized to 288×144 pixels, the batch size is 32, and training runs for 220 iterations in total. We use the SGD optimizer with an initial learning rate of 0.03 and a weight decay of 0.0005; after warmup, the learning rate drops to 0.003, 0.0003, and 0.0003 at the 40th, 110th, and 150th iterations, respectively.

4.3 Evaluation criteria

Rank-1, Rank-5, and Rank-10 from the cumulative matching characteristic (CMC) curve and the mean average precision (mAP) are used as evaluation metrics. During testing, a query image is taken from the query set and its similarity to every image in the gallery is measured. CMC(K) is the probability that a correct match to the query appears among the first K candidates; Rank-1, Rank-5, and Rank-10 are the accuracies at K = 1, 5, and 10. The mAP is the mean over all queries of the area under the precision-recall curve.
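The Rank-k and mAP computation just described can be sketched as follows. This is a simplified protocol (it assumes every query has at least one gallery match and ignores the same-camera filtering of the full benchmark); the helper names and toy data are ours.

```python
import numpy as np

def evaluate(dist, query_ids, gallery_ids, ks=(1, 5, 10)):
    # CMC Rank-k and mAP from a (num_query, num_gallery) distance matrix.
    cmc = np.zeros(len(ks))
    aps = []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])                # gallery sorted by distance to the query
        matches = gallery_ids[order] == qid
        hits = np.flatnonzero(matches)             # 0-based ranks of the true matches
        for j, k in enumerate(ks):
            cmc[j] += hits[0] < k                  # CMC: a correct match within the top k
        precisions = (np.arange(len(hits)) + 1) / (hits + 1)  # precision at each hit
        aps.append(precisions.mean())              # average precision for this query
    return cmc / len(query_ids), float(np.mean(aps))

# Toy example: 2 queries against 4 gallery images.
dist = np.array([[0.1, 0.9, 0.4, 0.8],
                 [0.2, 0.7, 0.6, 0.3]])
cmc, mAP = evaluate(dist, np.array([0, 1]), np.array([0, 1, 0, 1]))
# Query 0 hits at rank 1; query 1 hits at ranks 2 and 4,
# so Rank-1 = 0.5, Rank-5 = Rank-10 = 1.0, mAP = (1.0 + 0.5) / 2 = 0.75.
```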
4.4 Analysis of experimental results

To validate the proposed method, comparative experiments are carried out on the Market1501[16] and DukeMTMC-reID[17] datasets. Figure 4 shows the re-identification results: the first column of each row is the query image and the remaining 10 columns are the top-10 retrieval results; solid borders mark correct results and dotted borders mark incorrect results. The avg method pools the base image features with average pooling in our network structure to obtain the feature maps; the max method uses max pooling instead. Figure 5 compares the accuracy and loss curves of the avg method, the max method, and the residual pooling module (residual) method on the Market1501[16] dataset. The accuracy of our method is higher than that of the avg and max methods, and its loss decreases faster, verifying that the features obtained from the residual pooling module achieve better performance on the re-identification task. Grad-CAM class activation heat maps are used to visualize the avg, max, and residual methods. Figure 6 shows that the avg method captures whole-body information well but is easily affected by background clutter, while the max method focuses on the local pedestrian contour but does not cover the whole body. Our method combines the advantages of both: it covers the whole body of the pedestrian while reducing the impact of background clutter. Figure 6. Visualization results of different methods. (a) input images; (b) avg method; (c) max method; (d) residual method. Table 2 compares several mainstream methods on the Market1501[16] and DukeMTMC-reID[17] datasets.
Our residual + re-ranking method optimizes performance with re-ranking[18]; the ResNet50_baseline method directly uses the features output by layer4 of ResNet-50. The mAP of the residual + re-ranking method on the Market1501 and DukeMTMC-reID datasets is 94.52% and 89.30%, respectively. With ResNet-50 as the backbone, the method's metrics are significantly improved. Table 2. Performance indicator comparison 1 among different methods (%).
Table 3 shows the metrics of Residual, Residual + re-ranking, and representative methods (SVDNet, GLAD[8], PCB[9], PCB+RPP[9], BEF, etc.). The metrics of the compared methods are all quoted from the original papers; "—" means the result is not reported in the original paper. The metrics of our method after re-ranking are also given. Rank-1 and mAP of Residual + re-ranking surpass the current advanced methods, especially in mAP. On the Market1501[16] dataset, after re-ranking, the proposed method (residual + re-ranking) is 2.61% higher in Rank-1 and about 13% higher in mAP than the local-feature-based PCB+RPP method. Compared with the BEF method, Rank-1 and mAP increase by 1.1% and 7.8%, respectively; compared with the DG-Net method, by 1.61% and 8.52%; and compared with the CtF method, Rank-1 is 2.2% higher and mAP is 9.62 percentage points higher. On the DukeMTMC-reID[17] dataset, the Rank-1 of this method (residual) is lower than that of BEF, but the mAP is 1.7% higher. After re-ranking, the Rank-1 and mAP of Residual + re-ranking are higher than those of the representative methods. Table 3. Performance indicator comparison 2 among different methods (%).
Table 4 compares parameter count, computation, and inference time between our method and the compared methods. Compared with most representative methods, the Residual method has more model parameters but requires less computation (FLOPs) and less inference time per image. Table 4. Parameters, computation, and inference time comparison among different methods.
5. CONCLUSION

To address the limited feature-expression ability of person re-identification networks, and combining the advantages of max pooling and average pooling, we propose a multi-scale residual pooling person re-identification method. By extracting features at different scales in the network and feeding them into the residual pooling module, the method compensates for the deficiencies of max pooling and average pooling and makes the network focus on the difference between pedestrian and background in the image. Experiments on the Market1501 and DukeMTMC-reID datasets show that our method significantly improves accuracy compared with the SVDNet, GLAD, and PCB methods; its Rank-1 and mAP on Market1501 are 96.41% and 94.52%, respectively. In the future, more salient features will be extracted by using deformable convolution or introducing an attention mechanism, to further improve performance on the person re-identification task.

REFERENCES

[1] Song, W., Zhao, Q., Chen, C., et al.,
"Survey on pedestrian re-identification research," CAAI Transactions on Intelligent Systems, 12(6), 770–780 (2017).
[2] Ojala, T., Pietikainen, M. and Maenpaa, T., "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987 (2002). https://doi.org/10.1109/TPAMI.2002.1017623
[3] Lowe, D. G., "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2), 91–110 (2004). https://doi.org/10.1023/B:VISI.0000029664.99615.94
[4] Bazzani, L., Cristani, M., Perina, A., et al., "Multiple-shot person re-identification by chromatic and epitomic analyses," Pattern Recognition Letters, 33(7), 898–903 (2012). https://doi.org/10.1016/j.patrec.2011.11.016
[5] Liao, S., Hu, Y., Zhu, X., et al., "Person re-identification by local maximal occurrence representation and metric learning," IEEE Conf. on Computer Vision and Pattern Recognition, 2197–2206 (2015).
[6] Koestinger, M., Hirzer, M., Wohlhart, P., et al., "Large scale metric learning from equivalence constraints," IEEE Conf. on Computer Vision and Pattern Recognition, 2288–2295 (2012).
[7] Weinberger, K. Q. and Saul, L. K., "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, 10(2), 207–244 (2009).
[8] Wei, L. H., Zhang, S. L., Yao, H. T., et al., "GLAD: Global local-alignment descriptor for scalable person re-identification," IEEE Transactions on Multimedia, 21(4), 986–999 (2018). https://doi.org/10.1109/TMM.6046
[9] Sun, Y. F., Zheng, L., Yang, Y., et al., "Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)," European Conf. on Computer Vision, 480–496 (2018).
[10] LeCun, Y., Bottou, L., Bengio, Y., et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 2278–2324 (1998).
[11] Krizhevsky, A., Sutskever, I. and Hinton, G. E., "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 25, 1097–1105 (2012).
[12] Simonyan, K. and Zisserman, A., "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556v6 (2014).
[13] Lin, M., Chen, Q. and Yan, S., "Network in network," arXiv:1312.4400v3 (2013).
[14] Szegedy, C., Liu, W., Jia, Y., et al., "Going deeper with convolutions," IEEE Conf. on Computer Vision and Pattern Recognition, 1–9 (2015).
[15] He, K., Zhang, X., Ren, S., et al., "Deep residual learning for image recognition," IEEE Conf. on Computer Vision and Pattern Recognition, 770–778 (2016).
[16] Zheng, L., Shen, L. Y., Tian, L., et al., "Scalable person re-identification: A benchmark," IEEE Inter. Conf. on Computer Vision, 1116–1124 (2015).
[17] Zheng, Z. D., Zheng, L. and Yang, Y., "Unlabeled samples generated by GAN improve the person re-identification baseline in vitro," IEEE Inter. Conf. on Computer Vision, 3774–3782 (2017).
[18] Zhong, Z., Zheng, L., Cao, D. L., et al., "Re-ranking person re-identification with k-reciprocal encoding," IEEE Conf. on Computer Vision and Pattern Recognition, 1318–1327 (2017).
[19] Sun, Y. F., Zheng, L., Deng, W. J., et al., "SVDNet for pedestrian retrieval," IEEE Inter. Conf. on Computer Vision, 3800–3808 (2017).
[20] Dai, Z. Z., Chen, M. V. Q., Gu, X. D., et al., "Batch feature erasing for person re-identification and beyond," arXiv:1811.07130v2 (2019).
[21] Lin, Y. T., Zheng, L., Zheng, Z. D., et al., "Improving person re-identification by attribute and identity learning," Pattern Recognition, 95(C), 151–161 (2019). https://doi.org/10.1016/j.patcog.2019.06.006
[22] Zheng, Z. D., Yang, X. D., Yu, Z. D., et al., "Joint discriminative and generative learning for person re-identification," IEEE Conf. on Computer Vision and Pattern Recognition, 2138–2147 (2019).
[23] Wang, G. A., Gong, S. G., Cheng, J., et al., "Faster person re-identification," European Conf. on Computer Vision, 275–292 (2020).