Convolutional neural networks (CNNs) and transformers are good at extracting local and global features, respectively, and both types of features are important for the no-reference image quality assessment (NR-IQA) task. We therefore propose a CNN–transformer dual-stream parallel fusion network for NR-IQA that simultaneously extracts local and global hierarchical features related to image quality. In addition, considering the importance of saliency in NR-IQA, a saliency-guided CNN and transformer feature fusion module is proposed to fuse and optimize the hierarchical features extracted by the dual-stream network. Finally, the high-level features of the dual-stream network are fused through a local–global cross-attention module to better model the interaction between local and global information in the image, and a quality prediction module containing evaluation and weight branches is used to obtain the quality score of distorted images. To comprehensively evaluate the performance of our model, we conducted experiments on six standard image quality assessment datasets; the results show that our model achieves better quality prediction performance and generalization ability than previous representative NR-IQA models.
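A minimal sketch of how a local–global cross-attention fusion between CNN feature maps and transformer tokens could look; this is illustrative code, not the authors' release, and the module name, dimensions, and pooling choices are assumptions.

```python
import torch
import torch.nn as nn

class LocalGlobalCrossAttention(nn.Module):
    """Illustrative cross-attention between CNN (local) and transformer (global) features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # local features attend to global tokens and vice versa
        self.local_to_global = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_to_local = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, local_feat, global_feat):
        # local_feat: (B, C, H, W) from the CNN stream
        # global_feat: (B, N, C) token sequence from the transformer stream
        local_tokens = local_feat.flatten(2).transpose(1, 2)            # (B, H*W, C)
        l2g, _ = self.local_to_global(local_tokens, global_feat, global_feat)
        g2l, _ = self.global_to_local(global_feat, local_tokens, local_tokens)
        # pool both directions and fuse into a single quality representation
        fused = torch.cat([l2g.mean(dim=1), g2l.mean(dim=1)], dim=-1)   # (B, 2C)
        return self.fuse(fused)                                          # (B, C)

# usage sketch
x_local = torch.randn(2, 256, 7, 7)    # CNN high-level feature map
x_global = torch.randn(2, 49, 256)     # transformer token features
print(LocalGlobalCrossAttention()(x_local, x_global).shape)  # torch.Size([2, 256])
```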
Under unfavorable conditions, fused images of infrared and visible images often lack edge contrast and detail. To address this issue, we propose an edge-oriented unrolling network comprising a feature extraction network and a feature fusion network. In our approach, the original infrared/visible image pair and their separately enhanced versions are combined as the input to provide richer prior information. First, the feature extraction network consists of four independent iterative edge-oriented unrolling feature extraction networks built on the edge-oriented deep unrolling residual module (EURM), in which the convolutions in the EURM are replaced with edge-oriented convolution blocks to enhance edge features. Then, a convolutional feature fusion network with a differential structure is proposed to obtain the final fusion result, using concatenation to map multidimensional features. In addition, the loss function of the fusion network is optimized to balance multiple features with significant differences and achieve a better visual effect. Experimental results on multiple datasets demonstrate that the proposed method produces competitive fusion images, both subjectively and objectively, with balanced luminance, sharper edges, and better details.
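As an illustration of the general idea behind an edge-oriented convolution block (a learnable convolution combined with a fixed edge operator), the following sketch pairs a standard convolution with a depthwise Sobel branch. The structure, channel counts, and the use of Sobel kernels are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeOrientedConv(nn.Module):
    """Illustrative edge-oriented convolution: a learnable branch plus a fixed
    Sobel branch that emphasizes edge responses (our reading of the idea)."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        kernel = torch.stack([sobel_x, sobel_y]).unsqueeze(1)        # (2, 1, 3, 3)
        # one fixed Sobel pair per channel, applied depthwise
        self.register_buffer("sobel", kernel.repeat(channels, 1, 1, 1))
        self.channels = channels
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        feat = self.conv(x)
        edges = F.conv2d(x, self.sobel, padding=1, groups=self.channels)  # (B, 2C, H, W)
        gx, gy = edges[:, 0::2], edges[:, 1::2]
        edge_mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)  # per-channel gradient magnitude
        # residual combination of learned features and edge responses
        return F.relu(self.merge(torch.cat([feat, edge_mag], dim=1)) + x)

# usage sketch
print(EdgeOrientedConv()(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```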
Unsupervised person reidentification (re-ID) addresses the problem that, in industrial application scenarios, the consistent features of the same person cannot be fully mined because the collected images lack annotations. Researchers focus on using pseudolabels to label images with the same clustering attributes, but traditional clustering methods are prone to generating noisy pseudolabels, which greatly reduce the accuracy of unsupervised person re-ID. We propose a method to refine pseudolabels by generating discriminative information based on channel partitioning. The feature map is divided at the channel level, a proximity matrix is generated by calculating the distance between each channel feature and the global feature, and the most negative sample is selected as the verification label to optimize the pseudolabel of the global feature. As a bidirectional guide, the global pseudolabel can in turn smooth each channel label, keeping channel features and global features consistent in pseudolabel selection. This way of learning local features links global information and channel information. Experimental results demonstrate the superior performance of our proposal on representative person re-ID datasets.
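The following sketch shows one plausible way to partition a feature map along the channel axis and build batch-level proximity matrices from which a "most negative" sample can be picked. It is our own illustrative interpretation; the number of partitions, the pooling, and the distance metric are assumptions.

```python
import torch
import torch.nn.functional as F

def channel_partition_proximity(feat_map, num_parts=4):
    """Illustrative sketch: split features along channels, pool each partition,
    and compute proximity matrices over the batch to find the most dissimilar
    ("most negative") sample for each anchor."""
    global_feat = F.normalize(feat_map.mean(dim=(2, 3)), dim=1)           # (B, C)
    parts = feat_map.chunk(num_parts, dim=1)                              # P tensors of (B, C/P, H, W)
    part_feats = [F.normalize(p.mean(dim=(2, 3)), dim=1) for p in parts]  # each (B, C/P)

    global_dist = torch.cdist(global_feat, global_feat)                   # (B, B) global proximity
    # per-partition proximity and the farthest sample for each anchor
    part_dists = [torch.cdist(pf, pf) for pf in part_feats]               # P x (B, B)
    hardest_negatives = [d.argmax(dim=1) for d in part_dists]             # P x (B,)
    return global_dist, part_dists, hardest_negatives

# usage sketch
fmap = torch.randn(8, 256, 16, 8)   # backbone feature maps for a batch of 8 images
g_dist, p_dists, negs = channel_partition_proximity(fmap)
```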
This paper proposes a network model based on a three-stream network and an improved attention mechanism for blind image quality assessment (TSAIQA). The inputs of the three streams are the distorted image, a pseudoreference image obtained by an improved generative adversarial network (GAN), and the gradient map of the distorted image. The distorted-image stream focuses on holistic quality-related features, the pseudoreference stream supplements the features lost to distortion, and the gradient stream explicitly extracts quality-related structural features. In addition, spatial and channel attention mechanisms combining first- and second-order information are proposed and applied to the three-stream network to optimize spatial- and channel-level features effectively. Finally, the fused three-stream features are fed into a quality regression network to predict image quality. To demonstrate the effectiveness of the proposed model, experiments are conducted on four classical IQA databases and two new large-scale databases. The experimental results show that our TSAIQA model outperforms state-of-the-art IQA methods and confirm the effectiveness of the proposed network structure and attention mechanisms.
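To make the idea of mixing first- and second-order statistics in channel attention concrete, here is a small sketch that combines per-channel means with a covariance-based summary. The reduction ratio, the covariance summary, and the module name are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class FirstSecondOrderChannelAttention(nn.Module):
    """Illustrative channel attention mixing first-order (mean) and second-order
    (covariance-based) statistics; a sketch of the idea, not the exact module."""
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)                                # (B, C, H*W)
        first = flat.mean(dim=2)                               # first-order: per-channel mean
        centered = flat - first.unsqueeze(2)
        cov = torch.bmm(centered, centered.transpose(1, 2)) / (h * w - 1)  # (B, C, C)
        second = cov.mean(dim=2)                               # second-order summary per channel
        weights = self.fc(torch.cat([first, second], dim=1))   # (B, C) attention weights
        return x * weights.view(b, c, 1, 1)
```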
We propose a saliency-enhanced two-stream convolutional network (SETNet) for no-reference image quality assessment. The proposed SETNet contains two subnetworks: an image stream and a saliency stream. The image stream focuses on the whole image content, while the saliency stream explicitly guides the network to learn spatially salient features that are more attractive to humans. In addition, a spatial attention module and a dilated-convolution-based channel attention module are employed to refine multi-level features in the spatial and channel dimensions. Finally, a fusion strategy is proposed to integrate image-stream and saliency-stream features at corresponding layers, and the final quality scores are predicted from the integrated multi-level features with a weighting strategy. Experimental results of the proposed method and several representative methods on four synthetic distortion datasets and two real distortion datasets show that SETNet has higher prediction accuracy and generalization ability.
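A minimal sketch of a dilated-convolution-based channel attention block, in which a dilated depthwise convolution enlarges the receptive field before channel statistics are pooled. This is our reading of the idea; the dilation rate, reduction ratio, and layer layout are assumptions.

```python
import torch
import torch.nn as nn

class DilatedChannelAttention(nn.Module):
    """Sketch of channel attention with a dilated conv for larger spatial context."""
    def __init__(self, channels=64, dilation=2, reduction=8):
        super().__init__()
        self.context = nn.Conv2d(channels, channels, 3, padding=dilation,
                                 dilation=dilation, groups=channels)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        context = self.context(x)               # dilated depthwise conv for context
        weights = self.fc(self.pool(context))   # (B, C, 1, 1) channel weights
        return x * weights

# usage sketch
print(DilatedChannelAttention()(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```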
Cross-modality person re-identification (Re-ID) between the RGB and infrared domains is a popular and challenging problem that aims to retrieve pedestrian images across modalities and camera views. Since there is a huge gap between the two modalities, the key difficulty is how to bridge the cross-modality gap between images. However, most approaches address this issue mainly by increasing the interclass discrepancy between features, and few studies focus on decreasing the intraclass cross-modality discrepancy, which is crucial for cross-modality Re-ID. Moreover, we find that despite the huge modality gap, the attribute representations of a pedestrian generally remain unchanged. We provide a different view of the cross-modality person Re-ID problem, using additional attribute labels as auxiliary information to increase intraclass cross-modality similarity. First, we manually annotate attribute labels for a large-scale cross-modality Re-ID dataset. Second, we propose an end-to-end network to learn modality-invariant and identity-specific local features under the joint supervision of an attribute classification loss and an identity classification loss. Experimental results on a large-scale cross-modality Re-ID benchmark show that our model achieves competitive Re-ID performance compared with state-of-the-art methods. To demonstrate the versatility of the model, we also report its results on the Market-1501 dataset.
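A short sketch of what joint supervision by an identity classification loss and a multi-label attribute classification loss could look like. The loss weighting, attribute encoding, and class names here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class JointIDAttributeLoss(nn.Module):
    """Illustrative joint supervision: identity cross-entropy plus multi-label
    binary cross-entropy over pedestrian attributes shared across modalities."""
    def __init__(self, attr_weight=1.0):
        super().__init__()
        self.id_loss = nn.CrossEntropyLoss()
        self.attr_loss = nn.BCEWithLogitsLoss()
        self.attr_weight = attr_weight  # assumed balancing factor

    def forward(self, id_logits, id_targets, attr_logits, attr_targets):
        # id_logits: (B, num_identities); attr_logits/attr_targets: (B, num_attributes)
        return self.id_loss(id_logits, id_targets) + \
               self.attr_weight * self.attr_loss(attr_logits, attr_targets.float())

# usage sketch
loss_fn = JointIDAttributeLoss()
loss = loss_fn(torch.randn(4, 395), torch.randint(0, 395, (4,)),
               torch.randn(4, 12), torch.randint(0, 2, (4, 12)))
```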
Example-based face sketch synthesis technology generally requires face photo-sketch images with face alignment and size normalization. To overcome this limitation, we propose a global face sketch synthesis method. In training, all training photo-sketch patch pairs are collected together and a photo feature dictionary is learned from the photo patches. For each atom of the dictionary, its K closest photo-sketch patch pairs are clustered, forming its "Anchored Neighborhood." In testing, for each test photo patch, we search for its nearest photo patch in the Anchored Neighborhood determined by its closest atom, and the corresponding sketch patch is the output. In the same way, we train and test in the high-frequency domain and synthesize the high-frequency results. Finally, the fusion of the initial and high-frequency results gives the final sketch. Experiments on three public face sketch datasets and various real-world photos demonstrate the effectiveness and robustness of the proposed method.
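The anchored-neighborhood lookup described above can be sketched in a few lines of NumPy: build the K-nearest patch index per dictionary atom offline, then at test time find the closest atom and search only inside its neighborhood. Function names, the brute-force distance computation, and K are illustrative assumptions.

```python
import numpy as np

def build_anchored_neighborhoods(atoms, photo_patches, k=16):
    """Sketch: for each dictionary atom, keep indices of its k closest photo patches
    (paired sketch patches share the same indices)."""
    # atoms: (A, D) learned photo dictionary; photo_patches: (N, D)
    dists = np.linalg.norm(atoms[:, None, :] - photo_patches[None, :, :], axis=2)  # (A, N)
    return np.argsort(dists, axis=1)[:, :k]                                        # (A, k)

def synthesize_patch(test_patch, atoms, neighborhoods, photo_patches, sketch_patches):
    """Find the test patch's closest atom, then the nearest photo patch inside that
    atom's anchored neighborhood, and output the paired sketch patch."""
    atom_idx = np.argmin(np.linalg.norm(atoms - test_patch, axis=1))
    cand = neighborhoods[atom_idx]                                 # candidate pair indices
    best = cand[np.argmin(np.linalg.norm(photo_patches[cand] - test_patch, axis=1))]
    return sketch_patches[best]

# usage sketch with random stand-in data
atoms = np.random.randn(64, 100)            # 64 dictionary atoms, 100-D patch features
photos = np.random.randn(5000, 100)         # training photo patches
sketches = np.random.randn(5000, 100)       # paired sketch patches
nbrs = build_anchored_neighborhoods(atoms, photos)
out = synthesize_patch(np.random.randn(100), atoms, nbrs, photos, sketches)
```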