With the growing use of deep learning for remote sensing imagery, this paper proposes a self-attention model, the Dual-Stream Swin Transformer, to address the computational and memory demands that traditional Transformers face when processing high-resolution images. Specifically: 1) The Dual-Stream Swin Transformer decomposes the traditional Transformer encoder layer into smaller building blocks and introduces a shifted-windows mechanism for constructing self-attention. 2) Traditional Transformer models require substantial computation and storage on high-resolution images because they compute self-attention over the entire global image; the Swin Transformer sharply reduces these requirements by partitioning the image into non-overlapping windows and computing self-attention within each window. 3) The shifted-windows mechanism then introduces connections between adjacent windows, so information can flow across window boundaries while the overall complexity remains linear in the number of image tokens. This combination of decomposition and windowing makes the Swin Transformer well suited to high-resolution visual inputs: it achieves high accuracy with greater computational efficiency and lower memory consumption, and it performs strongly in image classification, object detection, and semantic segmentation tasks. We conducted comparative experiments between this model and other classical network models of the same type. Through its decomposition and window mechanisms, the Dual-Stream Swin Transformer effectively addresses the computational and memory challenges traditional Transformers face on high-resolution images, providing a new solution for efficiently processing large-scale visual data.
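To make the windowing and shift concrete, the following is a minimal PyTorch sketch of window partitioning, per-window multi-head self-attention, and the cyclic shift applied between consecutive blocks. The names (`window_partition`, `WindowAttention`, `shifted_window_attention`) and all parameter values are illustrative assumptions, not the paper's implementation; the sketch also omits parts of the full Swin design, notably the relative position bias and the attention mask that handles wrapped-around regions after the cyclic shift.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """Split a feature map (B, H, W, C) into non-overlapping ws x ws windows,
    returning (B * num_windows, ws * ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: (B * num_windows, ws * ws, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to a single window, so cost grows
    linearly with the number of windows rather than quadratically with the
    full token count."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B_w, N, C) with N = ws * ws tokens per window
        B_w, N, C = x.shape
        qkv = self.qkv(x).reshape(B_w, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (B_w, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B_w, N, C)
        return self.proj(out)

def shifted_window_attention(x, attn, ws, shift):
    """One (shifted-)window attention pass: a cyclic roll before partitioning
    lets tokens near window borders attend across the previous partition's
    boundaries (the masking of wrapped regions is omitted here)."""
    B, H, W, C = x.shape
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    x = window_reverse(attn(window_partition(x, ws)), ws, H, W)
    if shift > 0:
        x = torch.roll(x, shifts=(shift, shift), dims=(1, 2))
    return x

# Toy usage: a 56x56 feature map with 7x7 windows, alternating a regular and a
# shifted pass as consecutive Swin blocks would.
feat = torch.randn(2, 56, 56, 96)
attn = WindowAttention(dim=96, num_heads=3)
out = shifted_window_attention(feat, attn, ws=7, shift=0)  # regular windows
out = shifted_window_attention(out, attn, ws=7, shift=3)   # shifted windows
print(out.shape)  # torch.Size([2, 56, 56, 96])
```

Under these assumptions, attention for a 56x56 map is computed over 64 windows of 49 tokens each instead of one global set of 3,136 tokens, which is the source of the complexity reduction described above.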