Many deep learning architectures exist for semantic segmentation. In this paper, their application to multi-modal remote sensing data is examined. Two well-known network architectures, U-Net and DeepLab V3+, developed originally for RGB image data, are modified to accept additional input channels, such as near infrared or depth information. In both networks, ResNet101 is used as the backbone, while data-preprocessing steps, including data augmentation, are identical. We compare both networks and experiment with different fusion techniques in U-Net and with hyper-parameters for weighting the input channels for fusion in DeepLab V3+. We also evaluate the effect of pre-training on RGB and non-RGB data. The results show a minimally better performance of the DeepLab V3+ model compared to U-Net, while for the certain classes, such as vehicles, U-Net yields a slightly superior accuracy.
|