This paper presents two deep learning models based on a multi-perspective convolutional neural network (CNN) for classifying objects in the context of intelligent transportation systems (ITS). The proposed models categorize objects (such as persons, trucks, motorbikes, cars, and cyclists) accurately, enabling well-informed decisions in multi-object detection in complex scenarios for automotive applications. The custom backbone model is designed through experimentation with the VGG backbone network and incorporates a multilayer prediction head and custom feature extraction blocks for classifying multiple objects in complex scenes. The custom-designed feature extraction backbone, built from multiple blocks, extracts abstract features at multiple scales. The proposed models are lightweight and require fewer computational resources while maintaining high classification performance. A publicly available automotive dataset with 19,800 labeled images has been used. Results show that the VGG-backbone CNN model achieves a classification accuracy of 99.64%, while the custom-backbone CNN achieves 99.46%. The performance of the proposed custom model is also compared against pre-trained benchmark models, and the experimental findings show that the proposed models achieve higher accuracy than the pre-trained models.
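A minimal PyTorch sketch of what such a lightweight, multi-block backbone with a multilayer prediction head could look like; the block sizes, channel counts, and the five-class head are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch: a custom multi-block CNN backbone with a multilayer prediction head.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One feature-extraction block: two 3x3 convolutions followed by pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class CustomBackboneClassifier(nn.Module):
    def __init__(self, num_classes=5):  # person, truck, motorbike, car, cyclist
        super().__init__()
        self.features = nn.Sequential(   # stacked blocks give features at multiple scales
            conv_block(3, 32),
            conv_block(32, 64),
            conv_block(64, 128),
            conv_block(128, 256),
        )
        self.head = nn.Sequential(       # multilayer prediction head
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: classify a batch of 224x224 RGB crops.
logits = CustomBackboneClassifier()(torch.randn(4, 3, 224, 224))
print(logits.shape)  # torch.Size([4, 5])
```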
A robust object detection algorithm is essential when detecting objects in videos and real-time scenarios, where false positives might result in unwanted outcomes. Our goal here is to observe how the Simple Online and Real-time Tracking with a Deep association metric (Deep SORT) algorithm for Multi-Object Tracking (MOT) can be used to minimize false positives from a state-of-the-art detection algorithm such as You Only Look Once (YOLO), by using the Kalman filter approach. An autoencoder-based feature extractor has been used instead of standard CNN networks such as ResNet-50 to further improve the speed of the detector. Other recent MOT algorithms give good results but are not as efficient in real time as the simple yet effective Deep SORT method. Experimental analysis shows how autoencoder-based Deep SORT performs in contrast to native Deep SORT and YOLO in eliminating false positive detections.
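A hedged sketch of how a convolutional autoencoder's encoder could stand in for the standard re-ID CNN in Deep SORT: the encoder output, L2-normalized, serves as the appearance embedding used by the cosine-distance association metric. The layer sizes, 128-d embedding, and 64x64 crop size are illustrative assumptions.

```python
# Sketch: autoencoder whose encoder replaces the re-ID CNN as Deep SORT's feature extractor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoencoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 32 -> 16
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, embed_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 64 * 8 * 8), nn.ReLU(inplace=True),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

    def embed(self, crops):
        """L2-normalized appearance embeddings for the cosine-distance association metric."""
        return F.normalize(self.encoder(crops), dim=1)

# Train with a reconstruction loss on detector crops, then use embed() at tracking time.
model = ConvAutoencoder()
crops = torch.rand(8, 3, 64, 64)               # detections cropped from a frame
recon, _ = model(crops)
loss = F.mse_loss(recon, crops)
print(loss.item(), model.embed(crops).shape)   # scalar, torch.Size([8, 128])
```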
Air quality monitoring plays a vital role in the sustainable development of any country. Continuous monitoring of the major air pollutants and forecasting their variations would help save the environment and improve public health. However, this task becomes challenging with the available observations of air pollutants from on-ground instruments, which have limited spatial coverage. We propose a multimodal deep learning network (M2-APNet) to predict major air pollutants at a global scale from multimodal temporal satellite images. The inputs to the proposed M2-APNet include the satellite image, a digital elevation model (DEM), and other key attributes. The proposed M2-APNet employs a convolutional neural network to extract local features and a bidirectional long short-term memory to obtain longitudinal features from multimodal temporal data. These features are fused to uncover common patterns helpful for regression in predicting the major air pollutants and for categorizing the air quality index (AQI). We have conducted exhaustive experiments to predict air pollutants and AQI across important regions in India by employing multiple temporal modalities. The experimental results further demonstrate the effectiveness of the DEM modality over others in learning to predict major air pollutants and determining the AQI.
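A hedged sketch of a CNN + bidirectional LSTM pipeline in the spirit of the described design: per-timestep CNN features extracted from a stack of satellite bands and DEM, a BiLSTM over the temporal sequence, and a regression head. The channel counts, the five-channel input, and the single-pollutant output are illustrative assumptions.

```python
# Sketch: CNN for local spatial features + BiLSTM for longitudinal features + regression.
import torch
import torch.nn as nn

class CnnBiLstmRegressor(nn.Module):
    def __init__(self, in_channels=5, cnn_dim=128, lstm_hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                 # local spatial features per timestep
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, cnn_dim),
        )
        self.bilstm = nn.LSTM(cnn_dim, lstm_hidden, batch_first=True, bidirectional=True)
        self.regressor = nn.Linear(2 * lstm_hidden, 1)   # predicted pollutant level

    def forward(self, x):
        # x: (batch, time, channels, height, width) temporal multimodal stack
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)
        seq, _ = self.bilstm(feats)              # longitudinal features across timesteps
        return self.regressor(seq[:, -1])        # regress from the last timestep

model = CnnBiLstmRegressor()
out = model(torch.randn(2, 6, 5, 64, 64))        # 6 timesteps, 5 stacked bands/DEM channels
print(out.shape)                                 # torch.Size([2, 1])
```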
Generalized zero-shot learning (GZSL) is the most popular approach to zero-shot learning (ZSL), involving both seen and unseen classes in the classification process. Many existing GZSL approaches for scene classification in remote sensing images use word embeddings that do not effectively describe unseen categories. We explore word embeddings that better describe the classes of remote sensing scenes to improve the classification accuracy of unseen categories. The proposed method uses a data2vec embedding based on self-supervised learning to obtain a continuous and contextualized latent representation. This representation leverages two advantages of the standard transformer architecture. First, targets are not predefined as visual tokens. Second, latent representations preserve contextual information. We conducted experiments on three benchmark scene classification datasets of remote sensing images. The proposed approach demonstrates its efficacy over existing GZSL approaches.
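As a generic illustration of the GZSL setting (not the authors' specific model), the sketch below scores projected visual features against class embeddings of both seen and unseen categories; the class vectors stand in for data2vec-derived class descriptions, and the dimensions and linear projection are assumptions.

```python
# Sketch: a generic GZSL compatibility scorer over seen + unseen class embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompatibilityScorer(nn.Module):
    def __init__(self, visual_dim=2048, embed_dim=768):
        super().__init__()
        self.project = nn.Linear(visual_dim, embed_dim)

    def forward(self, visual_feats, class_embeddings):
        # Cosine similarity between projected image features and class embeddings.
        v = F.normalize(self.project(visual_feats), dim=1)   # (batch, embed_dim)
        c = F.normalize(class_embeddings, dim=1)             # (num_classes, embed_dim)
        return v @ c.t()                                      # (batch, num_classes)

# Seen and unseen class embeddings are concatenated so unseen scenes can also be predicted.
scorer = CompatibilityScorer()
image_feats = torch.randn(4, 2048)                # CNN features of remote sensing scenes
class_embeds = torch.randn(10 + 5, 768)           # 10 seen + 5 unseen class vectors
pred = scorer(image_feats, class_embeds).argmax(dim=1)
print(pred)
```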
Contrary to the World Health Organization's (WHO) and the medical community's projections, Covid-19, which started in Wuhan, China, in December 2019, still doesn't show any signs of progressing to the endemic stage or slowing down any time soon. It continues to wreak havoc on the lives and livelihoods of thousands of people every day. There is general agreement that the best way to contain this dangerous virus is through testing and isolation. Therefore, in these epidemic times, developing an automated Covid-19 detection method is of utmost importance. This study uses three different machine learning classifiers, namely Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR), along with five transfer learning models, namely DenseNet121, DenseNet169, ResNet50, ResNet152V2, and Xception, as feature extraction methods for identifying Covid-19. Five different datasets are used to assess the models' ability to generalize. The findings are encouraging, with the best results coming from the combination of DenseNet121 and DenseNet169 with SVM and LR.
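A hedged sketch of the described transfer-learning pipeline: a pre-trained DenseNet121 (ImageNet weights, top removed, global average pooling) turns each image into a feature vector, and a classical classifier such as an SVM is fit on those features. The image shapes and random placeholder data are assumptions standing in for the actual Covid-19 datasets.

```python
# Sketch: DenseNet121 as a frozen feature extractor + SVM classifier.
import numpy as np
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.applications.densenet import preprocess_input
from sklearn.svm import SVC

# Frozen ImageNet backbone used purely as a feature extractor (1024-d pooled output).
extractor = DenseNet121(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: float array of shape (n, 224, 224, 3) with pixel values in [0, 255]."""
    return extractor.predict(preprocess_input(images), verbose=0)

# Placeholder data standing in for a real Covid-19 image dataset.
X_train = np.random.rand(16, 224, 224, 3) * 255.0
y_train = np.tile([0, 1], 8)                      # 0 = normal, 1 = Covid-19

clf = SVC(kernel="rbf")                           # LR or RF can be swapped in the same way
clf.fit(extract_features(X_train), y_train)
print(clf.predict(extract_features(X_train[:4])))
```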
Action recognition is one of the challenging video understanding tasks in computer vision. Although there has been extensive research on classifying coarse-grained actions, existing methods are still limited in differentiating actions with low inter-class and high intra-class variation. In particular, the sport of table tennis involves shots with high inter-class similarity, subtle variations, occlusion, and view-point variations. While a few datasets have been available for event spotting and shot recognition, these benchmarks are mostly recorded in a constrained environment with a clear view/perception of the shots executed by players. In this paper, we introduce the Table tennis shots 1.0 dataset, consisting of 9000 videos of 6 fine-grained actions collected in an unconstrained manner to analyze the performance of both players. To effectively recognise these different types of table tennis shots, we propose an adaptive spatial and temporal aggregation method that can handle the spatial and temporal interactions concerning the subtle variations among shots and low inter-class variations. Our method consists of three components, namely, (i) a feature extraction module, (ii) a spatial aggregation network, and (iii) a temporal aggregation network. The feature extraction module is a 3D convolutional neural network (3D-CNN) that captures the spatial and temporal characteristics of table tennis shots. To capture the interactions among the elements of the extracted 3D-CNN feature maps efficiently, we employ the spatial aggregation network to obtain a compact spatial representation. We then propose to replace the final global average pooling (GAP) layer with the temporal aggregation network to overcome the loss of motion information caused by averaging temporal features. This temporal aggregation network utilizes the attention mechanism of bidirectional encoder representations from Transformers (BERT) to model the significant temporal interactions among the shots effectively. We demonstrate that our proposed approach improves the performance of existing 3D-CNN methods by ~10% on the Table tennis shots 1.0 dataset. We also show the performance of our approach on other action recognition datasets, namely, UCF-101 and HMDB-51.
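A hedged sketch of the idea of replacing the final global average pooling of a 3D-CNN with a temporal attention head: spatial dimensions are pooled first, and attention is then applied over the remaining temporal features. A standard TransformerEncoder stands in for the BERT-style attention; channel counts, the number of shot classes, and the clip size are illustrative assumptions.

```python
# Sketch: temporal aggregation head that replaces GAP on 3D-CNN feature maps.
import torch
import torch.nn as nn

class TemporalAggregationHead(nn.Module):
    def __init__(self, channels=512, num_classes=6, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, fmap):
        # fmap: (batch, channels, time, height, width) output of the 3D-CNN backbone.
        feats = fmap.mean(dim=(3, 4)).transpose(1, 2)   # pool space only -> (B, T, C)
        feats = self.encoder(feats)                     # temporal interactions via self-attention
        return self.classifier(feats.mean(dim=1))       # aggregate over time, then classify

head = TemporalAggregationHead()
fake_3dcnn_features = torch.randn(2, 512, 8, 7, 7)      # placeholder backbone output
print(head(fake_3dcnn_features).shape)                  # torch.Size([2, 6])
```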
Current encoder-decoder methods for remote sensing image captioning (RSIC) avoid fine-grained structural representation of objects due to the lack of prominent encoding frameworks. This paper proposes a novel structural representative network (SRN) for acquiring fine-grained structures of remote sensing images (RSI) to generate semantically meaningful captions. Initially, we employ the SRN on top of the final layers of the convolutional neural network (CNN) to attain spatially transformed RSI features. A multi-stage decoder is incorporated into the extracted SRN features to produce fine-grained, meaningful captions. The efficacy of our proposed methodology is exhibited on two RSIC datasets, i.e., the Sydney-Captions dataset and the UCM-Captions dataset.
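For illustration only, the sketch below shows the textbook spatial-transformer recipe (a localization network predicting an affine transform, then affine_grid + grid_sample) applied on top of CNN feature maps; this is one generic way to obtain spatially transformed features, not the authors' SRN itself, and the feature-map sizes are assumptions.

```python
# Sketch: generic spatial-transformer block over final CNN feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformBlock(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        # Localization network predicting a 2x3 affine matrix per image.
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 6),
        )
        # Initialize to the identity transform so training starts from untouched features.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, fmap):
        theta = self.loc(fmap).view(-1, 2, 3)
        grid = F.affine_grid(theta, fmap.size(), align_corners=False)
        return F.grid_sample(fmap, grid, align_corners=False)

block = SpatialTransformBlock()
cnn_features = torch.randn(2, 2048, 7, 7)   # e.g. final CNN feature maps of an RSI
print(block(cnn_features).shape)            # torch.Size([2, 2048, 7, 7])
```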
Monitoring airport runways in panchromatic remote sensing images helps both civil and strategic communities make effective use of large-area acquisitions. This paper proposes a novel multimodal semantic segmentation approach for effective delineation of runways in panchromatic remote sensing images. The proposed approach aims to learn complementary information from two modalities, namely, the panchromatic image and the digital elevation model (DEM), to obtain discriminative features of the runway. The fusion of image features and the corresponding terrain information is performed by stacking the image and DEM, leveraging the merits of both Transformer and U-Net architectures. We perform the experiments on Cartosat-1 panchromatic satellite images with the corresponding Cartosat-1 DEM scenes. The experimental results demonstrate a significant contribution of terrain information to the segmentation process in delineating the contours of airport runways effectively.
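A minimal sketch of the input-level fusion described above: the panchromatic image and the co-registered DEM are normalized and stacked as channels before being fed to a segmentation network whose stem accepts two input channels. The normalization scheme and tile size are assumptions.

```python
# Sketch: stacking panchromatic image + DEM as a two-channel segmentation input.
import torch
import torch.nn as nn

def stack_pan_and_dem(pan, dem):
    """pan, dem: (H, W) tensors over the same footprint; returns a (1, 2, H, W) batch."""
    pan = (pan - pan.mean()) / (pan.std() + 1e-6)   # per-tile standardization
    dem = (dem - dem.mean()) / (dem.std() + 1e-6)
    return torch.stack([pan, dem], dim=0).unsqueeze(0)

# Any Transformer/U-Net style backbone works as long as its stem expects 2 channels.
stem = nn.Conv2d(in_channels=2, out_channels=64, kernel_size=3, padding=1)
x = stack_pan_and_dem(torch.rand(256, 256) * 1023, torch.rand(256, 256) * 900)
print(stem(x).shape)   # torch.Size([1, 64, 256, 256])
```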
Aircraft type recognition remains challenging due to the tiny sizes of aircraft and geometric distortions in large-scale panchromatic satellite images. This paper proposes a framework for aircraft type recognition that focuses on shape preservation, spatial transformations, and derivation of geospatial attributes. First, we construct an aircraft segmentation model to obtain masks representing the shape of aircraft by employing a learnable shape-preserving and deformable network in the Mask RCNN architecture. Then, the orientation of the segmented aircraft is determined by estimating the symmetry axes using their gradient information. Besides template matching, we derive the length and width of aircraft using the geotagged information of images to further categorize the types of aircraft. We also present an effective inference mechanism to overcome the issue of partial detection or missing aircraft in large-scale images. The efficacy of the proposed framework is demonstrated on large-scale panchromatic images with ground sampling distances of 0.65 m (C2S).
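A hedged sketch of one way to derive physical length and width from a segmented aircraft mask using the image's ground sampling distance (GSD): the mask pixels are projected onto their principal axes and the pixel extents are scaled by the GSD. The 0.65 m GSD follows the abstract; the principal-axis recipe and the synthetic mask are illustrative assumptions.

```python
# Sketch: length/width in meters from a segmentation mask and the ground sampling distance.
import numpy as np

def aircraft_dimensions(mask, gsd_m=0.65):
    """mask: boolean (H, W) array for one aircraft; returns (length_m, width_m)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    # Principal axes approximate the symmetry axis and its perpendicular.
    _, _, axes = np.linalg.svd(pts, full_matrices=False)
    extents = pts @ axes.T
    length_px = extents[:, 0].max() - extents[:, 0].min()
    width_px = extents[:, 1].max() - extents[:, 1].min()
    return length_px * gsd_m, width_px * gsd_m

# Placeholder mask: a 60 x 20 pixel rectangle, i.e. roughly 38 m x 12 m at 0.65 m GSD.
mask = np.zeros((128, 128), dtype=bool)
mask[50:70, 30:90] = True
print(aircraft_dimensions(mask))
```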