Poster + Paper
VTR: an optimized vision transformer for SAR ATR acceleration on FPGA
7 June 2024
Sachini Wickramasinghe, Dhruv Parikh, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, Carl Busart
Conference Poster
Abstract
Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique in military applications such as remote-sensing image recognition. Vision Transformers (ViTs) are the state of the art in various computer vision applications, outperforming Convolutional Neural Networks (CNNs). However, using ViTs for SAR ATR is challenging for two reasons: (1) standard ViTs require extensive training data to generalize well because of their weak locality inductive bias, while standard SAR datasets contain only a limited number of labeled training samples, reducing the learning capability of ViTs; and (2) ViTs have a high parameter count and are computation-intensive, which makes their deployment on resource-constrained SAR platforms difficult. In this work, we develop a lightweight ViT model that can be trained directly on small datasets without pre-training. To this end, we incorporate the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules into the ViT model. We train this model directly on SAR datasets to evaluate its effectiveness for SAR ATR. The proposed model, VTR (ViT for SAR ATR), is evaluated on three widely used SAR datasets: MSTAR, SynthWakeSAR, and GBSAR. Experimental results show that VTR achieves a classification accuracy of 95.96%, 93.47%, and 99.46% on the MSTAR, SynthWakeSAR, and GBSAR datasets, respectively. VTR matches the accuracy of state-of-the-art models on the MSTAR and GBSAR datasets with 1.1× and 36× smaller model sizes, respectively; on the SynthWakeSAR dataset, it achieves higher accuracy with a 17× smaller model. Further, a novel FPGA accelerator is proposed for VTR to enable real-time SAR ATR. Compared with implementations of VTR on state-of-the-art CPU and GPU platforms, our FPGA implementation reduces latency by 70× and 30×, respectively. For inference on small batch sizes, it also achieves 2× higher throughput than the GPU.
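As an illustration of the two modules named in the abstract, the sketch below implements the core ideas of Shifted Patch Tokenization (patchifying the image concatenated channel-wise with diagonally shifted copies of itself, to increase locality) and Locality Self-Attention (softmax attention with a learnable temperature and the self-token diagonal masked out) in plain NumPy. This is a minimal sketch, not the paper's implementation: the function names are hypothetical, the half-patch shift size is an assumption, and `np.roll` stands in for zero-padded shifts; in a real model the tokens would also be layer-normalized and linearly projected, and the temperature would be a learned parameter.

```python
import numpy as np

def shifted_patch_tokenization(img, patch_size):
    """SPT sketch: concatenate the image with four diagonally shifted
    copies, then split into non-overlapping flattened patch tokens.

    img: (H, W, C) array; H and W assumed divisible by patch_size.
    Returns: (num_patches, patch_size * patch_size * 5C) tokens.
    """
    h, w, c = img.shape
    s = patch_size // 2  # half-patch shift (assumed)
    shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]  # four diagonal directions
    # np.roll wraps around; the original uses zero-padded shifts instead
    views = [img] + [np.roll(img, shift=sh, axis=(0, 1)) for sh in shifts]
    x = np.concatenate(views, axis=-1)  # (H, W, 5C)
    ph, pw = h // patch_size, w // patch_size
    x = x.reshape(ph, patch_size, pw, patch_size, 5 * c)
    return x.transpose(0, 2, 1, 3, 4).reshape(ph * pw, -1)

def locality_self_attention(q, k, v, temperature=1.0):
    """LSA sketch: dot-product attention with a (learnable) temperature
    in place of the fixed sqrt(d) scale, and the diagonal masked so a
    token cannot attend to itself, sharpening attention on other tokens.

    q, k, v: (N, D) arrays for a single head.
    """
    n = q.shape[0]
    scores = (q @ k.T) / temperature
    scores[np.eye(n, dtype=bool)] = -np.inf  # diagonal masking
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v
```

With a 32×32 single-channel SAR chip and 8×8 patches, SPT here produces 16 tokens of dimension 8·8·5 = 320, reflecting the five stacked views.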
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Sachini Wickramasinghe, Dhruv Parikh, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, and Carl Busart "VTR: an optimized vision transformer for SAR ATR acceleration on FPGA", Proc. SPIE 13030, Image Sensing Technologies: Materials, Devices, Systems, and Applications XI, 130300F (7 June 2024); https://doi.org/10.1117/12.3013580
KEYWORDS
Synthetic aperture radar
Data modeling
Matrices
Performance modeling
Education and training
Field programmable gate arrays
Head
