CNN-Transformer hybrid models, which combine the strength of Transformers in capturing global context with that of CNNs in local feature extraction, have become an appealing direction in vision perception. However, hybrid models still face the significant challenge of reducing computational cost while balancing throughput and accuracy. This paper proposes an efficient CNN-Transformer hybrid model, named HTViT, that improves throughput and memory consumption while maintaining high accuracy. Building on the three-stage architecture of LeViT, HTViT introduces a sparse cascaded group attention mechanism and global-local downsampling modules. The sparse cascaded group attention mechanism compresses the keys and values in each attention group through local aggregation, improving throughput and reducing memory consumption. The global-local downsampling module employs multi-scale convolutional downsampling to enhance local features and retain more informative content, improving model performance. Comparison experiments against state-of-the-art efficient hybrid models are conducted on the CIFAR-10, STL-10, and Imagenette datasets. The experimental results demonstrate that HTViT significantly outperforms the baseline model LeViT and achieves a better balance among model size, throughput, memory consumption, and accuracy than other hybrid models.
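To make the key/value compression idea concrete, the following is a minimal, illustrative sketch of a cascaded group attention block in which the keys and values are locally aggregated (here via average pooling) before attention, shrinking the K/V sequence length. All class and parameter names, the choice of average pooling as the local aggregation, and the tensor layout are assumptions for exposition, not HTViT's actual implementation.

```python
# Illustrative sketch only: names, layer choices, and tensor layout are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseCascadedGroupAttention(nn.Module):
    """Cascaded group attention where keys/values are locally aggregated
    (average pooling here) before attention, shrinking the K/V sequence."""

    def __init__(self, dim: int, num_heads: int = 4, kv_stride: int = 2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.kv_stride = kv_stride
        self.scale = self.head_dim ** -0.5
        # Per-head projections; each head sees its own channel slice.
        self.to_q = nn.ModuleList(nn.Conv2d(self.head_dim, self.head_dim, 1) for _ in range(num_heads))
        self.to_k = nn.ModuleList(nn.Conv2d(self.head_dim, self.head_dim, 1) for _ in range(num_heads))
        self.to_v = nn.ModuleList(nn.Conv2d(self.head_dim, self.head_dim, 1) for _ in range(num_heads))
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        chunks = x.chunk(self.num_heads, dim=1)      # one channel slice per head
        outs, carry = [], None
        for i in range(self.num_heads):
            # Cascade: each head also receives the previous head's output.
            xi = chunks[i] if carry is None else chunks[i] + carry
            q = self.to_q[i](xi)
            # Local aggregation compresses the K/V spatial resolution.
            kv_in = F.avg_pool2d(xi, self.kv_stride, self.kv_stride)
            k = self.to_k[i](kv_in)
            v = self.to_v[i](kv_in)
            q = q.flatten(2).transpose(1, 2)         # (B, H*W,   d)
            k = k.flatten(2).transpose(1, 2)         # (B, H'*W', d)
            v = v.flatten(2).transpose(1, 2)
            attn = (q @ k.transpose(1, 2)) * self.scale
            out = attn.softmax(dim=-1) @ v           # (B, H*W, d)
            carry = out.transpose(1, 2).reshape(B, self.head_dim, H, W)
            outs.append(carry)
        return self.proj(torch.cat(outs, dim=1))


# Usage: a 14x14 feature map with 256 channels.
# feats = torch.randn(2, 256, 14, 14)
# y = SparseCascadedGroupAttention(256)(feats)       # -> (2, 256, 14, 14)
```

Because attention cost scales with the product of the query and key/value lengths, reducing the K/V resolution by a stride of 2 cuts the attention matrix to roughly a quarter of its size, which is the source of the throughput and memory savings the abstract refers to.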