Poster + Paper
HTViT: an efficient CNN-Transformer hybrid model with high throughput
22 November 2024
Conference Poster
Abstract
CNN-Transformer hybrid models, which combine the strength of Transformers in capturing global context with that of CNNs in local feature extraction, have become an appealing direction in visual perception. However, hybrid models still face the significant challenge of minimizing computational cost while balancing throughput and accuracy. This paper proposes HTViT, an efficient CNN-Transformer hybrid model that raises throughput and reduces memory consumption while maintaining high accuracy. Building on the three-stage architecture of LeViT, HTViT introduces a sparse cascaded group attention mechanism and global-local downsampling modules. The sparse cascaded group attention mechanism compresses the key and value in each attention group by local aggregation, which improves throughput and lowers memory consumption. The global-local downsampling module introduces multi-scale convolutional downsampling to enhance local features and retain more valuable information, which improves model performance. Comparison experiments with state-of-the-art (SOTA) efficient hybrid models are conducted separately on the CIFAR-10, STL-10, and Imagenette datasets. The experimental results demonstrate that HTViT significantly outperforms the baseline model LeViT and achieves a better balance among model size, throughput, memory consumption, and accuracy than other hybrid models.
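The abstract does not give the details of the sparse cascaded group attention mechanism, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that "local aggregation" can be approximated by average pooling keys and values over non-overlapping r x r windows of the token grid, and that "cascaded" means each group's output is added to the next group's input. The function names, the weight layout, and the pooling choice are all hypothetical.

```python
import numpy as np

def local_aggregate(t, h, w, r):
    """Average-pool tokens t of shape (h*w, d) over non-overlapping r x r windows.

    Assumption: this stands in for the paper's "local aggregation".
    """
    d = t.shape[1]
    grid = t.reshape(h // r, r, w // r, r, d)
    return grid.mean(axis=(1, 3)).reshape(-1, d)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_cascaded_group_attention(x, h, w, weights, r=2):
    """x: (h*w, c) tokens; weights: one (wq, wk, wv) tuple per channel group.

    Hypothetical layout: wq and wk map c/groups -> d_k, wv maps c/groups -> c/groups.
    """
    chunks = np.split(x, len(weights), axis=1)   # split channels into groups
    outputs, carry = [], 0.0
    for chunk, (wq, wk, wv) in zip(chunks, weights):
        feat = chunk + carry                     # cascade: reuse previous group's output
        q = feat @ wq                            # queries keep all h*w tokens
        k = local_aggregate(feat @ wk, h, w, r)  # keys compressed to (h*w)/r^2 tokens
        v = local_aggregate(feat @ wv, h, w, r)  # values compressed the same way
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
        carry = attn @ v
        outputs.append(carry)
    return np.concatenate(outputs, axis=1)

# Toy usage: a 4 x 4 token grid, 8 channels, 2 groups, random weights.
rng = np.random.default_rng(0)
h = w = 4
x = rng.standard_normal((h * w, 8))
weights = [(rng.standard_normal((4, 3)),
            rng.standard_normal((4, 3)),
            rng.standard_normal((4, 4))) for _ in range(2)]
out = sparse_cascaded_group_attention(x, h, w, weights, r=2)
print(out.shape)  # (16, 8)
```

Under these assumptions, each group's attention matrix has shape (h*w, h*w / r^2) instead of (h*w, h*w), so both the score computation and the attention memory shrink by a factor of r^2. This is one plausible reading of how compressing keys and values could raise throughput and lower memory use.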
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Kun Ren, Tianyang Zhang, Xi Li, Yongping Du, and Honggui Han "HTViT: an efficient CNN-Transformer hybrid model with high throughput", Proc. SPIE 13239, Optoelectronic Imaging and Multimedia Technology XI, 132391E (22 November 2024); https://doi.org/10.1117/12.3036323
KEYWORDS
Transformers

Convolution

Computer hardware

Data transmission

Image classification

Object detection

Visualization
