CNN-Transformer hybrid models, which combine the strength of Transformers in capturing global context with that of CNNs in local feature extraction, have become an appealing direction in vision perception. However, hybrid models still face the significant challenge of reducing computational cost while balancing throughput and accuracy. This paper proposes an efficient CNN-Transformer hybrid model, named HTViT, that improves throughput and memory consumption while maintaining high accuracy. Building on the three-stage architecture of LeViT, HTViT introduces a sparse cascaded group attention mechanism and global-local downsampling modules. The sparse cascaded group attention mechanism compresses the keys and values in each attention group through local aggregation, improving throughput and reducing memory consumption. The global-local downsampling module employs multi-scale convolutional downsampling to enhance local features and retain more informative content, improving model performance. Comparison experiments against state-of-the-art efficient hybrid models are conducted on the CIFAR-10, STL-10, and Imagenette datasets. The experimental results demonstrate that HTViT significantly outperforms the baseline model LeViT and achieves a better balance among model size, throughput, memory consumption, and accuracy than other hybrid models.
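To make the key/value compression idea concrete, the following is a minimal, illustrative sketch of a cascaded group attention block in which the keys and values are locally aggregated (here via average pooling) before attention, shrinking the K/V sequence length. All class and parameter names, the choice of average pooling as the local aggregation, and the tensor layout are assumptions for exposition, not HTViT's actual implementation.

```python
# Illustrative sketch only: names, layer choices, and tensor layout are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseCascadedGroupAttention(nn.Module):
    """Cascaded group attention where keys/values are locally aggregated
    (average pooling here) before attention, shrinking the K/V sequence."""

    def __init__(self, dim: int, num_heads: int = 4, kv_stride: int = 2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.kv_stride = kv_stride
        self.scale = self.head_dim ** -0.5
        # Per-head projections; each head sees its own channel slice.
        self.to_q = nn.ModuleList(nn.Conv2d(self.head_dim, self.head_dim, 1) for _ in range(num_heads))
        self.to_k = nn.ModuleList(nn.Conv2d(self.head_dim, self.head_dim, 1) for _ in range(num_heads))
        self.to_v = nn.ModuleList(nn.Conv2d(self.head_dim, self.head_dim, 1) for _ in range(num_heads))
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        chunks = x.chunk(self.num_heads, dim=1)      # one channel slice per head
        outs, carry = [], None
        for i in range(self.num_heads):
            # Cascade: each head also receives the previous head's output.
            xi = chunks[i] if carry is None else chunks[i] + carry
            q = self.to_q[i](xi)
            # Local aggregation compresses the K/V spatial resolution.
            kv_in = F.avg_pool2d(xi, self.kv_stride, self.kv_stride)
            k = self.to_k[i](kv_in)
            v = self.to_v[i](kv_in)
            q = q.flatten(2).transpose(1, 2)         # (B, H*W,   d)
            k = k.flatten(2).transpose(1, 2)         # (B, H'*W', d)
            v = v.flatten(2).transpose(1, 2)
            attn = (q @ k.transpose(1, 2)) * self.scale
            out = attn.softmax(dim=-1) @ v           # (B, H*W, d)
            carry = out.transpose(1, 2).reshape(B, self.head_dim, H, W)
            outs.append(carry)
        return self.proj(torch.cat(outs, dim=1))


# Usage: a 14x14 feature map with 256 channels.
# feats = torch.randn(2, 256, 14, 14)
# y = SparseCascadedGroupAttention(256)(feats)       # -> (2, 256, 14, 14)
```

Because attention cost scales with the product of the query and key/value lengths, reducing the K/V resolution by a stride of 2 cuts the attention matrix to roughly a quarter of its size, which is the source of the throughput and memory savings the abstract refers to.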