Plug-and-play adapter for fusing convolutional neural network with vision transformer
Bin Chen, Xianlian Fan, Shiqian Wu
Abstract

As the size of vision models grows rapidly, the pre-train-then-fine-tune paradigm becomes prohibitively expensive, both in training cost and in the dataset size required during the transfer learning stage. Inspired by parameter-efficient transfer learning methods from the natural language processing (NLP) domain, we propose a lightweight, plug-and-play, trainable adapter (FCT-Adapter) that seamlessly fuses two heterogeneous networks, i.e., a convolutional neural network (CNN) and a Vision Transformer (ViT), and adapts efficiently during transfer learning. The proposed FCT-Adapter bridges the CNN and the ViT without adjusting the structure or parameters of the pre-trained backbones. Using only 0.75 million trainable parameters across 19 tasks, experiments on the Visual Task Adaptation Benchmark show that the model fusing ResNet-50 and ViT-B/16 via FCT-Adapters achieves an average accuracy of 77.4%, which is 12.3% higher than that of a single backbone under full fine-tuning. Because the adapter contains only a small number of trainable parameters (less than 1% of the backbone parameters), the proposed method is plug-and-play and can be quickly deployed for various visual applications at low data and storage cost.
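The abstract does not detail the internal structure of the FCT-Adapter, so the sketch below is only a minimal, hypothetical illustration of the general idea: a bottleneck-style adapter (here in PyTorch) that projects feature maps from a frozen CNN into the token space of a frozen ViT and adds them as a residual, so that only the adapter parameters are trained. The class name, bottleneck width, and fusion point are assumptions, not the authors' design.

```python
# Hypothetical sketch of a plug-and-play adapter bridging a frozen CNN and a frozen ViT.
# The actual FCT-Adapter design is not specified in the abstract; this assumes a
# bottleneck projection from CNN feature maps onto the ViT patch tokens.
import torch
import torch.nn as nn


class FusionAdapter(nn.Module):
    """Bottleneck adapter: CNN feature map -> residual added to ViT patch tokens (assumed design)."""

    def __init__(self, cnn_channels: int, vit_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Conv2d(cnn_channels, bottleneck, kernel_size=1)  # channel reduction
        self.up = nn.Linear(bottleneck, vit_dim)                        # project to ViT token dim
        self.act = nn.GELU()

    def forward(self, cnn_feat: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_feat: (B, C, H, W); vit_tokens: (B, 1 + N, D) with a leading [CLS] token
        n = vit_tokens.shape[1] - 1                      # number of patch tokens
        side = int(n ** 0.5)                             # assume a square patch grid
        x = self.act(self.down(cnn_feat))                # (B, bottleneck, H, W)
        x = nn.functional.adaptive_avg_pool2d(x, side)   # match the ViT patch grid
        x = x.flatten(2).transpose(1, 2)                 # (B, N, bottleneck)
        x = self.up(x)                                   # (B, N, vit_dim)
        out = vit_tokens.clone()
        out[:, 1:, :] = out[:, 1:, :] + x                # residual fusion on patch tokens
        return out


# Only the adapter is trained; both pre-trained backbones stay frozen.
adapter = FusionAdapter(cnn_channels=512, vit_dim=768)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

cnn_feat = torch.randn(2, 512, 14, 14)    # e.g. an intermediate ResNet-50 feature map (assumed shape)
vit_tokens = torch.randn(2, 197, 768)     # ViT-B/16 tokens: [CLS] + 14x14 patches
fused = adapter(cnn_feat, vit_tokens)     # (2, 197, 768)
```

In this reading, one such adapter could be inserted per stage to exchange information between the two backbones while keeping the trainable parameter count well below 1% of the backbones, consistent with the figures reported in the abstract.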

© 2024 SPIE and IS&T
Bin Chen, Xianlian Fan, and Shiqian Wu "Plug-and-play adapter for fusing convolutional neural network with vision transformer," Journal of Electronic Imaging 33(5), 053050 (21 October 2024). https://doi.org/10.1117/1.JEI.33.5.053050
Received: 13 June 2024; Accepted: 19 September 2024; Published: 21 October 2024
KEYWORDS: Transformers; Convolution; Machine learning; Education and training; Visual process modeling; Visualization; Convolutional neural networks