Plug-and-play adapter for fusing convolutional neural network with vision transformer
Bin Chen, Xianlian Fan, Shiqian Wu
Abstract

As the size of vision models grows rapidly, the pre-train-then-fine-tune paradigm becomes prohibitively expensive, both in training cost and in the dataset size required during the transfer learning stage. Inspired by parameter-efficient transfer learning methods from the natural language processing (NLP) domain, we propose a lightweight, plug-and-play, trainable adapter (FCT-Adapter) that seamlessly fuses two heterogeneous networks, i.e., a convolutional neural network (CNN) and a Vision Transformer (ViT), and adapts efficiently during transfer learning. The proposed FCT-Adapter bridges the CNN and the ViT without adjusting the structure or parameters of the pre-trained backbones. Using only 0.75 million trainable parameters across 19 tasks, experiments on the Visual Task Adaptation Benchmark show that the model fusing ResNet-50 and ViT-B/16 via FCT-Adapters achieves an average accuracy of 77.4%, which is 12.3% higher than that of a single backbone under full fine-tuning. Because the adapter contains only a small number of trainable parameters (less than 1% of the backbone parameters), the proposed method is plug-and-play and can be quickly deployed for various visual applications at low data and storage cost.
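The abstract does not detail the internal structure of the FCT-Adapter, so the sketch below is only a minimal, hypothetical illustration of the general idea: a bottleneck-style adapter (here in PyTorch) that projects feature maps from a frozen CNN into the token space of a frozen ViT and adds them as a residual, so that only the adapter parameters are trained. The class name, bottleneck width, and fusion point are assumptions, not the authors' design.

```python
# Hypothetical sketch of a plug-and-play adapter bridging a frozen CNN and a frozen ViT.
# The actual FCT-Adapter design is not specified in the abstract; this assumes a
# bottleneck projection from CNN feature maps onto the ViT patch tokens.
import torch
import torch.nn as nn


class FusionAdapter(nn.Module):
    """Bottleneck adapter: CNN feature map -> residual added to ViT patch tokens (assumed design)."""

    def __init__(self, cnn_channels: int, vit_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Conv2d(cnn_channels, bottleneck, kernel_size=1)  # channel reduction
        self.up = nn.Linear(bottleneck, vit_dim)                        # project to ViT token dim
        self.act = nn.GELU()

    def forward(self, cnn_feat: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_feat: (B, C, H, W); vit_tokens: (B, 1 + N, D) with a leading [CLS] token
        n = vit_tokens.shape[1] - 1                      # number of patch tokens
        side = int(n ** 0.5)                             # assume a square patch grid
        x = self.act(self.down(cnn_feat))                # (B, bottleneck, H, W)
        x = nn.functional.adaptive_avg_pool2d(x, side)   # match the ViT patch grid
        x = x.flatten(2).transpose(1, 2)                 # (B, N, bottleneck)
        x = self.up(x)                                   # (B, N, vit_dim)
        out = vit_tokens.clone()
        out[:, 1:, :] = out[:, 1:, :] + x                # residual fusion on patch tokens
        return out


# Only the adapter is trained; both pre-trained backbones stay frozen.
adapter = FusionAdapter(cnn_channels=512, vit_dim=768)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

cnn_feat = torch.randn(2, 512, 14, 14)    # e.g. an intermediate ResNet-50 feature map (assumed shape)
vit_tokens = torch.randn(2, 197, 768)     # ViT-B/16 tokens: [CLS] + 14x14 patches
fused = adapter(cnn_feat, vit_tokens)     # (2, 197, 768)
```

In this reading, one such adapter could be inserted per stage to exchange information between the two backbones while keeping the trainable parameter count well below 1% of the backbones, consistent with the figures reported in the abstract.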

© 2024 SPIE and IS&T
Bin Chen, Xianlian Fan, and Shiqian Wu "Plug-and-play adapter for fusing convolutional neural network with vision transformer," Journal of Electronic Imaging 33(5), 053050 (21 October 2024). https://doi.org/10.1117/1.JEI.33.5.053050
Received: 13 June 2024; Accepted: 19 September 2024; Published: 21 October 2024
KEYWORDS: Transformers; Convolution; Machine learning; Education and training; Visual process modeling; Visualization; Convolutional neural networks