As vision models grow rapidly in size, the pre-training-then-fine-tuning paradigm becomes prohibitively expensive, both in training cost and in the amount of data required during the transfer learning stage. Inspired by parameter-efficient transfer learning methods in the natural language processing (NLP) domain, we propose a lightweight, plug-and-play, trainable adapter (FCT-Adapter) that seamlessly fuses two heterogeneous networks, a convolutional neural network (CNN) and a Vision Transformer (ViT), and adapts them efficiently during transfer learning. The FCT-Adapter bridges the CNN and the ViT without modifying the structure or parameters of the pre-trained backbones. On the Visual Task Adaptation Benchmark, the model fusing ResNet-50 and ViT-B/16 via FCT-Adapters achieves an average accuracy of 77.4% across 19 tasks using only 0.75 million trainable parameters, which is 12.3% higher than full fine-tuning of a single backbone. Because the adapter contains only a small number of trainable parameters (less than 1% of the backbone parameters), the proposed method is plug-and-play and can be quickly deployed for various visual applications at low data and storage cost.
Keywords: Transformers, Convolution, Machine learning, Education and training, Visual process modeling, Visualization, Convolutional neural networks
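To illustrate the general idea of a trainable adapter bridging a frozen CNN and a frozen ViT, the following is a minimal PyTorch sketch. The internal design of the FCT-Adapter is not specified in this abstract, so the module name CNNToViTAdapter, the bottleneck structure, and the choice of fusing CNN features into the ViT patch tokens are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class CNNToViTAdapter(nn.Module):
    """Hypothetical lightweight bridge: projects CNN feature maps into the
    ViT token space through a bottleneck and adds them to the patch tokens.
    Only this module would be trained; both backbones stay frozen."""
    def __init__(self, cnn_channels=1024, vit_dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Conv2d(cnn_channels, bottleneck, kernel_size=1)  # reduce CNN channels
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, vit_dim)                        # project to ViT width
        nn.init.zeros_(self.up.weight)                                  # start as a near no-op fusion
        nn.init.zeros_(self.up.bias)

    def forward(self, cnn_feat, vit_tokens):
        # cnn_feat: (B, C, H, W); vit_tokens: (B, 1 + N, vit_dim), CLS token first.
        # Assumes the CNN spatial grid matches the ViT patch grid (e.g. 14x14);
        # otherwise the CNN features would need to be interpolated first.
        x = self.act(self.down(cnn_feat))        # (B, bottleneck, H, W)
        x = x.flatten(2).transpose(1, 2)          # (B, H*W, bottleneck)
        x = self.up(x)                             # (B, H*W, vit_dim)
        patch_tokens = vit_tokens[:, 1:] + x       # fuse into patch tokens
        return torch.cat([vit_tokens[:, :1], patch_tokens], dim=1)

# Only the adapter (plus a task head) is optimized; roughly 1e5 parameters here,
# a small fraction of the ~100M+ parameters in the frozen backbones.
adapter = CNNToViTAdapter()
print(sum(p.numel() for p in adapter.parameters() if p.requires_grad))

The zero-initialized up-projection is a common choice in adapter-style modules so that training starts from the unmodified frozen backbones and the fusion is learned gradually.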