Stitched ViTs are Flexible Vision Backbones

ZIP Lab, Monash University
Teaser image.

SN-Netv2 extends the framework of stitchable neural networks (SN-Net) to downstream dense prediction tasks.


Abstract

Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works that utilize off-the-shelf ViTs are inefficient to train and deploy, because adopting ViTs of individual sizes requires separate training runs and is restricted to fixed performance-efficiency trade-offs.

In this paper, we draw inspiration from stitchable neural networks (SN-Net), a framework that cheaply produces a single model covering a rich space of subnetworks by stitching pretrained model families, thereby supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework that facilitates downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling strategy that takes into account the underlying FLOPs distributions in the space for better sampling. Finally, we observe that learning stitching layers as a low-rank update plays an essential role in stabilizing training on downstream tasks and ensuring a good Pareto frontier.
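To make the low-rank idea concrete, below is a minimal PyTorch sketch (not the released implementation) of a stitching layer whose base transform is initialized by least squares from paired anchor activations and then frozen, while only a low-rank correction is trained during downstream finetuning. The class name, rank, initialization scale, and the least-squares recipe are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LowRankStitchingLayer(nn.Module):
    """Hypothetical sketch: maps token features from one ViT to another's embedding space.

    The base weight is initialized by least squares over paired anchor activations and
    frozen; only the low-rank factors A and B are trained (a LoRA-style update).
    """

    def __init__(self, d_in: int, d_out: int, rank: int = 16):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(d_out, d_in), requires_grad=False)
        self.A = nn.Parameter(torch.zeros(d_out, rank))        # trainable, starts at zero
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable

    @torch.no_grad()
    def init_from_anchors(self, x_src: torch.Tensor, x_tgt: torch.Tensor) -> None:
        # x_src: (N, d_in) activations from the smaller anchor,
        # x_tgt: (N, d_out) activations from the larger anchor at the stitching point.
        # Least-squares solve of  x_src @ W^T ~= x_tgt.
        sol = torch.linalg.lstsq(x_src, x_tgt).solution  # (d_in, d_out)
        self.weight.copy_(sol.T)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base mapping plus the trainable low-rank correction A @ B.
        return x @ (self.weight + self.A @ self.B).T


# Example usage: stitch 384-dim (DeiT3-Small) tokens into a 768-dim (DeiT3-Base) block.
if __name__ == "__main__":
    layer = LowRankStitchingLayer(d_in=384, d_out=768, rank=16)
    x_src, x_tgt = torch.randn(1024, 384), torch.randn(1024, 768)
    layer.init_from_anchors(x_src, x_tgt)
    tokens = torch.randn(2, 197, 384)   # (batch, tokens, dim)
    out = layer(tokens)                 # (2, 197, 768)
    print(out.shape)
```

Since the low-rank factors start near zero, the stitched model initially behaves like its least-squares-initialized counterpart and only drifts gradually during finetuning, which is consistent with the training-stabilization role described above.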

With extensive experiments on ImageNet-1K, ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 outperforms SN-Netv1 on downstream dense prediction tasks and serves as a strong, flexible vision backbone, offering significant advantages in both training efficiency and deployment flexibility. Code is available at this https URL.

ImageNet-1K

Based on the DeiT3 model family, we finetune SN-Netv2 on ImageNet-1K for 50 epochs and compare it with SN-Netv1.

ADE20K

Based on DeiT3-Small/Base/Large, we finetune SN-Netv2 on ADE20K for 160K iterations and compare it with SN-Netv1.

COCO-Stuff-10K

Based on DeiT3-Small/Base/Large, we finetune SN-Netv2 on COCO-Stuff-10K for 80K iterations and compare it with SN-Netv1.

Depth Estimation on NYUv2

Training Efficiency Comparison

BibTeX

@article{pan2023snnetv2,
  author    = {Pan, Zizheng and Liu, Jing and He, Haoyu and Cai, Jianfei and Zhuang, Bohan},
  title     = {Stitched ViTs are Flexible Vision Backbones},
  journal   = {arXiv},
  year      = {2023},
}