first commit

admin
2026-05-20 15:05:35 +08:00
commit ac09b26253
2048 changed files with 189478 additions and 0 deletions


@@ -0,0 +1,47 @@
# SAN
> [Side Adapter Network for Open-Vocabulary Semantic Segmentation](https://arxiv.org/abs/2302.12242)
## Introduction
<!-- [ALGORITHM] -->
<a href="https://github.com/MendelXu/SAN">Official Repo</a>
## Abstract
<!-- [ABSTRACT] -->
This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias, which is applied in the CLIP model to recognize the class of masks. This decoupled design benefits CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation.
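
To make the "region recognition" framing concrete, the sketch below turns a set of mask proposals and per-proposal class scores into a per-pixel label map. It is a minimal pure-Python illustration, not the code in `san_head.py`: the function name `combine_proposals`, the flat pixel layout, and the omission of any "void" class are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_proposals(mask_logits, class_logits):
    """Hypothetical helper: fuse Q mask proposals into one label per pixel.

    mask_logits:  Q lists of P floats (one logit per proposal and pixel).
    class_logits: Q lists of C floats (one logit per proposal and class).
    Each pixel gets the class with the highest mask-weighted probability.
    """
    mask_probs = [[sigmoid(v) for v in row] for row in mask_logits]
    class_probs = [softmax(row) for row in class_logits]
    num_queries = len(mask_logits)
    num_pixels = len(mask_logits[0])
    num_classes = len(class_logits[0])
    labels = []
    for p in range(num_pixels):
        scores = [
            sum(class_probs[q][c] * mask_probs[q][p] for q in range(num_queries))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Two proposals over four pixels: the first covers pixels 0-1 and looks like
# class 0, the second covers pixels 2-3 and looks like class 2.
masks = [[5.0, 5.0, -5.0, -5.0], [-5.0, -5.0, 5.0, 5.0]]
classes = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
print(combine_proposals(masks, classes))  # [0, 0, 2, 2]
```

In this toy setup the class scores stand in for what the frozen CLIP model would produce for each proposal; the side network only has to supply good masks.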
<!-- [IMAGE] -->
<div align=center>
<img src="https://github.com/MendelXu/SAN/blob/main/resources/arch.png" width="800"/>
</div>
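
The attention bias mentioned in the abstract can be pictured as an additive term on the attention logits before the softmax. The following sketch shows that idea for a single query and a single head in pure Python. The function `biased_attention` and the toy tensors are assumptions for illustration only; the real model applies per-proposal biases inside the CLIP transformer layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias):
    """Single-query scaled dot-product attention with an additive bias.

    bias holds one float per key; large negative entries suppress a token,
    which is how a mask proposal can steer what the query attends to.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Without bias the two identical keys receive identical weights.
plain, _ = biased_attention(query, keys, values, bias=[0.0, 0.0, 0.0])
# A strongly negative bias removes the second token from the mix.
biased, _ = biased_attention(query, keys, values, bias=[0.0, -1e4, 0.0])
print(plain)
print(biased)
```

Because the bias only changes where attention looks, the CLIP weights themselves can stay frozen.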
## Results and models
### COCO-Stuff164k
| Method | Backbone | Pretrained | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
| ------ | -------- | ------------ | --------- | ------- | -------- | -------------- | ------ | ----- | ------------- | -------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| SAN | ViT-B_16 | CLIP_ViT-B16 | 640x640 | 60000 | 12.61 | - | V100 | 41.93 | 41.77 | [config](https://github.com/open-mmlab/mmsegmentation/blob/main/configs/san/san-vit-b16_coco-stuff164k-640x640.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-b16_20230906-fd0a7684.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-b16_20230906.log) |
| SAN | ViT-L_14 | CLIP_ViT-L14 | 640x640 | 60000 | 22.84 | - | V100 | 45.78 | 43.99 | [config](https://github.com/open-mmlab/mmsegmentation/blob/main/configs/san/san-vit-l14_coco-stuff164k-640x640.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-l14_20230907-a11e098f.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-l14_20230907.log) |
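
For readers unfamiliar with the mIoU columns, the sketch below computes mean intersection-over-union on a tiny flattened label map. It is a simplified illustration, not the evaluation code used to produce the table: the function name `mean_iou` and the ignore label `255` are assumptions, and the table reports the value as a percentage accumulated over the whole validation set.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the target.

    pred and target are flat lists of integer class ids of equal length.
    Pixels whose target equals ignore_index are skipped (assumed convention).
    """
    intersect = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersect[t] += 1
            union[t] += 1
        else:
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersect, union) if u > 0]
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
# Per-class IoU is 1/2, 2/3 and 1, so the mean is 13/18.
print(round(100 * mean_iou(pred, target, num_classes=3), 2))  # 72.22
```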
## Notes
The pretrained weights in config files are converted from open_clip models using `tools/model_converters/clip2mmseg.py`.
## Citation
```bibtex
@inproceedings{xu2023side,
title={Side adapter network for open-vocabulary semantic segmentation},
author={Xu, Mengde and Zhang, Zheng and Wei, Fangyun and Hu, Han and Bai, Xiang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={2945--2954},
year={2023}
}
```


@@ -0,0 +1,61 @@
Collections:
- Name: SAN
  License: Apache License 2.0
  Metadata:
    Training Data:
    - COCO-Stuff 164k
  Paper:
    Title: 'Side Adapter Network for Open-Vocabulary Semantic Segmentation'
    URL: https://arxiv.org/abs/2302.12242
  README: configs/san/README.md
  Frameworks:
  - PyTorch
Models:
- Name: san-vit-b16_coco-stuff164k-640x640
  In Collection: SAN
  Results:
    Task: Semantic Segmentation
    Dataset: COCO-Stuff 164k
    Metrics:
      mIoU: 41.93
      mIoU(ms+flip): 41.77
  Config: configs/san/san-vit-b16_coco-stuff164k-640x640.py
  Metadata:
    Training Data: COCO-Stuff 164k
    Batch Size: 16
    Architecture:
    - SAN
    - ViT
    Training Resources: 8x V100 GPUS
    Memory (GB): 12.61
  Weights: https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-b16_20230906-fd0a7684.pth
  Training log: https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-b16_20230906.log
  Paper:
    Title: 'Side Adapter Network for Open-Vocabulary Semantic Segmentation'
    URL: https://arxiv.org/abs/2302.12242
  Code: https://github.com/open-mmlab/mmsegmentation/blob/dev-1.x/mmseg/models/decode_heads/san_head.py#L470
  Framework: PyTorch
- Name: san-vit-l14_coco-stuff164k-640x640
  In Collection: SAN
  Results:
    Task: Semantic Segmentation
    Dataset: COCO-Stuff 164k
    Metrics:
      mIoU: 45.78
      mIoU(ms+flip): 43.99
  Config: configs/san/san-vit-l14_coco-stuff164k-640x640.py
  Metadata:
    Training Data: COCO-Stuff 164k
    Batch Size: 16
    Architecture:
    - SAN
    - ViT
    Training Resources: 8x V100 GPUS
    Memory (GB): 22.84
  Weights: https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-l14_20230907-a11e098f.pth
  Training log: https://download.openmmlab.com/mmsegmentation/v0.5/san/san-vit-l14_20230907.log
  Paper:
    Title: 'Side Adapter Network for Open-Vocabulary Semantic Segmentation'
    URL: https://arxiv.org/abs/2302.12242
  Code: https://github.com/open-mmlab/mmsegmentation/blob/dev-1.x/mmseg/models/decode_heads/san_head.py#L470
  Framework: PyTorch


@@ -0,0 +1,82 @@
_base_ = [
    '../_base_/models/san_vit-b16.py', '../_base_/datasets/coco-stuff164k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
crop_size = (640, 640)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(
        type='RandomChoiceResize',
        scales=[int(640 * x * 0.1) for x in range(5, 16)],
        resize_type='ResizeShortestEdge',
        max_size=2560),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=1.0),
    dict(type='PhotoMetricDistortion'),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackSegInputs')
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='ResizeShortestEdge', scale=crop_size, max_size=2560),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs')
]
# By default, models are trained on 4 GPUs with 8 images per GPU
train_dataloader = dict(batch_size=8, dataset=dict(pipeline=train_pipeline))
val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
pretrained = 'https://download.openmmlab.com/mmsegmentation/v0.5/san/clip_vit-base-patch16-224_3rdparty-d08f8887.pth'  # noqa
data_preprocessor = dict(
    mean=[122.7709, 116.7460, 104.0937],
    std=[68.5005, 66.6322, 70.3232],
    size_divisor=640,
    test_cfg=dict(size_divisor=32))
model = dict(
    pretrained=pretrained,
    text_encoder=dict(dataset_name='coco-stuff164k'),
    decode_head=dict(num_classes=171))
# training schedule for 60k
train_cfg = dict(
    type='IterBasedTrainLoop',
    max_iters=60000,
    val_interval=500,
    val_begin=55000)
default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        by_epoch=False,
        interval=10000,
        save_best='mIoU'))
# AdamW optimizer, no weight decay for position embedding & layer norm
# in backbone
optim_wrapper = dict(
    _delete_=True,
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW', lr=0.0001, betas=(0.9, 0.999), weight_decay=0.0001),
    paramwise_cfg=dict(
        custom_keys={
            'img_encoder': dict(lr_mult=0.1, decay_mult=1.0),
            'pos_embed': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }),
    loss_scale='dynamic',
    clip_grad=dict(max_norm=0.01, norm_type=2))
param_scheduler = [
    dict(
        type='PolyLR',
        eta_min=0.0,
        power=1.0,
        begin=0,
        end=60000,
        by_epoch=False,
    )
]


@@ -0,0 +1,56 @@
_base_ = [
    '../_base_/models/san_vit-b16.py',
    '../_base_/datasets/pascal_context_59.py', '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_160k.py'
]
crop_size = (640, 640)
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='ResizeShortestEdge', scale=crop_size, max_size=2560),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs')
]
# By default, models are trained on 8 GPUs with 2 images per GPU
train_dataloader = dict(batch_size=2)
val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
data_preprocessor = dict(
    mean=[122.7709, 116.7460, 104.0937],
    std=[68.5005, 66.6322, 70.3232],
    size_divisor=640,
    test_cfg=dict(size_divisor=32))
model = dict(
    data_preprocessor=data_preprocessor,
    pretrained='pretrain/vit_base_patch16_224.pth',
    text_encoder=dict(dataset_name='pascal_context'),
    decode_head=dict(num_classes=59))
# AdamW optimizer, no weight decay for position embedding & layer norm
# in backbone
optim_wrapper = dict(
    _delete_=True,
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW', lr=0.00006, betas=(0.9, 0.999), weight_decay=0.01),
    paramwise_cfg=dict(
        custom_keys={
            'pos_embed': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }))
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1e-6, by_epoch=False, begin=0, end=1500),
    dict(
        type='PolyLR',
        eta_min=0.0,
        power=1.0,
        begin=1500,
        end=160000,
        by_epoch=False,
    )
]


@@ -0,0 +1,65 @@
_base_ = [
    '../_base_/models/san_vit-b16.py',
    '../_base_/datasets/pascal_voc12_aug.py', '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_160k.py'
]
crop_size = (640, 640)
metainfo = dict(
    classes=('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
             'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
             'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'),
    palette=[[128, 0, 0], [0, 128, 0], [128, 128, 0], [0, 0, 128],
             [128, 0, 128], [0, 128, 128], [128, 128, 128], [64, 0, 0],
             [192, 0, 0], [64, 128, 0], [192, 128, 0], [64, 0, 128],
             [192, 0, 128], [64, 128, 128], [192, 128, 128], [0, 64, 0],
             [128, 64, 0], [0, 192, 0], [128, 192, 0], [0, 64, 128]])
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='ResizeShortestEdge', scale=crop_size, max_size=2560),
    dict(type='LoadAnnotations'),
    dict(type='PackSegInputs')
]
# By default, models are trained on 8 GPUs with 2 images per GPU
train_dataloader = dict(batch_size=2)
val_dataloader = dict(
    batch_size=1, dataset=dict(metainfo=metainfo, pipeline=test_pipeline))
test_dataloader = val_dataloader
data_preprocessor = dict(
    mean=[122.7709, 116.7460, 104.0937],
    std=[68.5005, 66.6322, 70.3232],
    size_divisor=640,
    test_cfg=dict(size_divisor=32))
model = dict(
    data_preprocessor=data_preprocessor,
    pretrained='pretrain/vit_base_patch16_224.pth',
    text_encoder=dict(dataset_name='voc'),
    decode_head=dict(num_classes=20))
# AdamW optimizer, no weight decay for position embedding & layer norm
# in backbone
optim_wrapper = dict(
    _delete_=True,
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW', lr=0.00006, betas=(0.9, 0.999), weight_decay=0.01),
    paramwise_cfg=dict(
        custom_keys={
            'pos_embed': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }))
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1e-6, by_epoch=False, begin=0, end=1500),
    dict(
        type='PolyLR',
        eta_min=0.0,
        power=1.0,
        begin=1500,
        end=160000,
        by_epoch=False,
    )
]


@@ -0,0 +1,36 @@
_base_ = ['./san-vit-b16_coco-stuff164k-640x640.py']
pretrained = 'https://download.openmmlab.com/mmsegmentation/v0.5/san/clip_vit-large-patch14-336_3rdparty-0b5df9cb.pth'  # noqa
model = dict(
    type='MultimodalEncoderDecoder',
    pretrained=pretrained,
    encoder_resolution=0.7,
    image_encoder=dict(
        type='VisionTransformer',
        img_size=(336, 336),
        patch_size=14,
        patch_pad=0,
        embed_dims=1024,
        num_layers=18,
        num_heads=16,
        out_indices=(5, 11, 17),
    ),
    text_encoder=dict(
        type='CLIPTextEncoder',
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        output_dims=768,
    ),
    decode_head=dict(
        type='SideAdapterCLIPHead',
        san_cfg=dict(clip_channels=1024, cfg_decoder=dict(num_heads=16)),
        maskgen_cfg=dict(
            num_layers=6,
            embed_dims=1024,
            num_heads=16,
            out_dims=768,
        )))
# By default, models are trained on 8 GPUs with 4 images per GPU
train_dataloader = dict(batch_size=4)


@@ -0,0 +1,32 @@
_base_ = ['./san-vit-b16_pascal_context-640x640.py']
model = dict(
    type='MultimodalEncoderDecoder',
    pretrained='pretrain/jx_vit_base_p16_224-80ecf9dd.pth',
    encoder_resolution=0.7,
    image_encoder=dict(
        type='VisionTransformer',
        img_size=(336, 336),
        patch_size=14,
        patch_pad=0,
        embed_dims=1024,
        num_layers=18,
        num_heads=16,
        out_indices=(5, 11, 17),
    ),
    text_encoder=dict(
        type='CLIPTextEncoder',
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        output_dims=768,
    ),
    decode_head=dict(
        type='SideAdapterCLIPHead',
        san_cfg=dict(clip_channels=1024, cfg_decoder=dict(num_heads=16)),
        maskgen_cfg=dict(
            num_layers=6,
            embed_dims=1024,
            num_heads=16,
            out_dims=768,
        )))


@@ -0,0 +1,32 @@
_base_ = ['./san-vit-b16_voc12aug-640x640.py']
model = dict(
    type='MultimodalEncoderDecoder',
    pretrained='pretrain/jx_vit_base_p16_224-80ecf9dd.pth',
    encoder_resolution=0.7,
    image_encoder=dict(
        type='VisionTransformer',
        img_size=(336, 336),
        patch_size=14,
        patch_pad=0,
        embed_dims=1024,
        num_layers=18,
        num_heads=16,
        out_indices=(5, 11, 17),
    ),
    text_encoder=dict(
        type='CLIPTextEncoder',
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        output_dims=768,
    ),
    decode_head=dict(
        type='SideAdapterCLIPHead',
        san_cfg=dict(clip_channels=1024, cfg_decoder=dict(num_heads=16)),
        maskgen_cfg=dict(
            num_layers=6,
            embed_dims=1024,
            num_heads=16,
            out_dims=768,
        )))