first commit
Seg_All_In_One_MMSeg/configs/dpt/README.md
Normal file
@@ -0,0 +1,67 @@
# DPT

> [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413)

## Introduction

<!-- [ALGORITHM] -->

<a href="https://github.com/isl-org/DPT">Official Repo</a>

<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.17.0/mmseg/models/decode_heads/dpt_head.py#L215">Code Snippet</a>

## Abstract

<!-- [ABSTRACT] -->

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at [this https URL](https://github.com/isl-org/DPT).

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/24582831/142901057-00aabea5-dab4-43d3-a14a-5f73eb5dd9b9.png" width="80%"/>
</div>

## Usage

To use pre-trained models from other repositories, you need to convert their keys.

We provide a script [`vit2mmseg.py`](../../tools/model_converters/vit2mmseg.py) in the tools directory to convert the keys of models from [timm](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py) to MMSegmentation style.

```shell
python tools/model_converters/vit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```

E.g.

```shell
python tools/model_converters/vit2mmseg.py https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth pretrain/jx_vit_base_p16_224-80ecf9dd.pth
```

This script converts the model from `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.
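
As a rough illustration of what "converting keys" means, the sketch below renames the keys of a checkpoint-like dictionary with a small prefix table. The `RENAMES` table and the `convert_keys` helper are hypothetical and cover only a couple of illustrative patterns; the mapping that is actually applied lives in `vit2mmseg.py`. Plain floats stand in for weight tensors so the example runs without PyTorch.

```python
from collections import OrderedDict

# Hypothetical, incomplete rename rules for illustration only;
# consult vit2mmseg.py for the mapping the script really applies.
RENAMES = [
    ("patch_embed.proj.", "patch_embed.projection."),
    ("blocks.", "layers."),
]


def convert_keys(state_dict, renames=RENAMES, drop_prefixes=("head.",)):
    """Return a new ordered dict whose keys follow the renamed scheme."""
    converted = OrderedDict()
    for key, value in state_dict.items():
        # Skip entries (e.g. a classification head) the target model lacks.
        if key.startswith(drop_prefixes):
            continue
        new_key = key
        for old, new in renames:
            if new_key.startswith(old):
                new_key = new + new_key[len(old):]
        converted[new_key] = value
    return converted


# Toy checkpoint: plain floats stand in for weight tensors.
toy = OrderedDict([
    ("patch_embed.proj.weight", 0.1),
    ("blocks.0.attn.qkv.weight", 0.2),
    ("head.weight", 0.3),
])
print(list(convert_keys(toy)))
```

Only the key names change; the values are carried over untouched, which is why such a conversion can be done once and the result stored at `STORE_PATH`.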

## Results and models

### ADE20K

| Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
| ------ | -------- | --------- | ------: | -------- | -------------- | ------ | ----: | ------------: | ------ | -------- |
| DPT | ViT-B | 512x512 | 160000 | 8.09 | 10.41 | V100 | 46.97 | 48.34 | [config](https://github.com/open-mmlab/mmsegmentation/blob/main/configs/dpt/dpt_vit-b16_8xb2-160k_ade20k-512x512.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/dpt/dpt_vit-b16_512x512_160k_ade20k/dpt_vit-b16_512x512_160k_ade20k-db31cf52.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/dpt/dpt_vit-b16_512x512_160k_ade20k/dpt_vit-b16_512x512_160k_ade20k-20210809_172025.log.json) |

## Citation

```bibtex
@article{dosovitskiy2020,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}

@article{Ranftl2021,
  author = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
  title = {Vision Transformers for Dense Prediction},
  journal = {ArXiv preprint},
  year = {2021},
}
```
@@ -0,0 +1,40 @@
_base_ = [
    '../_base_/models/dpt_vit-b16.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
crop_size = (512, 512)
data_preprocessor = dict(size=crop_size)
model = dict(data_preprocessor=data_preprocessor)
# AdamW optimizer, no weight decay for position embedding & layer norm
# in backbone

optim_wrapper = dict(
    _delete_=True,
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW', lr=0.00006, betas=(0.9, 0.999), weight_decay=0.01),
    # Custom per-parameter settings
    paramwise_cfg=dict(
        custom_keys={
            'pos_embed': dict(decay_mult=0.),  # positional embeddings; decay_mult=0. means no weight decay is applied to them
            'cls_token': dict(decay_mult=0.),  # the token that some models (e.g. Transformer or BERT) use for classification
            'norm': dict(decay_mult=0.)  # weight decay is also disabled for normalization-layer parameters
        }))

param_scheduler = [
    dict(
        type='LinearLR', start_factor=1e-6, by_epoch=False, begin=0, end=1500),
    dict(
        type='PolyLR',
        eta_min=0.0,
        power=1.0,
        begin=1500,
        end=160000,
        by_epoch=False,
    )
]

# By default, models are trained on 8 GPUs with 2 images per GPU
train_dataloader = dict(batch_size=2, num_workers=2)
val_dataloader = dict(batch_size=1, num_workers=4)
test_dataloader = val_dataloader
Seg_All_In_One_MMSeg/configs/dpt/metafile.yaml
Normal file
@@ -0,0 +1,37 @@
Collections:
- Name: DPT
  License: Apache License 2.0
  Metadata:
    Training Data:
    - ADE20K
  Paper:
    Title: Vision Transformers for Dense Prediction
    URL: https://arxiv.org/abs/2103.13413
  README: configs/dpt/README.md
  Frameworks:
  - PyTorch
Models:
- Name: dpt_vit-b16_8xb2-160k_ade20k-512x512
  In Collection: DPT
  Results:
    Task: Semantic Segmentation
    Dataset: ADE20K
    Metrics:
      mIoU: 46.97
      mIoU(ms+flip): 48.34
  Config: configs/dpt/dpt_vit-b16_8xb2-160k_ade20k-512x512.py
  Metadata:
    Training Data: ADE20K
    Batch Size: 16
    Architecture:
    - ViT-B
    - DPT
    Training Resources: 8x V100 GPUS
    Memory (GB): 8.09
  Weights: https://download.openmmlab.com/mmsegmentation/v0.5/dpt/dpt_vit-b16_512x512_160k_ade20k/dpt_vit-b16_512x512_160k_ade20k-db31cf52.pth
  Training log: https://download.openmmlab.com/mmsegmentation/v0.5/dpt/dpt_vit-b16_512x512_160k_ade20k/dpt_vit-b16_512x512_160k_ade20k-20210809_172025.log.json
  Paper:
    Title: Vision Transformers for Dense Prediction
    URL: https://arxiv.org/abs/2103.13413
  Code: https://github.com/open-mmlab/mmsegmentation/blob/v0.17.0/mmseg/models/decode_heads/dpt_head.py#L215
  Framework: PyTorch
@@ -0,0 +1,95 @@
_base_ = [
    '../_base_/models/dpt_vit-b16.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]

norm_cfg = dict(
    type='BN',
)

crop_size = (1024, 1024)

data_preprocessor = dict(
    size=(1024, 1024),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)

model = dict(
    data_preprocessor=dict(
        size=(1024, 1024),
        mean=[
            94.94709810464303,
            61.72942233949928,
            75.93763705236906,
        ],
        std=[
            44.005506081132594,
            42.69595666984776,
            44.99354156225523,
        ],
        bgr_to_rgb=False,
    ),
    pretrained='./My_Local_Model/pretrain/vit-b16_p16_224-80ecf9dd.pth',
    decode_head=dict(
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0
        ),
        num_classes=36,
    ),
)

optim_wrapper = dict(
    type='OptimWrapper',
    _delete_=True,
    optimizer=dict(
        type='AdamW',
        lr=0.0001,
        weight_decay=0.0005,
    ),
    clip_grad=dict(
        max_norm=1,
        norm_type=2,
    ),
    paramwise_cfg=dict(
        # MMEngine's default optimizer-wrapper constructor reads
        # per-parameter overrides from `custom_keys`.
        custom_keys=dict(
            pos_embed=dict(
                decay_mult=0.0,
            ),
            cls_token=dict(
                decay_mult=0.0,
            ),
            norm=dict(
                decay_mult=0.0,
            ),
        ),
    ),
)

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]

@@ -0,0 +1,101 @@
_base_ = [
    '../_base_/models/dpt_vit-b16.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]

norm_cfg = dict(
    type='BN',
)

crop_size = (1024, 1024)

data_preprocessor = dict(
    size=(1024, 1024),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)

model = dict(
    data_preprocessor=dict(
        size=(1024, 1024),
        mean=[
            94.94709810464303,
            61.72942233949928,
            75.93763705236906,
        ],
        std=[
            44.005506081132594,
            42.69595666984776,
            44.99354156225523,
        ],
        bgr_to_rgb=False,
    ),
    pretrained='./My_Local_Model/pretrain/vit-b16_p16_224-80ecf9dd.pth',
    decode_head=dict(
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
        num_classes=36,
    ),
)

optim_wrapper = dict(
    type='OptimWrapper',
    _delete_=True,
    optimizer=dict(
        type='AdamW',
        lr=0.0001,
        weight_decay=0.0005,
    ),
    clip_grad=dict(
        max_norm=1,
        norm_type=2,
    ),
    paramwise_cfg=dict(
        # MMEngine's default optimizer-wrapper constructor reads
        # per-parameter overrides from `custom_keys`.
        custom_keys=dict(
            pos_embed=dict(
                decay_mult=0.0,
            ),
            cls_token=dict(
                decay_mult=0.0,
            ),
            norm=dict(
                decay_mult=0.0,
            ),
        ),
    ),
)

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]

train_dataloader = dict(
    batch_size=1,
)

@@ -0,0 +1,101 @@
_base_ = [
    '../_base_/models/dpt_vit-b16.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]

norm_cfg = dict(
    type='BN',
)

crop_size = (512, 512)

data_preprocessor = dict(
    size=(512, 512),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)

model = dict(
    data_preprocessor=dict(
        size=(512, 512),
        mean=[
            94.94709810464303,
            61.72942233949928,
            75.93763705236906,
        ],
        std=[
            44.005506081132594,
            42.69595666984776,
            44.99354156225523,
        ],
        bgr_to_rgb=False,
    ),
    pretrained='./My_Local_Model/pretrain/vit-b16_p16_224-80ecf9dd.pth',
    decode_head=dict(
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
        num_classes=36,
    ),
)

optim_wrapper = dict(
    type='OptimWrapper',
    _delete_=True,
    optimizer=dict(
        type='AdamW',
        lr=0.0001,
        weight_decay=0.0005,
    ),
    clip_grad=dict(
        max_norm=1,
        norm_type=2,
    ),
    paramwise_cfg=dict(
        # MMEngine's default optimizer-wrapper constructor reads
        # per-parameter overrides from `custom_keys`.
        custom_keys=dict(
            pos_embed=dict(
                decay_mult=0.0,
            ),
            cls_token=dict(
                decay_mult=0.0,
            ),
            norm=dict(
                decay_mult=0.0,
            ),
        ),
    ),
)

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]

train_dataloader = dict(
    batch_size=28,
)

@@ -0,0 +1,101 @@
_base_ = [
    '../_base_/models/dpt_vit-b16.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]

norm_cfg = dict(
    type='BN',
)

crop_size = (1024, 1024)

data_preprocessor = dict(
    size=(1024, 1024),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)

model = dict(
    data_preprocessor=dict(
        size=(1024, 1024),
        mean=[
            94.94709810464303,
            61.72942233949928,
            75.93763705236906,
        ],
        std=[
            44.005506081132594,
            42.69595666984776,
            44.99354156225523,
        ],
        bgr_to_rgb=False,
    ),
    pretrained='./My_Local_Model/pretrain/vit-b16_p16_224-80ecf9dd.pth',
    decode_head=dict(
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
        num_classes=36,
    ),
)

optim_wrapper = dict(
    type='OptimWrapper',
    _delete_=True,
    optimizer=dict(
        type='AdamW',
        lr=0.0001,
        weight_decay=0.0005,
    ),
    clip_grad=dict(
        max_norm=1,
        norm_type=2,
    ),
    paramwise_cfg=dict(
        # MMEngine's default optimizer-wrapper constructor reads
        # per-parameter overrides from `custom_keys`.
        custom_keys=dict(
            pos_embed=dict(
                decay_mult=0.0,
            ),
            cls_token=dict(
                decay_mult=0.0,
            ),
            norm=dict(
                decay_mult=0.0,
            ),
        ),
    ),
)

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]

train_dataloader = dict(
    batch_size=4,
)
