first commit
82
Seg_All_In_One_MMSeg/configs/mae/README.md
Normal file
@@ -0,0 +1,82 @@
# MAE

> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

## Introduction

<!-- [BACKBONE] -->

<a href="https://github.com/facebookresearch/mae">Official Repo</a>

<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46">Code Snippet</a>

## Abstract

<!-- [ABSTRACT] -->

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>
</div>

## Usage

To use pre-trained models from other repositories, their keys must first be converted.

We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the keys of MAE models from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.

```shell
python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```

E.g.

```shell
python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth
```

This script converts the model from `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.
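
As a rough illustration of what "converting keys" means, the sketch below renames entries of a checkpoint-like dictionary. The rename rules and the helper `convert_keys` are hypothetical and chosen only for illustration; the actual rules live in `beit2mmseg.py` and may differ.

```python
from collections import OrderedDict

# Hypothetical rename rules (assumption, for illustration only);
# the real mapping is defined in tools/model_converters/beit2mmseg.py.
RENAME_RULES = [
    ("patch_embed.proj", "patch_embed.projection"),
    ("blocks.", "layers."),
    ("mlp.fc1", "ffn.layers.0.0"),
    ("mlp.fc2", "ffn.layers.1"),
]


def convert_keys(state_dict):
    """Return a new ordered dict with every key rewritten by RENAME_RULES."""
    converted = OrderedDict()
    for key, value in state_dict.items():
        new_key = key
        for old, new in RENAME_RULES:
            new_key = new_key.replace(old, new)
        converted[new_key] = value
    return converted


# Toy "checkpoint": integers stand in for weight tensors.
demo = OrderedDict([
    ("patch_embed.proj.weight", 0),
    ("blocks.0.mlp.fc1.bias", 1),
    ("cls_token", 2),
])
converted = convert_keys(demo)
print(list(converted))
```

Keys that match no rule (here `cls_token`) pass through unchanged, so the converted checkpoint keeps every original entry.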

In our default setting, the pretrained models are defined as follows:

| pretrained models               | original models                                                                                  |
| ------------------------------- | ------------------------------------------------------------------------------------------------ |
| mae_pretrain_vit_base_mmcls.pth | ['mae_pretrain_vit_base'](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) |

Verify the single-scale results of the model:

```shell
sh tools/dist_test.sh \
configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

Because the relative position embedding requires the input height and width to be equal, sliding-window inference is used for multi-scale testing. We therefore set `min_size=512`, which makes the shortest edge 512. As a result, multi-scale inference is run with a separate config rather than with `--aug-test`. For multi-scale inference:

```shell
sh tools/dist_test.sh \
configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```
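
To make the sliding-window idea concrete, here is a minimal sketch that enumerates square crops over an image using the `crop_size=(512, 512)` and `stride=(341, 341)` values from the configs in this commit. The function `slide_windows` is a hypothetical helper, and the grid rule is an assumption about how such inference is typically laid out, not a copy of the library code.

```python
def slide_windows(img_h, img_w, crop=(512, 512), stride=(341, 341)):
    """List (y1, x1, y2, x2) crops that cover an image of size img_h x img_w."""
    crop_h, crop_w = crop
    stride_h, stride_w = stride
    # Number of window positions per axis (ceil division, at least one).
    rows = max(img_h - crop_h + stride_h - 1, 0) // stride_h + 1
    cols = max(img_w - crop_w + stride_w - 1, 0) // stride_w + 1
    windows = []
    for r in range(rows):
        for c in range(cols):
            # Clamp the far edge to the image, then shift the near edge back
            # so that every crop keeps the full crop size when possible.
            y2 = min(r * stride_h + crop_h, img_h)
            x2 = min(c * stride_w + crop_w, img_w)
            y1 = max(y2 - crop_h, 0)
            x1 = max(x2 - crop_w, 0)
            windows.append((y1, x1, y2, x2))
    return windows


windows = slide_windows(512, 2048)
for box in windows:
    print(box)
```

Every crop is square, which is what the equal height and width requirement needs; overlapping predictions would then be averaged per pixel.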
|
||||
|
||||
## Results and models
|
||||
|
||||
### ADE20K
|
||||
|
||||
| Method | Backbone | Crop Size | pretrain | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
|
||||
| ------- | -------- | --------- | ----------- | ----------------- | ---------- | ------- | -------- | -------------- | ------ | ----- | ------------: | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| UPerNet | ViT-B | 512x512 | ImageNet-1K | 224x224 | 16 | 160000 | 9.96 | 7.14 | V100 | 48.13 | 48.70 | [config](https://github.com/open-mmlab/mmsegmentation/blob/main/configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) |
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
|
||||
@article{he2021masked,
|
||||
title={Masked autoencoders are scalable vision learners},
|
||||
author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},
|
||||
journal={arXiv preprint arXiv:2111.06377},
|
||||
year={2021}
|
||||
}
|
||||
```
|
||||
@@ -0,0 +1,16 @@
_base_ = './mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py'

test_pipeline = [
    dict(type='LoadImageFromFile'),
    # TODO: Refactor 'MultiScaleFlipAug' which supports
    # `min_size` feature in `Resize` class
    # img_ratios is [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
    # original image scale is (2048, 512)
    dict(type='Resize', scale=(2048, 512), keep_ratio=True),
    # add loading annotation after ``Resize`` because ground truth
    # does not need to do resize data transform
    dict(type='LoadAnnotations', reduce_zero_label=True),
    dict(type='PackSegInputs')
]
val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
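
The comments in the pipeline above mention `img_ratios` and a base scale of `(2048, 512)`. The following sketch shows how those two values could combine into per-ratio target scales, and how a keep-ratio resize fits an image inside one scale. `rescale_keep_ratio` is a hypothetical helper; its rule (fit the long edge to the larger value and the short edge to the smaller one, whichever is tighter) is assumed to match the usual keep-ratio behaviour and is not taken from this commit.

```python
IMG_RATIOS = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
BASE_SCALE = (2048, 512)

# Target scale for each ratio, as the TODO comment in the config describes.
scales = [(int(BASE_SCALE[0] * r), int(BASE_SCALE[1] * r)) for r in IMG_RATIOS]


def rescale_keep_ratio(width, height, scale=BASE_SCALE):
    """Resize (width, height) to fit inside `scale` without changing aspect."""
    max_long, max_short = max(scale), min(scale)
    # Pick the tighter of the long-edge and short-edge constraints.
    factor = min(max_long / max(width, height), max_short / min(width, height))
    return int(width * factor + 0.5), int(height * factor + 0.5)


print(scales)
print(rescale_keep_ratio(1000, 750))
```

For typical ADE20K-like aspect ratios the short-edge constraint is the binding one, so the shortest edge lands on 512 at ratio 1.0.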
@@ -0,0 +1,53 @@
_base_ = [
    '../_base_/models/upernet_mae.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
crop_size = (512, 512)
data_preprocessor = dict(size=crop_size)
model = dict(
    data_preprocessor=data_preprocessor,
    pretrained='./pretrain/mae_pretrain_vit_base_mmcls.pth',
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        patch_size=16,
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        mlp_ratio=4,
        init_values=1.0,
        drop_path_rate=0.1,
        out_indices=[3, 5, 7, 11]),
    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),
    decode_head=dict(in_channels=[768, 768, 768, 768], num_classes=150, channels=768),
    auxiliary_head=dict(in_channels=768, num_classes=150),
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))

optim_wrapper = dict(
    _delete_=True,
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW', lr=1e-4, betas=(0.9, 0.999), weight_decay=0.05),
    paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.65),
    constructor='LayerDecayOptimizerConstructor')

param_scheduler = [
    dict(
        type='LinearLR', start_factor=1e-6, by_epoch=False, begin=0, end=1500),
    dict(
        type='PolyLR',
        eta_min=0.0,
        power=1.0,
        begin=1500,
        end=160000,
        by_epoch=False,
    )
]

# mixed precision
fp16 = dict(loss_scale='dynamic')

# The published 8xb2 results use 8 GPUs with 2 images per GPU; this config uses 4 per GPU
train_dataloader = dict(batch_size=4)
val_dataloader = dict(batch_size=1)
test_dataloader = val_dataloader
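
The `paramwise_cfg` above sets `num_layers=12` and `layer_decay_rate=0.65`. As a hedged sketch of what layer-wise learning-rate decay does with those numbers, the snippet below computes one multiplier per depth group. The grouping (group 0 for embeddings, groups 1 to 12 for transformer blocks, one final group for everything else) follows the common BEiT-style scheme and is an assumption here; `layer_lr_scales` is a hypothetical helper, not the constructor's API.

```python
BASE_LR = 1e-4
NUM_LAYERS = 12
DECAY_RATE = 0.65


def layer_lr_scales(num_layers=NUM_LAYERS, decay_rate=DECAY_RATE):
    """Return one LR multiplier per group, shallowest group first.

    Assumed grouping: 0 = embeddings, 1..num_layers = transformer blocks,
    num_layers + 1 = remaining parameters (neck, heads).
    """
    total_groups = num_layers + 2
    return [decay_rate ** (total_groups - 1 - group)
            for group in range(total_groups)]


scales = layer_lr_scales()
for group, scale in enumerate(scales):
    print(f"group {group:2d}: lr = {BASE_LR * scale:.3e}")
```

With a rate below 1, the earliest pretrained layers move the least and the freshly initialised heads train at the full base rate; a rate closer to 1 (such as 0.9) flattens that gradient.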
25
Seg_All_In_One_MMSeg/configs/mae/metafile.yaml
Normal file
@@ -0,0 +1,25 @@
Models:
- Name: mae-base_upernet_8xb2-amp-160k_ade20k-512x512
  In Collection: UPerNet
  Results:
    Task: Semantic Segmentation
    Dataset: ADE20K
    Metrics:
      mIoU: 48.13
      mIoU(ms+flip): 48.7
  Config: configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py
  Metadata:
    Training Data: ADE20K
    Batch Size: 16
    Architecture:
    - ViT-B
    - UPerNet
    Training Resources: 8x V100 GPUS
    Memory (GB): 9.96
  Weights: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth
  Training log: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json
  Paper:
    Title: Masked Autoencoders Are Scalable Vision Learners
    URL: https://arxiv.org/abs/2111.06377
  Code: https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46
  Framework: PyTorch
@@ -0,0 +1,130 @@
_base_ = [
    '../_base_/models/upernet_mae.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]

norm_cfg = dict(
    type='BN',
)

crop_size = (512, 512)

data_preprocessor = dict(
    size=(512, 512),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)

model = dict(
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        init_values=1.0,
        # out_indices=[3, 5, 7, 11]
    ),
    data_preprocessor=dict(
        size=(512, 512),
        mean=[
            94.94709810464303,
            61.72942233949928,
            75.93763705236906,
        ],
        std=[
            44.005506081132594,
            42.69595666984776,
            44.99354156225523,
        ],
        bgr_to_rgb=False,
    ),
    pretrained='./My_Local_Model/pretrain/mae_pretrain_vit_base_mmcls.pth',
    decode_head=dict(
        in_channels=[
            768,
            768,
            768,
            768,
        ],
        channels=768,
        num_classes=36,
        norm_cfg=dict(
            type='BN',
        ),
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
    ),
    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),  # out_indices=[3, 5, 7, 11]
    auxiliary_head=dict(
        in_channels=768,
        norm_cfg=dict(
            type='BN',
        ),
        num_classes=36,
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=0.4,
        ),
    ),
    test_cfg=dict(
        mode='slide',
        crop_size=(512, 512),
        stride=(341, 341),
    ),
)

fp16 = dict(
    loss_scale='dynamic',
)

optim_wrapper = dict(
    _delete_=True,
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW',
        lr=3e-05,
        betas=(0.9, 0.999),
        weight_decay=0.05,
    ),
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(
        num_layers=12,
        layer_decay_rate=0.9,
    ),
)

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]

train_dataloader = dict(
    batch_size=2,
)
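
The `param_scheduler` in the config above chains a linear warmup (iterations 0 to 1500) with a polynomial decay (1500 to 40000). The sketch below evaluates that schedule in closed form for the base learning rate of `3e-05`. It is an idealised approximation: it ignores the layer-wise decay multipliers and any off-by-one bookkeeping the scheduler classes may apply, and `lr_at` is a hypothetical helper.

```python
BASE_LR = 3e-5
START_FACTOR = 1e-6
WARMUP_ITERS = 1500
MAX_ITERS = 40000
POWER = 0.9
ETA_MIN = 1e-5


def lr_at(iteration):
    """Approximate base learning rate at a given training iteration."""
    if iteration < WARMUP_ITERS:
        # Linear warmup: the factor grows from START_FACTOR to 1.0.
        factor = START_FACTOR + (1.0 - START_FACTOR) * iteration / WARMUP_ITERS
        return BASE_LR * factor
    # Polynomial decay from BASE_LR down to ETA_MIN.
    progress = (iteration - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    return (BASE_LR - ETA_MIN) * (1.0 - progress) ** POWER + ETA_MIN


for it in (0, 750, 1500, 20000, 40000):
    print(f"iter {it:6d}: lr = {lr_at(it):.3e}")
```

Note that `eta_min=1e-05` is one third of the base rate, so under this reading the learning rate never drops below about 33% of its peak.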
@@ -0,0 +1,137 @@
_base_ = [
    '../_base_/models/upernet_mae.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]

norm_cfg = dict(
    type='BN',
)

crop_size = (512, 512)

data_preprocessor = dict(
    size=(512, 512),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)

model = dict(
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        init_values=1.0,
    ),
    data_preprocessor=dict(
        size=(512, 512),
        mean=[
            94.94709810464303,
            61.72942233949928,
            75.93763705236906,
        ],
        std=[
            44.005506081132594,
            42.69595666984776,
            44.99354156225523,
        ],
        bgr_to_rgb=False,
    ),
    pretrained='./My_Local_Model/pretrain/mae_pretrain_vit_base_mmcls.pth',
    decode_head=dict(
        in_channels=[
            768,
            768,
            768,
            768,
        ],
        channels=768,
        num_classes=36,
        norm_cfg=dict(
            type='BN',
        ),
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
    ),
    neck=dict(
        embed_dim=768,
        rescales=[
            4,
            2,
            1,
            0.5,
        ],
    ),
    auxiliary_head=dict(
        in_channels=768,
        norm_cfg=dict(
            type='BN',
        ),
        num_classes=36,
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=0.4,
        ),
    ),
    test_cfg=dict(
        mode='slide',
        crop_size=(512, 512),
        stride=(341, 341),
    ),
)

fp16 = dict(
    loss_scale='dynamic',
)

optim_wrapper = dict(
    _delete_=True,
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW',
        lr=3e-05,
        betas=(0.9, 0.999),
        weight_decay=0.05,
    ),
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(
        num_layers=12,
        layer_decay_rate=0.9,
    ),
)

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]

train_dataloader = dict(
    batch_size=4,
)
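
Both custom configs set `loss_decode=dict(type='DiceLoss', use_sigmoid=False, ...)`. For readers unfamiliar with the loss, here is a minimal pure-Python sketch of a generic multi-class soft Dice loss computed on softmax probabilities. It illustrates the general idea only; the smoothing term `eps`, the squared-probability denominator, and the helper names are assumptions and need not match the library's `DiceLoss` implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_dice_loss(pixel_logits, labels, num_classes, eps=1e-3):
    """Mean over classes of 1 - Dice, with Dice computed on soft predictions."""
    probs = [softmax(logits) for logits in pixel_logits]
    per_class = []
    for cls in range(num_classes):
        pred = [p[cls] for p in probs]
        target = [1.0 if y == cls else 0.0 for y in labels]
        intersection = sum(p * t for p, t in zip(pred, target))
        # target is binary, so sum(t * t) equals sum(t).
        denominator = sum(p * p for p in pred) + sum(target)
        per_class.append(1.0 - (2.0 * intersection + eps) / (denominator + eps))
    return sum(per_class) / num_classes


logits = [[10.0, -10.0], [-10.0, 10.0]]  # two pixels, two classes
good = soft_dice_loss(logits, labels=[0, 1], num_classes=2)
bad = soft_dice_loss(logits, labels=[1, 0], num_classes=2)
print(good, bad)
```

Because Dice is a ratio of overlap to total mass per class, it is less dominated by large background regions than a per-pixel average, which is one common reason to choose it for datasets with small or rare classes.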