first commit

admin
2026-05-20 15:05:35 +08:00
commit ac09b26253
2048 changed files with 189478 additions and 0 deletions


@@ -0,0 +1,82 @@
# MAE
> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
## Introduction
<!-- [BACKBONE] -->
<a href="https://github.com/facebookresearch/mae">Official Repo</a>
<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46">Code Snippet</a>
## Abstract
<!-- [ABSTRACT] -->
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>
</div>
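The random masking step described in the abstract can be illustrated with a minimal sketch. It is not the official implementation: the function name `random_masking` and the use of Python's standard `random` module are assumptions made here for illustration, and the 14x14 patch grid assumes a 224x224 input with 16x16 patches.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists by uniform random sampling.

    Hypothetical helper for illustration; only the visible patches would be
    passed to the encoder, while the decoder reconstructs the masked ones.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    visible = sorted(shuffled[:num_keep])
    masked = sorted(shuffled[num_keep:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
visible, masked = random_masking(num_patches=14 * 14, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

With a 75% mask ratio the encoder sees only a quarter of the patches, which is where the training speed-up mentioned in the abstract comes from.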
## Usage
To use pre-trained models from other repositories, their keys must first be converted.
We provide the script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the keys of MAE models from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.
```shell
python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```
E.g.
```shell
python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth
```
This script converts the model at `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.
In our default setting, the pretrained model is defined as follows:
| pretrained models | original models |
| ------------------------------- | ------------------------------------------------------------------------------------------------ |
| mae_pretrain_vit_base_mmcls.pth | ['mae_pretrain_vit_base'](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) |
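As a rough picture of what "converting keys" means, the sketch below renames the keys of a checkpoint-like dictionary by substring rules. It is an illustration only and not the code of `beit2mmseg.py`: the function `convert_keys`, the rules in `RULES`, and the example keys are all hypothetical, and plain strings stand in for tensors.

```python
def convert_keys(state_dict, rename_rules):
    """Return a new dict with each (old, new) substring rule applied to every key, in order."""
    converted = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in rename_rules:
            new_key = new_key.replace(old, new)
        converted[new_key] = value
    return converted


# Hypothetical rules and checkpoint keys, for illustration only.
RULES = [("blocks.", "layers."), ("mlp.fc1", "ffn.layers.0.0")]
checkpoint = {"blocks.0.mlp.fc1.weight": "tensor-a", "cls_token": "tensor-b"}

print(convert_keys(checkpoint, RULES))
# {'layers.0.ffn.layers.0.0.weight': 'tensor-a', 'cls_token': 'tensor-b'}
```

Keys that match no rule pass through unchanged, so only the parameters whose naming differs between the two code bases are affected.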
Verify the single-scale results of the model:
```shell
sh tools/dist_test.sh \
configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```
Because relative position embedding requires the input height and width to be equal, sliding-window inference is adopted for multi-scale testing. We therefore set `min_size=512`, so that the shortest edge is 512, and run multi-scale inference through a separate config instead of `--aug-test`. For multi-scale inference:
```shell
sh tools/dist_test.sh \
configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```
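To make the sliding-window setting concrete, the sketch below lists the crop windows for a given image size, using the `crop_size=(512, 512)` and `stride=(341, 341)` values from the config in this directory. It assumes windows are laid out on a regular grid and that the last window in each row or column is shifted back so it stays inside the image; the function name `slide_windows` is hypothetical and this is not the library's own code.

```python
def slide_windows(img_h, img_w, crop=(512, 512), stride=(341, 341)):
    """List (y1, x1, y2, x2) crop windows that cover an img_h x img_w image."""
    crop_h, crop_w = crop
    stride_h, stride_w = stride
    # Number of grid steps needed along each axis (at least one window).
    rows = max(img_h - crop_h + stride_h - 1, 0) // stride_h + 1
    cols = max(img_w - crop_w + stride_w - 1, 0) // stride_w + 1
    windows = []
    for r in range(rows):
        for c in range(cols):
            y2 = min(r * stride_h + crop_h, img_h)
            x2 = min(c * stride_w + crop_w, img_w)
            # Shift the window back so it never extends past the image border.
            y1 = max(y2 - crop_h, 0)
            x1 = max(x2 - crop_w, 0)
            windows.append((y1, x1, y2, x2))
    return windows


# An image resized so that its shortest edge is 512.
print(slide_windows(512, 683))
# [(0, 0, 512, 512), (0, 171, 512, 683)]
```

Every window is a square 512x512 crop, which is why fixing the shortest edge at 512 keeps the backbone input height and width equal.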
## Results and models
### ADE20K
| Method | Backbone | Crop Size | pretrain | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU | mIoU(ms+flip) | config | download |
| ------- | -------- | --------- | ----------- | ----------------- | ---------- | ------- | -------- | -------------- | ------ | ----- | ------------: | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| UPerNet | ViT-B | 512x512 | ImageNet-1K | 224x224 | 16 | 160000 | 9.96 | 7.14 | V100 | 48.13 | 48.70 | [config](https://github.com/open-mmlab/mmsegmentation/blob/main/configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) |
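The 160k schedule in the linked config combines a linear warmup over the first 1500 iterations (`LinearLR`, `start_factor=1e-6`) with a polynomial decay (`PolyLR`, `power=1.0`, `eta_min=0.0`) from a base learning rate of `1e-4`. The sketch below is a closed-form approximation of that shape, assuming both phases are simple functions of the iteration index; the function name `learning_rate` is hypothetical, and the schedulers' exact per-iteration values may differ slightly at the phase boundaries.

```python
def learning_rate(step, base_lr=1e-4, warmup_iters=1500, max_iters=160000,
                  start_factor=1e-6, power=1.0, eta_min=0.0):
    """Approximate learning rate at a given iteration for warmup + poly decay."""
    if step < warmup_iters:
        # Linear warmup: the factor grows from start_factor to 1.0.
        factor = start_factor + (1.0 - start_factor) * step / warmup_iters
        return base_lr * factor
    # Polynomial decay from base_lr down to eta_min over the remaining iterations.
    progress = (step - warmup_iters) / (max_iters - warmup_iters)
    return (base_lr - eta_min) * (1.0 - progress) ** power + eta_min


for step in (0, 750, 1500, 80750, 160000):
    print(step, learning_rate(step))
```

With `power=1.0` the decay phase is a straight line, so the learning rate halfway through the decay (iteration 80750) is about half of the base value.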
## Citation
```bibtex
@article{he2021masked,
title={Masked autoencoders are scalable vision learners},
author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},
journal={arXiv preprint arXiv:2111.06377},
year={2021}
}
```


@@ -0,0 +1,16 @@
_base_ = './mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py'
test_pipeline = [
    dict(type='LoadImageFromFile'),
    # TODO: Refactor 'MultiScaleFlipAug' which supports
    # `min_size` feature in `Resize` class
    # img_ratios is [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
    # original image scale is (2048, 512)
    dict(type='Resize', scale=(2048, 512), keep_ratio=True),
    # add loading annotation after ``Resize`` because ground truth
    # does not need to do resize data transform
    dict(type='LoadAnnotations', reduce_zero_label=True),
    dict(type='PackSegInputs')
]
val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader


@@ -0,0 +1,53 @@
_base_ = [
    '../_base_/models/upernet_mae.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
crop_size = (512, 512)
data_preprocessor = dict(size=crop_size)
model = dict(
    data_preprocessor=data_preprocessor,
    pretrained='./pretrain/mae_pretrain_vit_base_mmcls.pth',
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        patch_size=16,
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        mlp_ratio=4,
        init_values=1.0,
        drop_path_rate=0.1,
        out_indices=[3, 5, 7, 11]),
    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),
    decode_head=dict(
        in_channels=[768, 768, 768, 768], num_classes=150, channels=768),
    auxiliary_head=dict(in_channels=768, num_classes=150),
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))
optim_wrapper = dict(
    _delete_=True,
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW', lr=1e-4, betas=(0.9, 0.999), weight_decay=0.05),
    paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.65),
    constructor='LayerDecayOptimizerConstructor')
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1e-6, by_epoch=False, begin=0, end=1500),
    dict(
        type='PolyLR',
        eta_min=0.0,
        power=1.0,
        begin=1500,
        end=160000,
        by_epoch=False,
    )
]
# mixed precision
fp16 = dict(loss_scale='dynamic')
# The reported results use 8 GPUs with 2 images per GPU (total batch size 16);
# batch_size=4 below doubles the per-GPU batch relative to that setup
train_dataloader = dict(batch_size=4)
val_dataloader = dict(batch_size=1)
test_dataloader = val_dataloader


@@ -0,0 +1,25 @@
Models:
  - Name: mae-base_upernet_8xb2-amp-160k_ade20k-512x512
    In Collection: UPerNet
    Results:
      Task: Semantic Segmentation
      Dataset: ADE20K
      Metrics:
        mIoU: 48.13
        mIoU(ms+flip): 48.7
    Config: configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py
    Metadata:
      Training Data: ADE20K
      Batch Size: 16
      Architecture:
        - ViT-B
        - UPerNet
      Training Resources: 8x V100 GPUS
      Memory (GB): 9.96
    Weights: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth
    Training log: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json
    Paper:
      Title: Masked Autoencoders Are Scalable Vision Learners
      URL: https://arxiv.org/abs/2111.06377
    Code: https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46
    Framework: PyTorch


@@ -0,0 +1,130 @@
_base_ = [
    '../_base_/models/upernet_mae.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]
norm_cfg = dict(
    type='BN',
)
crop_size = (512, 512)
data_preprocessor = dict(
    size=(512, 512),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)
model = dict(
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        init_values=1.0,
        # out_indices=[3, 5, 7, 11]
    ),
    # Reuse the module-level data_preprocessor instead of repeating its values.
    data_preprocessor=data_preprocessor,
    pretrained='./My_Local_Model/pretrain/mae_pretrain_vit_base_mmcls.pth',
    decode_head=dict(
        in_channels=[
            768,
            768,
            768,
            768,
        ],
        channels=768,
        num_classes=36,
        norm_cfg=dict(
            type='BN',
        ),
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
    ),
    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),  # out_indices=[3, 5, 7, 11]
    auxiliary_head=dict(
        in_channels=768,
        norm_cfg=dict(
            type='BN',
        ),
        num_classes=36,
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=0.4,
        ),
    ),
    test_cfg=dict(
        mode='slide',
        crop_size=(512, 512),
        stride=(341, 341),
    ),
)
fp16 = dict(
    loss_scale='dynamic',
)
optim_wrapper = dict(
    _delete_=True,
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW',
        lr=3e-05,
        betas=(0.9, 0.999),
        weight_decay=0.05,
    ),
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(
        num_layers=12,
        layer_decay_rate=0.9,
    ),
)
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]
train_dataloader = dict(
    batch_size=2,
)


@@ -0,0 +1,137 @@
_base_ = [
    '../_base_/models/upernet_mae.py',
    '../_base_/datasets/my_dataset_model.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_40k_check_4000.py',
]
norm_cfg = dict(
    type='BN',
)
crop_size = (512, 512)
data_preprocessor = dict(
    size=(512, 512),
    mean=[
        94.94709810464303,
        61.72942233949928,
        75.93763705236906,
    ],
    std=[
        44.005506081132594,
        42.69595666984776,
        44.99354156225523,
    ],
    bgr_to_rgb=False,
)
model = dict(
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        init_values=1.0,
    ),
    # Reuse the module-level data_preprocessor instead of repeating its values.
    data_preprocessor=data_preprocessor,
    pretrained='./My_Local_Model/pretrain/mae_pretrain_vit_base_mmcls.pth',
    decode_head=dict(
        in_channels=[
            768,
            768,
            768,
            768,
        ],
        channels=768,
        num_classes=36,
        norm_cfg=dict(
            type='BN',
        ),
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
    ),
    neck=dict(
        embed_dim=768,
        rescales=[
            4,
            2,
            1,
            0.5,
        ],
    ),
    auxiliary_head=dict(
        in_channels=768,
        norm_cfg=dict(
            type='BN',
        ),
        num_classes=36,
        loss_decode=dict(
            type='DiceLoss',
            use_sigmoid=False,
            loss_weight=0.4,
        ),
    ),
    test_cfg=dict(
        mode='slide',
        crop_size=(512, 512),
        stride=(341, 341),
    ),
)
fp16 = dict(
    loss_scale='dynamic',
)
optim_wrapper = dict(
    _delete_=True,
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW',
        lr=3e-05,
        betas=(0.9, 0.999),
        weight_decay=0.05,
    ),
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(
        num_layers=12,
        layer_decay_rate=0.9,
    ),
)
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-06,
        by_epoch=False,
        begin=0,
        end=1500,
    ),
    dict(
        type='PolyLR',
        power=0.9,
        begin=1500,
        end=40000,
        eta_min=1e-05,
        by_epoch=False,
    ),
]
train_dataloader = dict(
    batch_size=4,
)