first commit

2026-05-20 15:05:35 +08:00
commit ac09b26253
2048 changed files with 189478 additions and 0 deletions
--- a/Seg_All_In_One_MMSeg/configs/mae/README.md
+++ b/Seg_All_In_One_MMSeg/configs/mae/README.md
@@ -0,0 +1,82 @@
+# MAE
+
+> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
+
+## Introduction
+
+<!-- [BACKBONE] -->
+
+<a href="https://github.com/facebookresearch/mae">Official Repo</a>
+
+<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46">Code Snippet</a>
+
+## Abstract
+
+<!-- [ABSTRACT] -->
+
+This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
+
+<!-- [IMAGE] -->
+
+<div align=center>
+<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>
+</div>
+
+## Usage
+
+To use pre-trained models from other repositories, their checkpoint keys must first be converted.
+
+We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the keys of the MAE model from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.
+
+```shell
+python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
+```
+
+For example:
+
+```shell
+python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth
+```
+
+This script converts the model at `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.
+
+In our default setting, the pretrained models are defined as below:
+
+| pretrained models               | original models                                                                                  |
+| ------------------------------- | ------------------------------------------------------------------------------------------------ |
+| mae_pretrain_vit_base_mmcls.pth | ['mae_pretrain_vit_base'](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) |
+
+Verify the single-scale results of the model:
+
+```shell
+sh tools/dist_test.sh \
+configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py \
+upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
+```
+
+Because relative position embedding requires the input height and width to be equal, sliding-window inference is adopted for multi-scale testing. We set `min_size=512`, that is, the shortest edge is 512. Multi-scale inference is therefore run with a separate config instead of `--aug-test`:
+
+```shell
+sh tools/dist_test.sh \
+configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512-ms.py \
+upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
+```
+
+## Results and models
+
+### ADE20K
+
+| Method  | Backbone | Crop Size | pretrain    | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | Device | mIoU  | mIoU(ms+flip) | config                                                                                                                        | download                                                                                                                                                                                                                                                                                                                                                                       |
+| ------- | -------- | --------- | ----------- | ----------------- | ---------- | ------- | -------- | -------------- | ------ | ----- | ------------: | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| UPerNet | ViT-B    | 512x512   | ImageNet-1K | 224x224           | 16         | 160000  | 9.96     | 7.14           | V100   | 48.13 |         48.70 | [config](https://github.com/open-mmlab/mmsegmentation/blob/main/configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) |
+
+## Citation
+
+```bibtex
+@article{he2021masked,
+  title={Masked autoencoders are scalable vision learners},
+  author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},
+  journal={arXiv preprint arXiv:2111.06377},
+  year={2021}
+}
+```
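Note on the sliding-window inference mentioned in the README above: the configs in this commit set `test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341))`. The sketch below enumerates the crop windows such a setting would visit for one image. The helper name `slide_windows` is hypothetical, and the grid arithmetic is an assumption based on a common ceil-division scheme rather than code taken from MMSegmentation.

```python
def slide_windows(img_h, img_w, crop=(512, 512), stride=(341, 341)):
    """Return (y1, x1, y2, x2) crop boxes covering an img_h x img_w image.

    Hypothetical helper for illustration only.
    """
    crop_h, crop_w = crop
    stride_h, stride_w = stride
    # Number of window positions per axis (ceil division, at least one).
    h_grids = max(img_h - crop_h + stride_h - 1, 0) // stride_h + 1
    w_grids = max(img_w - crop_w + stride_w - 1, 0) // stride_w + 1
    boxes = []
    for i in range(h_grids):
        for j in range(w_grids):
            y2 = min(i * stride_h + crop_h, img_h)
            x2 = min(j * stride_w + crop_w, img_w)
            # Shift the last window back so it stays inside the image.
            y1 = max(y2 - crop_h, 0)
            x1 = max(x2 - crop_w, 0)
            boxes.append((y1, x1, y2, x2))
    return boxes
```

Under these assumptions, a 512x683 image (height x width) needs two overlapping windows along the width and one along the height; predictions in the overlap would then be averaged.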
--- a/Seg_All_In_One_MMSeg/configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512-ms.py
+++ b/Seg_All_In_One_MMSeg/configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512-ms.py
@@ -0,0 +1,16 @@
+_base_ = './mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py'
+
+test_pipeline = [
+    dict(type='LoadImageFromFile'),
+    # TODO: Refactor 'MultiScaleFlipAug' which supports
+    # `min_size` feature in `Resize` class
+    # img_ratios is [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
+    # original image scale is (2048, 512)
+    dict(type='Resize', scale=(2048, 512), keep_ratio=True),
+    # load annotations after ``Resize`` because the ground truth
+    # does not need to go through the resize transform
+    dict(type='LoadAnnotations', reduce_zero_label=True),
+    dict(type='PackSegInputs')
+]
+val_dataloader = dict(batch_size=1, dataset=dict(pipeline=test_pipeline))
+test_dataloader = val_dataloader
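Note on `Resize(scale=(2048, 512), keep_ratio=True)` in the config above: with `keep_ratio=True` the image is scaled so that it fits inside the given bounds without changing its aspect ratio. The sketch below shows that arithmetic. The function name `rescale_keep_ratio` is hypothetical, and the rounding rule is an assumption; the actual transform may round differently.

```python
def rescale_keep_ratio(width, height, scale=(2048, 512)):
    """Return (new_w, new_h) so the image fits inside `scale`, aspect kept.

    Hypothetical helper: the long edge is bounded by max(scale) and the
    short edge by min(scale); the smaller of the two factors wins.
    """
    max_long, max_short = max(scale), min(scale)
    factor = min(max_long / max(width, height),
                 max_short / min(width, height))
    # Round to the nearest integer pixel (assumed rounding rule).
    return int(width * factor + 0.5), int(height * factor + 0.5)
```

For typical ADE20K-like aspect ratios the short-edge bound of 512 is the binding one, which matches the README statement that the shortest edge is 512.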
--- a/Seg_All_In_One_MMSeg/configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py
+++ b/Seg_All_In_One_MMSeg/configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py
@@ -0,0 +1,53 @@
+_base_ = [
+    '../_base_/models/upernet_mae.py', '../_base_/datasets/ade20k.py',
+    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
+]
+crop_size = (512, 512)
+data_preprocessor = dict(size=crop_size)
+model = dict(
+    data_preprocessor=data_preprocessor,
+    pretrained='./pretrain/mae_pretrain_vit_base_mmcls.pth',
+    backbone=dict(
+        type='MAE',
+        img_size=(512, 512),
+        patch_size=16,
+        embed_dims=768,
+        num_layers=12,
+        num_heads=12,
+        mlp_ratio=4,
+        init_values=1.0,
+        drop_path_rate=0.1,
+        out_indices=[3, 5, 7, 11]),
+    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),
+    decode_head=dict(in_channels=[768, 768, 768, 768], num_classes=150, channels=768),
+    auxiliary_head=dict(in_channels=768, num_classes=150),
+    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))
+
+optim_wrapper = dict(
+    _delete_=True,
+    type='OptimWrapper',
+    optimizer=dict(
+        type='AdamW', lr=1e-4, betas=(0.9, 0.999), weight_decay=0.05),
+    paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.65),
+    constructor='LayerDecayOptimizerConstructor')
+
+param_scheduler = [
+    dict(
+        type='LinearLR', start_factor=1e-6, by_epoch=False, begin=0, end=1500),
+    dict(
+        type='PolyLR',
+        eta_min=0.0,
+        power=1.0,
+        begin=1500,
+        end=160000,
+        by_epoch=False,
+    )
+]
+
+# mixed precision
+fp16 = dict(loss_scale='dynamic')
+
+# The reference results use 8 GPUs with 2 images per GPU (batch size 16); this config sets 4 images per GPU
+train_dataloader = dict(batch_size=4)
+val_dataloader = dict(batch_size=1)
+test_dataloader = val_dataloader
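Note on `param_scheduler` in the config above: it chains a linear warmup (iterations 0 to 1500, starting at a factor of 1e-6) with a polynomial decay (power 1.0, down to 0.0 at iteration 160000) on top of the AdamW base rate of 1e-4. The sketch below is a closed-form approximation of that schedule. The function name `lr_at` is hypothetical, and the scheduler implementation may compute the values recursively and differ slightly at the boundaries.

```python
def lr_at(step, base_lr=1e-4, warmup_end=1500, total=160000,
          start_factor=1e-6, power=1.0, eta_min=0.0):
    """Approximate learning rate at a given iteration (hypothetical helper)."""
    if step < warmup_end:
        # Linear warmup: factor grows from start_factor to 1.0.
        factor = start_factor + (1.0 - start_factor) * step / warmup_end
        return base_lr * factor
    # Polynomial decay from base_lr to eta_min over the remaining iterations.
    progress = (step - warmup_end) / (total - warmup_end)
    return (base_lr - eta_min) * (1.0 - progress) ** power + eta_min
```

Passing `base_lr=3e-5, total=40000, power=0.9, eta_min=1e-5` would approximate the schedule used by the custom-dataset configs in the same commit.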
--- a/Seg_All_In_One_MMSeg/configs/mae/metafile.yaml
+++ b/Seg_All_In_One_MMSeg/configs/mae/metafile.yaml
@@ -0,0 +1,25 @@
+Models:
+- Name: mae-base_upernet_8xb2-amp-160k_ade20k-512x512
+  In Collection: UPerNet
+  Results:
+    Task: Semantic Segmentation
+    Dataset: ADE20K
+    Metrics:
+      mIoU: 48.13
+      mIoU(ms+flip): 48.7
+  Config: configs/mae/mae-base_upernet_8xb2-amp-160k_ade20k-512x512.py
+  Metadata:
+    Training Data: ADE20K
+    Batch Size: 16
+    Architecture:
+    - ViT-B
+    - UPerNet
+    Training Resources: 8x V100 GPUS
+    Memory (GB): 9.96
+  Weights: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth
+  Training log: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json
+  Paper:
+    Title: Masked Autoencoders Are Scalable Vision Learners
+    URL: https://arxiv.org/abs/2111.06377
+  Code: https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46
+  Framework: PyTorch
--- a/Seg_All_In_One_MMSeg/configs/mae/my_mae_upernet-base_b2_g1-40k_check_4000_my_dataset_model-512x512-testslide.py
+++ b/Seg_All_In_One_MMSeg/configs/mae/my_mae_upernet-base_b2_g1-40k_check_4000_my_dataset_model-512x512-testslide.py
@@ -0,0 +1,130 @@
+_base_ = [
+    '../_base_/models/upernet_mae.py',
+    '../_base_/datasets/my_dataset_model.py',
+    '../_base_/default_runtime.py',
+    '../_base_/schedules/schedule_40k_check_4000.py',
+]
+
+norm_cfg = dict(
+    type='BN',
+)
+
+crop_size = (512, 512)
+
+data_preprocessor = dict(
+    size=(512, 512),
+    mean=[
+        94.94709810464303,
+        61.72942233949928,
+        75.93763705236906,
+    ],
+    std=[
+        44.005506081132594,
+        42.69595666984776,
+        44.99354156225523,
+    ],
+    bgr_to_rgb=False,
+)
+
+model = dict(
+    backbone=dict(
+        type='MAE',
+        img_size=(512, 512),
+        init_values=1.0,
+        # out_indices=[3, 5, 7, 11]
+    ),
+    data_preprocessor=dict(
+        size=(512, 512),
+        mean=[
+            94.94709810464303,
+            61.72942233949928,
+            75.93763705236906,
+        ],
+        std=[
+            44.005506081132594,
+            42.69595666984776,
+            44.99354156225523,
+        ],
+        bgr_to_rgb=False,
+    ),
+    pretrained='./My_Local_Model/pretrain/mae_pretrain_vit_base_mmcls.pth',
+    decode_head=dict(
+        in_channels=[
+            768,
+            768,
+            768,
+            768,
+        ],
+        channels=768,
+        num_classes=36,
+        norm_cfg=dict(
+            type='BN',
+        ),
+        loss_decode=dict(
+            type='DiceLoss',
+            use_sigmoid=False,
+            loss_weight=1.0,
+        ),
+    ),
+    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),  # out_indices=[3, 5, 7, 11]
+    auxiliary_head=dict(
+        in_channels=768,
+        norm_cfg=dict(
+            type='BN',
+        ),
+        num_classes=36,
+        loss_decode=dict(
+            type='DiceLoss',
+            use_sigmoid=False,
+            loss_weight=0.4,
+        ),
+    ),
+    test_cfg=dict(
+        mode='slide',
+        crop_size=(512, 512),
+        stride=(341, 341),
+    ),
+)
+
+fp16 = dict(
+    loss_scale='dynamic',
+)
+
+optim_wrapper = dict(
+    _delete_=True,
+    type='AmpOptimWrapper',
+    optimizer=dict(
+        type='AdamW',
+        lr=3e-05,
+        betas=(0.9, 0.999),
+        weight_decay=0.05,
+    ),
+    constructor='LayerDecayOptimizerConstructor',
+    paramwise_cfg=dict(
+        num_layers=12,
+        layer_decay_rate=0.9,
+    ),
+)
+
+param_scheduler = [
+    dict(
+        type='LinearLR',
+        start_factor=1e-06,
+        by_epoch=False,
+        begin=0,
+        end=1500,
+    ),
+    dict(
+        type='PolyLR',
+        power=0.9,
+        begin=1500,
+        end=40000,
+        eta_min=1e-05,
+        by_epoch=False,
+    ),
+]
+
+train_dataloader = dict(
+    batch_size=2,
+)
+
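Note on `paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.9)` in the config above: layer-wise learning-rate decay gives earlier backbone layers a smaller learning rate than later ones. The sketch below computes one multiplier per depth index. The function name `layer_lr_scales` and the indexing convention (embeddings at index 0, heads at the last index) are assumptions about how `LayerDecayOptimizerConstructor` groups parameters, not a copy of its code.

```python
def layer_lr_scales(num_layers=12, decay_rate=0.9):
    """Map a depth index to its learning-rate multiplier.

    Assumed convention: index 0 is the patch/position embedding,
    1..num_layers are the transformer blocks, and num_layers + 1
    covers everything after the backbone (neck and heads).
    """
    depth = num_layers + 2
    return {i: decay_rate ** (depth - 1 - i) for i in range(depth)}


def layer_lrs(base_lr=3e-5, num_layers=12, decay_rate=0.9):
    """Effective learning rate per depth index under the same assumptions."""
    scales = layer_lr_scales(num_layers, decay_rate)
    return {i: base_lr * s for i, s in scales.items()}
```

With a rate of 0.9 the embeddings still train at roughly a quarter of the base rate, whereas the 0.65 used by the ADE20K config shrinks the earliest layers far more aggressively.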
--- a/Seg_All_In_One_MMSeg/configs/mae/my_mae_upernet-base_b4_g1-40k_check_4000_my_dataset_model-512x512-testslide.py
+++ b/Seg_All_In_One_MMSeg/configs/mae/my_mae_upernet-base_b4_g1-40k_check_4000_my_dataset_model-512x512-testslide.py
@@ -0,0 +1,137 @@
+_base_ = [
+    '../_base_/models/upernet_mae.py',
+    '../_base_/datasets/my_dataset_model.py',
+    '../_base_/default_runtime.py',
+    '../_base_/schedules/schedule_40k_check_4000.py',
+]
+
+norm_cfg = dict(
+    type='BN',
+)
+
+crop_size = (512, 512)
+
+data_preprocessor = dict(
+    size=(512, 512),
+    mean=[
+        94.94709810464303,
+        61.72942233949928,
+        75.93763705236906,
+    ],
+    std=[
+        44.005506081132594,
+        42.69595666984776,
+        44.99354156225523,
+    ],
+    bgr_to_rgb=False,
+)
+
+model = dict(
+    backbone=dict(
+        type='MAE',
+        img_size=(512, 512),
+        init_values=1.0,
+    ),
+    data_preprocessor=dict(
+        size=(512, 512),
+        mean=[
+            94.94709810464303,
+            61.72942233949928,
+            75.93763705236906,
+        ],
+        std=[
+            44.005506081132594,
+            42.69595666984776,
+            44.99354156225523,
+        ],
+        bgr_to_rgb=False,
+    ),
+    pretrained='./My_Local_Model/pretrain/mae_pretrain_vit_base_mmcls.pth',
+    decode_head=dict(
+        in_channels=[
+            768,
+            768,
+            768,
+            768,
+        ],
+        channels=768,
+        num_classes=36,
+        norm_cfg=dict(
+            type='BN',
+        ),
+        loss_decode=dict(
+            type='DiceLoss',
+            use_sigmoid=False,
+            loss_weight=1.0,
+        ),
+    ),
+    neck=dict(
+        embed_dim=768,
+        rescales=[
+            4,
+            2,
+            1,
+            0.5,
+        ],
+    ),
+    auxiliary_head=dict(
+        in_channels=768,
+        norm_cfg=dict(
+            type='BN',
+        ),
+        num_classes=36,
+        loss_decode=dict(
+            type='DiceLoss',
+            use_sigmoid=False,
+            loss_weight=0.4,
+        ),
+    ),
+    test_cfg=dict(
+        mode='slide',
+        crop_size=(512, 512),
+        stride=(341, 341),
+    ),
+)
+
+fp16 = dict(
+    loss_scale='dynamic',
+)
+
+optim_wrapper = dict(
+    _delete_=True,
+    type='AmpOptimWrapper',
+    optimizer=dict(
+        type='AdamW',
+        lr=3e-05,
+        betas=(0.9, 0.999),
+        weight_decay=0.05,
+    ),
+    constructor='LayerDecayOptimizerConstructor',
+    paramwise_cfg=dict(
+        num_layers=12,
+        layer_decay_rate=0.9,
+    ),
+)
+
+param_scheduler = [
+    dict(
+        type='LinearLR',
+        start_factor=1e-06,
+        by_epoch=False,
+        begin=0,
+        end=1500,
+    ),
+    dict(
+        type='PolyLR',
+        power=0.9,
+        begin=1500,
+        end=40000,
+        eta_min=1e-05,
+        by_epoch=False,
+    ),
+]
+
+train_dataloader = dict(
+    batch_size=4,
+)
+
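Note on `data_preprocessor` in the config above: the dataset-specific `mean` and `std` standardize each channel before the image reaches the backbone, and `bgr_to_rgb=False` leaves the channel order as loaded. The sketch below shows the per-pixel arithmetic in plain Python. The function name `normalize_pixel` is hypothetical, and the statistics are assumed to be listed in the same channel order as the loaded image.

```python
# Channel statistics copied from the config above.
MEAN = [94.94709810464303, 61.72942233949928, 75.93763705236906]
STD = [44.005506081132594, 42.69595666984776, 44.99354156225523]


def normalize_pixel(pixel, mean=MEAN, std=STD):
    """Standardize one 3-channel pixel: (value - mean) / std per channel."""
    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]
```

A pixel equal to the channel means maps to zeros, and a pixel one standard deviation above the means maps to ones.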