Support LVIS chunked evaluation and image chunked inference of GLIP (#…
hhaAndroid committed Nov 9, 2023
1 parent 4a516c3 commit 51f8aee
Showing 11 changed files with 730 additions and 62 deletions.
23 changes: 22 additions & 1 deletion configs/glip/README.md
@@ -56,7 +56,7 @@ model.save_pretrained("your path/bert-base-uncased")
tokenizer.save_pretrained("your path/bert-base-uncased")
```

## Results and Models
## COCO Results and Models

| Model | Zero-shot or Finetune | COCO mAP | Official COCO mAP | Pre-Train Data | Config | Download |
| :--------: | :-------------------: | :------: | ----------------: | :------------------------: | :---------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
@@ -78,3 +78,24 @@ Note:
3. Taking the GLIP-T(A) model as an example, we trained it twice using the official code, and the fine-tuned mAPs were 52.5 and 52.6. The mAP we achieve in our reproduction is therefore higher than the official result, mainly because we modified the `weight_decay` parameter.
4. Our experiments revealed that training for 24 epochs leads to overfitting, so we report the best-performing checkpoint. If you train on a custom dataset, it is advisable to shorten the number of epochs and save the best-performing checkpoint (see the config sketch after these notes).
5. Because the official fine-tuning hyperparameters for the GLIP-L model are not available, we have not yet reproduced the official accuracy. We found that overfitting can also occur here, so custom changes to the data augmentation and the model may be necessary. Given the high cost of training, we have not investigated this further for now.
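
Notes 4 and 5 are training advice rather than something this commit changes. As a rough illustration of note 4, shortening the schedule and keeping only the best checkpoint can be expressed as a config override like the sketch below; the base config name and all values are illustrative assumptions using MMEngine's standard `train_cfg` and `CheckpointHook` fields, not settings from this commit.

```python
# Hypothetical fine-tuning override for a custom dataset: a shorter schedule
# plus "save the best checkpoint", as suggested in note 4 above.
_base_ = './your_glip_finetune_config.py'  # placeholder, not a file from this commit

# Fewer epochs to reduce overfitting; validate every epoch.
train_cfg = dict(max_epochs=12, val_interval=1)

default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,
        max_keep_ckpts=2,
        save_best='auto'))  # keep the checkpoint with the best validation metric
```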

## LVIS Results

| Model | Official | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data | Config | Download |
| :--------: | :------: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :--------: | :-------: | :------------------------: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: |
| GLIP-T (A) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
| GLIP-T (A) | | 12.1 | 15.5 | 25.8 | 20.2 | 6.2 | 10.9 | 22.8 | 14.7 | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
| GLIP-T (B) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
| GLIP-T (B) | | 8.6 | 13.9 | 26.0 | 19.3 | 4.6 | 9.8 | 22.6 | 13.9 | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
| GLIP-T (C) | ✔ | 14.3 | 19.4 | 31.1 | 24.6 | | | | | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
| GLIP-T (C) | | 14.4 | 19.8 | 31.9 | 25.2 | 8.3 | 13.2 | 28.1 | 18.2 | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
| GLIP-T | ✔ | | | | | | | | | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
| GLIP-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
| GLIP-L | ✔ | 29.2 | 34.9 | 42.1 | 37.9 | | | | | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |
| GLIP-L | | 27.9 | 33.7 | 39.7 | 36.1 | | | | | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |

Note:

1. The above are zero-shot evaluation results (see the note on chunked evaluation after this list).
2. The evaluation metric we used is LVIS FixedAP. For details, please refer to [Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details](https://arxiv.org/pdf/2102.01066.pdf).
3. We found that the performance of the small models is better than the official results, but that of the large model is lower. This is mainly because our post-processing is not fully aligned with the official GLIP implementation.
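
The commit adds chunked evaluation for exactly this setting: LVIS has 1203 categories, far too many to encode in a single text prompt, so the class names are split into chunks of `chunked_size` (40 in the configs below) and inference runs once per chunk, with the per-chunk detections presumably merged up to `max_per_img` boxes. The snippet below is only a minimal sketch of the chunking idea, not the implementation added in this commit.

```python
# Minimal sketch of the chunking idea (not the GLIP code from this commit).
def chunk_categories(class_names, chunked_size):
    """Split a long category list into prompt-sized chunks."""
    if chunked_size <= 0:
        return [list(class_names)]
    return [
        list(class_names[i:i + chunked_size])
        for i in range(0, len(class_names), chunked_size)
    ]

# With the 1203 LVIS categories and chunked_size=40, each forward pass
# only has to encode 40 class names.
chunks = chunk_categories([f'class_{i}' for i in range(1203)], 40)
print(len(chunks))  # 31 chunks
```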
@@ -0,0 +1,12 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'

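# Swin-L backbone settings (the base config uses Swin-T); the wider neck
# channels, early fusion and 8 DyHead blocks match the GLIP-L architecture.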
model = dict(
backbone=dict(
embed_dims=192,
depths=[2, 2, 18, 2],
num_heads=[6, 12, 24, 48],
window_size=12,
drop_path_rate=0.4,
),
neck=dict(in_channels=[384, 768, 1536]),
bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
@@ -0,0 +1,12 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'

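# Swin-L backbone settings (the base config uses Swin-T); the wider neck
# channels, early fusion and 8 DyHead blocks match the GLIP-L architecture.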
model = dict(
backbone=dict(
embed_dims=192,
depths=[2, 2, 18, 2],
num_heads=[6, 12, 24, 48],
window_size=12,
drop_path_rate=0.4,
),
neck=dict(in_channels=[384, 768, 1536]),
bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
@@ -0,0 +1,24 @@
_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'

model = dict(test_cfg=dict(
max_per_img=300,
chunked_size=40,
))

dataset_type = 'LVISV1Dataset'
data_root = 'data/coco/'

val_dataloader = dict(
dataset=dict(
data_root=data_root,
type=dataset_type,
ann_file='annotations/lvis_od_val.json',
data_prefix=dict(img='')))
test_dataloader = val_dataloader

# Note: the LVIS FixedAP evaluation requires numpy < 1.24.0
val_evaluator = dict(
_delete_=True,
type='LVISFixedAPMetric',
ann_file=data_root + 'annotations/lvis_od_val.json')
test_evaluator = val_evaluator
@@ -0,0 +1,25 @@
_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'

model = dict(test_cfg=dict(
max_per_img=300,
chunked_size=40,
))

dataset_type = 'LVISV1Dataset'
data_root = 'data/coco/'

val_dataloader = dict(
dataset=dict(
data_root=data_root,
type=dataset_type,
ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
data_prefix=dict(img='')))
test_dataloader = val_dataloader

# Note: the LVIS FixedAP evaluation requires numpy < 1.24.0
val_evaluator = dict(
_delete_=True,
type='LVISFixedAPMetric',
ann_file=data_root +
'annotations/lvis_v1_minival_inserted_image_name.json')
test_evaluator = val_evaluator
@@ -0,0 +1,3 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'

model = dict(bbox_head=dict(early_fuse=True))
@@ -0,0 +1,3 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'

model = dict(bbox_head=dict(early_fuse=True))
37 changes: 35 additions & 2 deletions demo/image_demo.py
@@ -28,6 +28,16 @@
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts 'There are a lot of cars here.'
python demo/image_demo.py demo/demo.jpg \
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts '$: coco'
python demo/image_demo.py demo/demo.jpg \
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts '$: lvis' --pred-score-thr 0.7 \
--palette random --chunked-size 80
Visualize prediction results::
python demo/image_demo.py demo/demo.jpg rtmdet-ins-s --show
@@ -41,6 +51,7 @@
from mmengine.logging import print_log

from mmdet.apis import DetInferencer
from mmdet.evaluation import get_classes


def parse_args():
@@ -60,7 +71,12 @@ def parse_args():
type=str,
default='outputs',
help='Output directory of images or prediction results.')
parser.add_argument('--texts', help='text prompt')
# If the text prompt is given in the form `$: xxx`, it is expanded to
# the class names of the corresponding dataset.
# Supported values: $: coco, $: voc, $: cityscapes, $: lvis, $: imagenet_det.
# See `mmdet/evaluation/functional/class_names.py` for details.
parser.add_argument(
'--texts', help='text prompt, such as "bench . car .", "$: coco"')
parser.add_argument(
'--device', default='cuda:0', help='Device used for inference')
parser.add_argument(
@@ -91,14 +107,21 @@ def parse_args():
default='none',
choices=['coco', 'voc', 'citys', 'random', 'none'],
help='Color palette used for visualization')
# only for GLIP
# only for GLIP and Grounding DINO
parser.add_argument(
'--custom-entities',
'-c',
action='store_true',
help='Whether to customize entity names. '
'If so, the input text should be in the '
'"cls_name1 . cls_name2 . cls_name3 ." format')
parser.add_argument(
'--chunked-size',
'-s',
type=int,
default=-1,
help='If the number of categories is very large, '
'you can set this parameter to split the text prompt into chunks '
'and run inference chunk by chunk.')

call_args = vars(parser.parse_args())

@@ -111,6 +134,12 @@
call_args['weights'] = call_args['model']
call_args['model'] = None

if call_args['texts'] is not None:
if call_args['texts'].startswith('$:'):
dataset_name = call_args['texts'][3:].strip()
class_names = get_classes(dataset_name)
call_args['texts'] = [tuple(class_names)]

init_kws = ['model', 'weights', 'device', 'palette']
init_args = {}
for init_kw in init_kws:
@@ -125,6 +154,10 @@ def main():
# may consume too much memory if your input folder has a lot of images.
# This will be optimized later.
inferencer = DetInferencer(**init_args)

chunked_size = call_args.pop('chunked_size')
inferencer.model.test_cfg.chunked_size = chunked_size

inferencer(**call_args)

if call_args['out_dir'] != '' and not (call_args['no_save_vis']
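
For completeness, the new `--chunked-size` option simply writes into `test_cfg.chunked_size` before running inference, as the diff above shows. A rough programmatic equivalent might look like the sketch below; the model name, score threshold, and chunk size are taken from the docstring examples above, and the keyword arguments mirror what the demo script forwards to `DetInferencer`, so treat it as an assumption-laden sketch rather than a documented API.

```python
from mmdet.apis import DetInferencer
from mmdet.evaluation import get_classes

# Rough equivalent of:
#   python demo/image_demo.py demo/demo.jpg \
#       glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
#       --texts '$: lvis' --pred-score-thr 0.7 --chunked-size 80
inferencer = DetInferencer(model='glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365')

# Split the 1203 LVIS class names into prompt chunks of at most 80 names.
inferencer.model.test_cfg.chunked_size = 80

class_names = get_classes('lvis')
inferencer('demo/demo.jpg',
           texts=[tuple(class_names)],
           pred_score_thr=0.7,
           out_dir='outputs')
```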