Support LVIS chunked evaluation and image chunked inference of GLIP (#…
hhaAndroid committed Nov 9, 2023
1 parent 4a516c3 commit 51f8aee
Showing 11 changed files with 730 additions and 62 deletions.
23 changes: 22 additions & 1 deletion configs/glip/README.md
@@ -56,7 +56,7 @@ model.save_pretrained("your path/bert-base-uncased")
tokenizer.save_pretrained("your path/bert-base-uncased")
```

## Results and Models
## COCO Results and Models

| Model | Zero-shot or Finetune | COCO mAP | Official COCO mAP | Pre-Train Data | Config | Download |
| :--------: | :-------------------: | :------: | ----------------: | :------------------------: | :---------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
@@ -78,3 +78,24 @@ Note:
3. Taking the GLIP-T(A) model as an example, we trained it twice using the official code, and the fine-tuned mAPs were 52.5 and 52.6. The mAP we achieve in our reproduction is therefore higher than the official result, mainly because we modified the `weight_decay` parameter.
4. Our experiments revealed that training for 24 epochs leads to overfitting, so we report the best-performing checkpoint. If you train on a custom dataset, it is advisable to shorten the number of epochs and save the best-performing checkpoint (see the config sketch after these notes).
5. Because the official fine-tuning hyperparameters for the GLIP-L model are not available, we have not yet reproduced the official accuracy. We found that overfitting can also occur here, so custom changes to the data augmentation and the model may be necessary. Given the high cost of training, we have not investigated this further for now.
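
Notes 4 and 5 are training advice rather than something this commit changes. As a rough illustration of note 4, shortening the schedule and keeping only the best checkpoint can be expressed as a config override like the sketch below; the base config name and all values are illustrative assumptions using MMEngine's standard `train_cfg` and `CheckpointHook` fields, not settings from this commit.

```python
# Hypothetical fine-tuning override for a custom dataset: a shorter schedule
# plus "save the best checkpoint", as suggested in note 4 above.
_base_ = './your_glip_finetune_config.py'  # placeholder, not a file from this commit

# Fewer epochs to reduce overfitting; validate every epoch.
train_cfg = dict(max_epochs=12, val_interval=1)

default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,
        max_keep_ckpts=2,
        save_best='auto'))  # keep the checkpoint with the best validation metric
```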

## LVIS Results

| Model | Official | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data | Config | Download |
| :--------: | :------: | :---------: | :---------: | :---------: | :--------: | :--------: | :--------: | :--------: | :-------: | :------------------------: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: |
| GLIP-T (A) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
| GLIP-T (A) | | 12.1 | 15.5 | 25.8 | 20.2 | 6.2 | 10.9 | 22.8 | 14.7 | O365 | [config](lvis/glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_a_mmdet-b3654169.pth) |
| GLIP-T (B) | ✔ | | | | | | | | | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
| GLIP-T (B) | | 8.6 | 13.9 | 26.0 | 19.3 | 4.6 | 9.8 | 22.6 | 13.9 | O365 | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_b_mmdet-6dfbd102.pth) |
| GLIP-T (C) | ✔ | 14.3 | 19.4 | 31.1 | 24.6 | | | | | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
| GLIP-T (C) | | 14.4 | 19.8 | 31.9 | 25.2 | 8.3 | 13.2 | 28.1 | 18.2 | O365,GoldG | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_c_mmdet-2fc427dd.pth) |
| GLIP-T | ✔ | | | | | | | | | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
| GLIP-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 | O365,GoldG,CC3M,SBU | [config](lvis/glip_atss_swin-t_bc_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_tiny_mmdet-c24ce662.pth) |
| GLIP-L | ✔ | 29.2 | 34.9 | 42.1 | 37.9 | | | | | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |
| GLIP-L | | 27.9 | 33.7 | 39.7 | 36.1 | | | | | FourODs,GoldG,CC3M+12M,SBU | [config](lvis/glip_atss_swin-l_fpn_dyhead_pretrain_zeroshot_lvis.py) | [model](https://download.openmmlab.com/mmdetection/v3.0/glip/glip_l_mmdet-abfe026b.pth) |

Note:

1. The above are zero-shot evaluation results (see the note on chunked evaluation after this list).
2. The evaluation metric we used is LVIS FixedAP. For details, please refer to [Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details](https://arxiv.org/pdf/2102.01066.pdf).
3. We found that the performance of the small models is better than the official results, but that of the large model is lower. This is mainly because our post-processing is not fully aligned with the official GLIP implementation.
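
The commit adds chunked evaluation for exactly this setting: LVIS has 1203 categories, far too many to encode in a single text prompt, so the class names are split into chunks of `chunked_size` (40 in the configs below) and inference runs once per chunk, with the per-chunk detections presumably merged up to `max_per_img` boxes. The snippet below is only a minimal sketch of the chunking idea, not the implementation added in this commit.

```python
# Minimal sketch of the chunking idea (not the GLIP code from this commit).
def chunk_categories(class_names, chunked_size):
    """Split a long category list into prompt-sized chunks."""
    if chunked_size <= 0:
        return [list(class_names)]
    return [
        list(class_names[i:i + chunked_size])
        for i in range(0, len(class_names), chunked_size)
    ]

# With the 1203 LVIS categories and chunked_size=40, each forward pass
# only has to encode 40 class names.
chunks = chunk_categories([f'class_{i}' for i in range(1203)], 40)
print(len(chunks))  # 31 chunks
```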
@@ -0,0 +1,12 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'

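# Swin-L backbone settings (the base config uses Swin-T); the wider neck
# channels, early fusion and 8 DyHead blocks match the GLIP-L architecture.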
model = dict(
backbone=dict(
embed_dims=192,
depths=[2, 2, 18, 2],
num_heads=[6, 12, 24, 48],
window_size=12,
drop_path_rate=0.4,
),
neck=dict(in_channels=[384, 768, 1536]),
bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
@@ -0,0 +1,12 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'

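# Swin-L backbone settings (the base config uses Swin-T); the wider neck
# channels, early fusion and 8 DyHead blocks match the GLIP-L architecture.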
model = dict(
backbone=dict(
embed_dims=192,
depths=[2, 2, 18, 2],
num_heads=[6, 12, 24, 48],
window_size=12,
drop_path_rate=0.4,
),
neck=dict(in_channels=[384, 768, 1536]),
bbox_head=dict(early_fuse=True, num_dyhead_blocks=8))
@@ -0,0 +1,24 @@
_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'

model = dict(test_cfg=dict(
max_per_img=300,
chunked_size=40,
))

dataset_type = 'LVISV1Dataset'
data_root = 'data/coco/'

val_dataloader = dict(
dataset=dict(
data_root=data_root,
type=dataset_type,
ann_file='annotations/lvis_od_val.json',
data_prefix=dict(img='')))
test_dataloader = val_dataloader

# Note: the LVIS FixedAP evaluation requires numpy < 1.24.0
val_evaluator = dict(
_delete_=True,
type='LVISFixedAPMetric',
ann_file=data_root + 'annotations/lvis_od_val.json')
test_evaluator = val_evaluator
@@ -0,0 +1,25 @@
_base_ = '../glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365.py'

model = dict(test_cfg=dict(
max_per_img=300,
chunked_size=40,
))

dataset_type = 'LVISV1Dataset'
data_root = 'data/coco/'

val_dataloader = dict(
dataset=dict(
data_root=data_root,
type=dataset_type,
ann_file='annotations/lvis_v1_minival_inserted_image_name.json',
data_prefix=dict(img='')))
test_dataloader = val_dataloader

# Note: the LVIS FixedAP evaluation requires numpy < 1.24.0
val_evaluator = dict(
_delete_=True,
type='LVISFixedAPMetric',
ann_file=data_root +
'annotations/lvis_v1_minival_inserted_image_name.json')
test_evaluator = val_evaluator
@@ -0,0 +1,3 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_lvis.py'

model = dict(bbox_head=dict(early_fuse=True))
@@ -0,0 +1,3 @@
_base_ = './glip_atss_swin-t_a_fpn_dyhead_pretrain_zeroshot_mini-lvis.py'

model = dict(bbox_head=dict(early_fuse=True))
37 changes: 35 additions & 2 deletions demo/image_demo.py
@@ -28,6 +28,16 @@
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts 'There are a lot of cars here.'
python demo/image_demo.py demo/demo.jpg \
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts '$: coco'
python demo/image_demo.py demo/demo.jpg \
glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
--texts '$: lvis' --pred-score-thr 0.7 \
--palette random --chunked-size 80
Visualize prediction results::
python demo/image_demo.py demo/demo.jpg rtmdet-ins-s --show
@@ -41,6 +51,7 @@
from mmengine.logging import print_log

from mmdet.apis import DetInferencer
from mmdet.evaluation import get_classes


def parse_args():
@@ -60,7 +71,12 @@ def parse_args():
type=str,
default='outputs',
help='Output directory of images or prediction results.')
parser.add_argument('--texts', help='text prompt')
# If the text prompt is given in the form `$: xxx`, it is expanded to
# the class names of the corresponding dataset.
# Supported values: $: coco, $: voc, $: cityscapes, $: lvis, $: imagenet_det.
# See `mmdet/evaluation/functional/class_names.py` for details.
parser.add_argument(
'--texts', help='text prompt, such as "bench . car .", "$: coco"')
parser.add_argument(
'--device', default='cuda:0', help='Device used for inference')
parser.add_argument(
@@ -91,14 +107,21 @@ def parse_args():
default='none',
choices=['coco', 'voc', 'citys', 'random', 'none'],
help='Color palette used for visualization')
# only for GLIP
# only for GLIP and Grounding DINO
parser.add_argument(
'--custom-entities',
'-c',
action='store_true',
help='Whether to customize entity names. '
'If so, the input text should be in the '
'"cls_name1 . cls_name2 . cls_name3 ." format')
parser.add_argument(
'--chunked-size',
'-s',
type=int,
default=-1,
help='If the number of categories is very large, '
'you can set this parameter to split the text prompt into chunks '
'and run inference chunk by chunk.')

call_args = vars(parser.parse_args())

@@ -111,6 +134,12 @@
call_args['weights'] = call_args['model']
call_args['model'] = None

if call_args['texts'] is not None:
if call_args['texts'].startswith('$:'):
dataset_name = call_args['texts'][3:].strip()
class_names = get_classes(dataset_name)
call_args['texts'] = [tuple(class_names)]

init_kws = ['model', 'weights', 'device', 'palette']
init_args = {}
for init_kw in init_kws:
@@ -125,6 +154,10 @@ def main():
# may consume too much memory if your input folder has a lot of images.
# This will be optimized later.
inferencer = DetInferencer(**init_args)

chunked_size = call_args.pop('chunked_size')
inferencer.model.test_cfg.chunked_size = chunked_size

inferencer(**call_args)

if call_args['out_dir'] != '' and not (call_args['no_save_vis']
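
For completeness, the new `--chunked-size` option simply writes into `test_cfg.chunked_size` before running inference, as the diff above shows. A rough programmatic equivalent might look like the sketch below; the model name, score threshold, and chunk size are taken from the docstring examples above, and the keyword arguments mirror what the demo script forwards to `DetInferencer`, so treat it as an assumption-laden sketch rather than a documented API.

```python
from mmdet.apis import DetInferencer
from mmdet.evaluation import get_classes

# Rough equivalent of:
#   python demo/image_demo.py demo/demo.jpg \
#       glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365 \
#       --texts '$: lvis' --pred-score-thr 0.7 --chunked-size 80
inferencer = DetInferencer(model='glip_atss_swin-t_a_fpn_dyhead_pretrain_obj365')

# Split the 1203 LVIS class names into prompt chunks of at most 80 names.
inferencer.model.test_cfg.chunked_size = 80

class_names = get_classes('lvis')
inferencer('demo/demo.jpg',
           texts=[tuple(class_names)],
           pred_score_thr=0.7,
           out_dir='outputs')
```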