<br>

<p align="center">
- <a href="https://www.modelscope.cn/models?name=clip&tasks=multi-modal-embedding">ModelScope</a>  |  <a href="https://www.modelscope.cn/studios/damo/chinese_clip_applications/summary">Demo</a>  |  <a href="https://arxiv.org/abs/2211.01335">Paper </a>  |  Blog
+ <a href="https://www.modelscope.cn/models?name=clip&tasks=multi-modal-embedding">ModelScope</a> | <a href="https://www.modelscope.cn/studios/damo/chinese_clip_applications/summary">Demo</a> | <a href="https://arxiv.org/abs/2211.01335">Paper</a> | <a href="https://qwenlm.github.io/blog/chinese-clip/">Blog</a>
</p>
<br><br>

@@ -39,25 +39,27 @@ Currently, we release 5 different sizes of Chinese-CLIP models. Detailed informa

<table border="1" width="100%">
<tr align="center">
- <th>Model</th><th>Ckpt</th><th>#Params (All)</th><th>Backbone (I)</th><th>#Params (I)</th><th>Backbone (T)</th><th>#Params (T)</th><th>Resolution</th>
+ <th>Model ID</th><th>Model</th><th>Ckpt</th><th>#Params (All)</th><th>Backbone (I)</th><th>#Params (I)</th><th>Backbone (T)</th><th>#Params (T)</th><th>Resolution</th>
</tr>
<tr align="center">
- <td>CN-CLIP<sub>RN50</sub></td><td><a href="https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_rn50.pt">Download</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
+ <td>chinese-clip-rn50</td><td>CN-CLIP<sub>RN50</sub></td><td><a href="https://huggingface.co/OFA-Sys/chinese-clip-rn50">🤗</a> <a href="https://www.modelscope.cn/models/AI-ModelScope/chinese-clip-rn50">🤖</a></td><td>77M</td><td>ResNet50</td><td>38M</td><td>RBT3</td><td>39M</td><td>224</td>
</tr>
<tr align="center">
- <td>CN-CLIP<sub>ViT-B/16</sub></td><td><a href="https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-b-16.pt">Download</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
+ <td>chinese-clip-vit-base-patch16</td><td>CN-CLIP<sub>ViT-B/16</sub></td><td><a href="https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16">🤗</a> <a href="https://www.modelscope.cn/models/AI-ModelScope/chinese-clip-vit-base-patch16">🤖</a></td><td>188M</td><td>ViT-B/16</td><td>86M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
</tr>
<tr align="center">
- <td>CN-CLIP<sub>ViT-L/14</sub></td><td><a href="https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-l-14.pt">Download</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
+ <td>chinese-clip-vit-large-patch14</td><td>CN-CLIP<sub>ViT-L/14</sub></td><td><a href="https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14">🤗</a> <a href="https://www.modelscope.cn/models/AI-ModelScope/chinese-clip-vit-large-patch14">🤖</a></td><td>406M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>224</td>
</tr>
<tr align="center">
- <td>CN-CLIP<sub>ViT-L/14@336px</sub></td><td><a href="https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-l-14-336.pt">Download</a></td><td>407M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>336</td>
+ <td>chinese-clip-vit-large-patch14-336px</td><td>CN-CLIP<sub>ViT-L/14@336px</sub></td><td><a href="https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px">🤗</a> <a href="https://www.modelscope.cn/models/AI-ModelScope/chinese-clip-vit-large-patch14-336px">🤖</a></td><td>407M</td><td>ViT-L/14</td><td>304M</td><td>RoBERTa-wwm-Base</td><td>102M</td><td>336</td>
</tr>
<tr align="center">
- <td>CN-CLIP<sub>ViT-H/14</sub></td><td><a href="https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/clip_cn_vit-h-14.pt">Download</a></td><td>958M</td><td>ViT-H/14</td><td>632M</td><td>RoBERTa-wwm-Large</td><td>326M</td><td>224</td>
+ <td>chinese-clip-vit-huge-patch14</td><td>CN-CLIP<sub>ViT-H/14</sub></td><td><a href="https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14">🤗</a> <a href="https://www.modelscope.cn/models/AI-ModelScope/chinese-clip-vit-huge-patch14">🤖</a></td><td>958M</td><td>ViT-H/14</td><td>632M</td><td>RoBERTa-wwm-Large</td><td>326M</td><td>224</td>
</tr>
</table>
-<br></br>
+
+- 🤗 Hugging Face Hub
+- 🤖 ModelScope
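The 🤗 cells above point to Hugging Face Hub repos (the 🤖 cells to their ModelScope mirrors). As a hedged sketch of fetching one of these checkpoints programmatically (assuming `huggingface_hub` is installed; the repo id is taken from the ViT-B/16 row of the table):

```python
# Sketch: download one of the 🤗 repos listed in the table with huggingface_hub.
# The repo id comes from the ViT-B/16 row; any other row works the same way.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="OFA-Sys/chinese-clip-vit-base-patch16")
print("Checkpoint files downloaded to:", local_dir)
```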

## Results
We conducted zero-shot inference and finetuning experiments on [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap) and [COCO-CN](https://github.com/li-xirong/coco-cn) to evaluate cross-modal retrieval, and experiments on 10 image classification datasets of the [ELEVATER](https://eval.ai/web/challenges/challenge-page/1832) benchmark to evaluate zero-shot image classification. Results are shown below. Due to space limitations, we only list the performance of the best-performing Chinese-CLIP and baseline models here. For the detailed performance of each Chinese-CLIP model size, please refer to [Results.md](Results.md).
@@ -192,6 +194,8 @@ print("Available models:", available_models())
# Available models: ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']

device = "cuda" if torch.cuda.is_available() else "cpu"
+# If the checkpoint is not found locally, it will be downloaded from the Hugging Face Hub automatically.
+# Requires `huggingface_hub` to be installed.
model, preprocess = load_from_name("ViT-B-16", device=device, download_root='./')
model.eval()
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).to(device)
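The two comments added above describe automatic checkpoint download from the Hugging Face Hub. As a hedged alternative to the `cn_clip` snippet (not part of this README's own example), the same checkpoints can also be run through the `transformers` ChineseCLIP integration; the image path and candidate labels mirror the Pokémon example used elsewhere in this README:

```python
# Hedged sketch: zero-shot image classification with the transformers
# ChineseCLIP classes (assumes a recent transformers release).
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(model_id).eval()
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("examples/pokemon.jpeg")
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# Probabilities of the image matching each candidate text.
print(outputs.logits_per_image.softmax(dim=-1))
```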
@@ -315,7 +319,7 @@ ${DATAPATH}
└── test
```

-For easier use, we have provided preprocessed MUGE ([download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/MUGE.zip)) and Flickr30K-CN ([download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/Flickr30k-CN.zip)) datasets in zip format. To use them, just download and unzip it under `${DATAPATH}/datasets/`. If you need [COCO-CN](https://github.com/li-xirong/coco-cn) dataset, please contact us by email when you have finished applying for permission from the original author.
+For easier use, we have provided preprocessed MUGE ([🤗 download link](https://huggingface.co/datasets/OFA-Sys/chinese-clip-eval/resolve/main/MUGE.zip)) and Flickr30K-CN ([🤗 download link](https://huggingface.co/datasets/OFA-Sys/chinese-clip-eval/resolve/main/Flickr30k-CN.zip)) datasets in zip format. To use them, just download and unzip them under `${DATAPATH}/datasets/`. If you need the [COCO-CN](https://github.com/li-xirong/coco-cn) dataset, please contact us by email once you have obtained permission from the original authors.
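A hedged sketch of fetching and unpacking one of these zips with `huggingface_hub` (the repo id and filename follow the links above; the `DATAPATH` environment variable here is an illustrative stand-in for your `${DATAPATH}`):

```python
# Sketch: download the preprocessed MUGE zip from the 🤗 dataset repo and
# extract it under ${DATAPATH}/datasets/. Assumes huggingface_hub is installed.
import os
import zipfile
from huggingface_hub import hf_hub_download

datapath = os.environ.get("DATAPATH", "./datapath")
zip_path = hf_hub_download(
    repo_id="OFA-Sys/chinese-clip-eval",
    filename="MUGE.zip",
    repo_type="dataset",
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(os.path.join(datapath, "datasets"))
```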

### Finetuning

@@ -481,7 +485,7 @@ The printed results are shown below:
{"success": true, "score": 85.67, "scoreJson": {"score": 85.67, "mean_recall": 85.67, "r1": 71.2, "r5": 90.5, "r10": 95.3}}
```

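As a quick sanity check of the fields above, the reported `mean_recall` is consistent with the plain average of R@1, R@5 and R@10 (a tiny sketch, not the official evaluation script):

```python
# Sketch: mean_recall as the average of the three recall values reported above.
r1, r5, r10 = 71.2, 90.5, 95.3
print(round((r1 + r5 + r10) / 3, 2))  # 85.67
```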
-For better understanding of cross-modal retrieval by Chinese-CLIP, we also provide a runnable jupyter notebook ([download link](https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/others/Chinese-CLIP-on-MUGE-Retrieval.ipynb)), which works with the MUGE retrieval dataset (corresponding leaderboard is hosted on [Tianchi](https://tianchi.aliyun.com/competition/entrance/532031/introduction?lang=en-us)) and includes the finetuning and inference process mentioned above. Welcome to try!
+For a better understanding of cross-modal retrieval with Chinese-CLIP, we also provide a runnable Jupyter notebook ([download link](Chinese-CLIP-on-MUGE-Retrieval.ipynb)), which works with the MUGE retrieval dataset (the corresponding leaderboard is hosted on [Tianchi](https://tianchi.aliyun.com/competition/entrance/532031/introduction?lang=en-us)) and covers the finetuning and inference process described above. Feel free to try it!

<br>

@@ -520,7 +524,7 @@ airplane
anchor
...
```
-The label id is `[line number]-1`. For example, the label id for the first line is 0, and the one for the second line is 1. If the number of labels is larger than 10, all labels are filled with 0 by the left to 3-digit numbers. For example, if the number of labels is 100, the ids are `000-099`. Users should create a directory for each label, and put the corresponding samples into the directories. We provide the processed dataset CIFAR-100 as an example, and please click [this link](http://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/datasets/cifar-100.zip) to download the prepared dataset. To evaluate other datasets of ELEVATER, please refer to [Notes for datasets](zeroshot_dataset_en.md) for download.
+The label id is `[line number]-1`. For example, the label id for the first line is 0, and the one for the second line is 1. If there are more than 10 labels, the ids are zero-padded on the left to 3 digits; for example, with 100 labels the ids are `000-099`. Users should create a directory for each label and put the corresponding samples into it. We provide the processed CIFAR-100 dataset as an example; please visit the dataset repo [🤗 OFA-Sys/chinese-clip-eval](https://huggingface.co/datasets/OFA-Sys/chinese-clip-eval) or click this [🤗 link](https://huggingface.co/datasets/OFA-Sys/chinese-clip-eval/resolve/main/cifar-100.zip) to download it. To evaluate other ELEVATER datasets, please refer to [Notes for datasets](zeroshot_dataset_en.md) for download instructions.
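To make the `[line number]-1` and zero-padding rules concrete, here is a hedged sketch that derives the per-label directory names from a `label.txt` like the one above (the `label.txt` and `test` paths are illustrative, not prescribed by this README):

```python
# Sketch: create zero-padded label directories (0..9, or 000..NNN for >10 labels)
# following the "[line number]-1" id rule described above.
import os

with open("label.txt", encoding="utf-8") as f:
    labels = [line.strip() for line in f if line.strip()]

width = 3 if len(labels) > 10 else 1
for idx, name in enumerate(labels):  # label id = line number - 1
    label_id = str(idx).zfill(width)
    os.makedirs(os.path.join("test", label_id), exist_ok=True)
    # place the images whose class is `name` into test/<label_id>/
```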
<br><br>

### Prediction and Evaluation