ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

stanford-futuredata/ColBERT

ColBERT (v2)

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
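The late-interaction scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration, not ColBERT's implementation: it assumes dot-product similarity, uses made-up 2-dimensional token embeddings instead of model output, and the name `maxsim_score` is hypothetical. For each query token it takes the maximum similarity over all passage tokens, then sums those maxima.

```python
def maxsim_score(query_embs, passage_embs):
    """Sum, over query tokens, of the best dot product with any passage token."""
    score = 0.0
    for q in query_embs:
        # Best-matching passage token for this query token (the "MaxSim" step).
        score += max(sum(qi * pi for qi, pi in zip(q, p)) for p in passage_embs)
    return score


# Toy token-embedding matrices (illustrative values only, one row per token).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
passage_b = [[0.5, 0.5], [0.5, 0.5]]

scores = {
    "a": maxsim_score(query, passage_a),
    "b": maxsim_score(query, passage_b),
}

# Rank passages by descending late-interaction score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Passage `a` contains an exact match for each query token, so it outscores passage `b`, whose tokens only partially match both. A real system would batch this as matrix operations on a GPU rather than loop in Python.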

These rich interactions allow ColBERT to surpass the quality of single-vector representation models while scaling efficiently to large corpora. You can read more in our papers.


Installation

ColBERT (currently: v2.0.2) requires Python 3.7+ and PyTorch 1.9+ and uses the HuggingFace Transformers library.

We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official conda installation guide.)

conda env create -f conda_env.yml
conda activate colbert-v0.4.2
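After activating the environment, a quick sanity check against the stated minimum versions (Python 3.7+, PyTorch 1.9+) might look like the sketch below. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of ColBERT. The script only reads `sys.version_info` and `torch.__version__`.

```python
import sys


def version_tuple(version):
    """Parse the leading 'major.minor' of a string such as '1.9.0+cu111'."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def meets_minimum(version, minimum):
    """Return True if `version` is at least the (major, minor) `minimum`."""
    return version_tuple(version) >= minimum


python_ok = sys.version_info[:2] >= (3, 7)

try:
    import torch
    torch_ok = meets_minimum(torch.__version__, (1, 9))
except ImportError:
    # PyTorch is not installed in the active environment.
    torch_ok = False

print("Python 3.7+:", python_ok)
print("PyTorch 1.9+:", torch_ok)
```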

If you face any problems, please open a new issue and we'll help you promptly!

UPDATED 2022/02/02: API Usage Notebook

The Jupyter notebook docs/intro.ipynb illustrates how to use the key features of ColBERT with the new Python API.

It shows how to download the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking and how to download our new LoTTE benchmark.

CPU execution

We have included a new environment file specifically for CPU-only environments (conda_env_cpu.yml). Note that if you are testing CPU execution on a machine that has GPUs, you might need to set CUDA_VISIBLE_DEVICES="" as part of your command.
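When the entry point is a Python script or notebook rather than a shell command, the same effect can be approximated by setting the variable from Python before any CUDA-aware library is imported. This is a sketch under that assumption. The PyTorch check is optional and is skipped if PyTorch is not installed.

```python
import os

# Hide all GPUs from CUDA-aware libraries. This must run before those
# libraries initialize CUDA, so place it at the very top of the script.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

try:
    import torch
    # With no visible devices, PyTorch should report that CUDA is unavailable.
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed; only the environment variable was set.")
```

Setting the variable on the command line, as described above, is equivalent and avoids any dependence on import order.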
