


ColBERT (v2)

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
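
The late-interaction scoring described above can be sketched in a few lines. This is a minimal pure-Python illustration, not ColBERT's actual implementation: the function names (`normalize`, `maxsim_score`) and the two-dimensional toy embeddings are hypothetical. It assumes the MaxSim formulation from the ColBERT paper, where each query token embedding is matched against its most similar passage token embedding (cosine similarity) and the per-token maxima are summed.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def maxsim_score(query_embs, passage_embs):
    """Sum, over query tokens, of the best cosine match among passage tokens."""
    query = [normalize(v) for v in query_embs]
    passage = [normalize(v) for v in passage_embs]
    score = 0.0
    for q_vec in query:
        # MaxSim: keep only the most similar passage token for this query token.
        score += max(sum(a * b for a, b in zip(q_vec, p_vec)) for p_vec in passage)
    return score


# Toy data (hypothetical): one row per token, two dimensions per embedding.
query = [[1.0, 0.0], [0.0, 1.0]]
passages = {
    "passage_a": [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    "passage_b": [[1.0, 1.0]],
}

# Rank passages by descending late-interaction score.
ranking = sorted(passages, key=lambda name: maxsim_score(query, passages[name]), reverse=True)
print(ranking)
```

Because passage embeddings do not depend on the query, they can be computed and indexed offline; only the query matrix and the MaxSim step are needed at search time.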

These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora. You can read more in our papers:

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (SIGIR'20).
Relevance-guided Supervision for OpenQA with ColBERT (TACL'21).
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS'21).
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (preprint).

Installation

ColBERT (currently: v2.0.2) requires Python 3.7+ and PyTorch 1.9+ and uses the HuggingFace Transformers library.

We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official conda installation guide.)

conda env create -f conda_env.yml
conda activate colbert-v0.4.2
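
After activating the environment, a quick check along the following lines may help confirm that the stated requirements (Python 3.7+, PyTorch 1.9+, Transformers installed) are met. This is only a sketch: the helper names `version_tuple` and `find_problems` are hypothetical and not part of ColBERT, and the check looks only at the major and minor version numbers.

```python
import sys
from importlib import import_module

MIN_PYTHON = (3, 7)  # minimum versions as stated in this README
MIN_TORCH = (1, 9)


def version_tuple(version):
    """Return (major, minor) from a version string such as '1.9.0+cu111'."""
    parts = (str(version).split(".") + ["0"])[:2]
    digits = ""
    for ch in parts[1]:
        if not ch.isdigit():
            break
        digits += ch
    return int(parts[0]), int(digits or 0)


def find_problems():
    """Collect human-readable descriptions of unmet requirements."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.7 or newer is required")
    for module_name, label in (("torch", "PyTorch"), ("transformers", "Transformers")):
        try:
            module = import_module(module_name)
        except ImportError:
            problems.append(label + " is not installed")
            continue
        if module_name == "torch" and version_tuple(module.__version__) < MIN_TORCH:
            problems.append("PyTorch 1.9 or newer is required")
    return problems


for problem in find_problems():
    print("Problem:", problem)
```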

If you face any problems, please open a new issue and we'll help you promptly!

UPDATED 2022/02/02: API Usage Notebook

The Jupyter notebook docs/intro.ipynb illustrates the key features of ColBERT with the new Python API.

It shows how to download the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking and how to download our new LoTTE benchmark.

CPU execution

We have included a new environment file specifically for CPU-only environments (conda_env_cpu.yml). Note that if you are testing CPU execution on a machine that has GPUs, you might need to set CUDA_VISIBLE_DEVICES="" as part of your command.
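
As an alternative to setting the variable on the command line, it can be set from inside a Python script. The sketch below assumes the variable is set before PyTorch initializes CUDA (simplest is before `import torch`), since changes made afterwards generally have no effect. The helper name `hide_gpus` is hypothetical and not part of ColBERT.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries that are imported after this call."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus()

try:
    import torch  # imported only after the variable is set

    gpu_visible = torch.cuda.is_available()
except ImportError:
    gpu_visible = None  # PyTorch is not installed in this environment

print("GPU visible to PyTorch:", gpu_visible)
```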
