
Comparing changes

This is a direct comparison between two commits made in this repository or its related repositories.

base repository: stanford-futuredata/ColBERT
base: fb29d151ec1d70b7929ea4a1e74f66d9b5666ccd
head repository: stanford-futuredata/ColBERT
compare: cf44222136a221ce638ddc4069433858a3b85afd
Showing 169 changed files with 24,742 additions and 2,212 deletions.
  1. +3 −0 .env
  2. +12 −4 .gitignore
  3. +94 −0 LoTTE.md
  4. +107 −82 README.md
  5. +141 −0 baleen/condenser/condense.py
  6. +79 −0 baleen/condenser/model.py
  7. +118 −0 baleen/condenser/tokenization.py
  8. +58 −0 baleen/engine.py
  9. +40 −0 baleen/hop_searcher.py
  10. +37 −0 baleen/utils/annotate.py
  11. +50 −0 baleen/utils/loaders.py
  12. +6 −0 colbert/__init__.py
  13. +5 −0 colbert/data/__init__.py
  14. +100 −0 colbert/data/collection.py
  15. +14 −0 colbert/data/dataset.py
  16. +82 −0 colbert/data/examples.py
  17. +163 −0 colbert/data/queries.py
  18. +94 −0 colbert/data/ranking.py
  19. +52 −0 colbert/distillation/ranking_scorer.py
  20. +68 −0 colbert/distillation/scorer.py
  21. +5 −3 colbert/evaluation/loaders.py
  22. +0 −88 colbert/evaluation/ranking.py
  23. +0 −57 colbert/evaluation/ranking_logger.py
  24. +0 −21 colbert/evaluation/slow.py
  25. +9 −51 colbert/index.py
  26. +0 −43 colbert/index_faiss.py
  27. +516 −0 colbert/index_updater.py
  28. +85 −0 colbert/indexer.py
  29. 0 colbert/indexing/codecs/__init__.py
  30. +23 −0 colbert/indexing/codecs/decompress_residuals.cpp
  31. +75 −0 colbert/indexing/codecs/decompress_residuals.cu
  32. +12 −0 colbert/indexing/codecs/packbits.cpp
  33. +57 −0 colbert/indexing/codecs/packbits.cu
  34. +276 −0 colbert/indexing/codecs/residual.py
  35. +95 −0 colbert/indexing/codecs/residual_embeddings.py
  36. +41 −0 colbert/indexing/codecs/residual_embeddings_strided.py
  37. +45 −0 colbert/indexing/collection_encoder.py
  38. +504 −0 colbert/indexing/collection_indexer.py
  39. +0 −187 colbert/indexing/encoder.py
  40. +0 −116 colbert/indexing/faiss.py
  41. +0 −58 colbert/indexing/faiss_index.py
  42. +0 −138 colbert/indexing/faiss_index_gpu.py
  43. +18 −2 colbert/indexing/index_manager.py
  44. +90 −0 colbert/indexing/index_saver.py
  45. +39 −7 colbert/indexing/loaders.py
  46. +54 −0 colbert/indexing/utils.py
  47. +2 −0 colbert/infra/__init__.py
  48. +2 −0 colbert/infra/config/__init__.py
  49. +105 −0 colbert/infra/config/base_config.py
  50. +15 −0 colbert/infra/config/config.py
  51. +86 −0 colbert/infra/config/core_config.py
  52. +174 −0 colbert/infra/config/settings.py
  53. +147 −0 colbert/infra/launcher.py
  54. +43 −0 colbert/infra/provenance.py
  55. +92 −0 colbert/infra/run.py
  56. +115 −0 colbert/infra/utilities/annotate_em.py
  57. +52 −0 colbert/infra/utilities/create_triples.py
  58. +64 −0 colbert/infra/utilities/minicorpus.py
  59. +116 −0 colbert/modeling/base_colbert.py
  60. +146 −0 colbert/modeling/checkpoint.py
  61. +169 −36 colbert/modeling/colbert.py
  62. +144 −0 colbert/modeling/hf_colbert.py
  63. +0 −87 colbert/modeling/inference.py
  64. 0 colbert/modeling/reranker/__init__.py
  65. +35 −0 colbert/modeling/reranker/electra.py
  66. +15 −0 colbert/modeling/reranker/tokenizer.py
  67. +97 −0 colbert/modeling/segmented_maxsim.cpp
  68. +11 −7 colbert/modeling/tokenization/doc_tokenization.py
  69. +42 −9 colbert/modeling/tokenization/query_tokenization.py
  70. +31 −19 colbert/modeling/tokenization/utils.py
  71. +4 −1 colbert/parameters.py
  72. +0 −131 colbert/ranking/batch_reranking.py
  73. +0 −50 colbert/ranking/batch_retrieval.py
  74. +0 −122 colbert/ranking/faiss_index.py
  75. +0 −82 colbert/ranking/index_part.py
  76. +0 −164 colbert/ranking/index_ranker.py
  77. +0 −43 colbert/ranking/rankers.py
  78. +0 −61 colbert/ranking/reranking.py
  79. +0 −61 colbert/ranking/retrieval.py
  80. +0 −50 colbert/rerank.py
  81. +0 −56 colbert/retrieve.py
  82. 0 colbert/search/__init__.py
  83. +64 −0 colbert/search/candidate_generation.py
  84. +160 −0 colbert/search/decompress_residuals.cpp
  85. +169 −0 colbert/search/filter_pids.cpp
  86. +86 −0 colbert/search/index_loader.py
  87. +173 −0 colbert/search/index_storage.py
  88. +148 −0 colbert/search/segmented_lookup.cpp
  89. +219 −0 colbert/search/strided_tensor.py
  90. +130 −0 colbert/search/strided_tensor_core.py
  91. +110 −0 colbert/searcher.py
  92. +0 −49 colbert/test.py
  93. +95 −0 colbert/tests/e2e_test.py
  94. +199 −0 colbert/tests/index_updater_test.py
  95. +0 −34 colbert/train.py
  96. +36 −0 colbert/trainer.py
  97. +41 −69 colbert/training/lazy_batcher.py
  98. +75 −0 colbert/training/rerank_batcher.py
  99. +108 −73 colbert/training/training.py
  100. +38 −11 colbert/training/utils.py
  101. +113 −0 colbert/utilities/annotate_em.py
  102. +65 −0 colbert/utilities/create_triples.py
  103. +66 −0 colbert/utilities/minicorpus.py
  104. +7 −9 colbert/utils/amp.py
  105. +18 −6 colbert/utils/distributed.py
  106. +31 −30 colbert/utils/logging.py
  107. +6 −1 colbert/utils/parser.py
  108. +2 −2 colbert/utils/runs.py
  109. +57 −18 colbert/utils/utils.py
  110. +23 −10 conda_env.yml
  111. +22 −0 conda_env_cpu.yml
  112. +4 −0 docs/.buildinfo
  113. BIN docs/.doctrees/environment.pickle
  114. BIN docs/.doctrees/index.doctree
  115. +20 −0 docs/Makefile
  116. +321 −0 docs/doctools.js
  117. BIN docs/doctrees/environment.pickle
  118. BIN docs/doctrees/index.doctree
  119. BIN docs/doctrees/indexer.doctree
  120. BIN docs/doctrees/searcher.doctree
  121. BIN docs/doctrees/trainer.doctree
  122. +12 −0 docs/documentation_options.js
  123. +4 −0 docs/html/.buildinfo
  124. +19 −0 docs/html/_sources/index.rst.txt
  125. +14 −0 docs/html/_sources/indexer.rst.txt
  126. +14 −0 docs/html/_sources/searcher.rst.txt
  127. +12 −0 docs/html/_sources/trainer.rst.txt
  128. +701 −0 docs/html/_static/alabaster.css
  129. +856 −0 docs/html/_static/basic.css
  130. +1 −0 docs/html/_static/custom.css
  131. +321 −0 docs/html/_static/doctools.js
  132. +12 −0 docs/html/_static/documentation_options.js
  133. BIN docs/html/_static/file.png
  134. +10,872 −0 docs/html/_static/jquery-3.5.1.js
  135. +2 −0 docs/html/_static/jquery.js
  136. +297 −0 docs/html/_static/language_data.js
  137. BIN docs/html/_static/minus.png
  138. BIN docs/html/_static/plus.png
  139. +77 −0 docs/html/_static/pygments.css
  140. +522 −0 docs/html/_static/searchtools.js
  141. +2,027 −0 docs/html/_static/underscore-1.12.0.js
  142. +6 −0 docs/html/_static/underscore.js
  143. +220 −0 docs/html/genindex.html
  144. +151 −0 docs/html/indexer.html
  145. BIN docs/html/objects.inv
  146. +115 −0 docs/html/search.html
  147. +149 −0 docs/html/searcher.html
  148. +1 −0 docs/html/searchindex.js
  149. +142 −0 docs/html/trainer.html
  150. +123 −0 docs/index.html
  151. +257 −0 docs/intro.ipynb
  152. +2 −0 docs/jquery.js
  153. +35 −0 docs/make.bat
  154. +65 −0 docs/source/conf.py
  155. +19 −0 docs/source/index.rst
  156. +14 −0 docs/source/indexer.rst
  157. +14 −0 docs/source/searcher.rst
  158. +12 −0 docs/source/trainer.rst
  159. +6 −0 docs/underscore.js
  160. +48 −0 server.py
  161. 0 utility/__init__.py
  162. 0 utility/evaluate/__init__.py
  163. +1 −1 utility/evaluate/msmarco_passages.py
  164. 0 utility/preprocess/__init__.py
  165. +0 −63 utility/preprocess/wikipedia_to_tsv.py
  166. 0 utility/rankings/__init__.py
  167. 0 utility/supervision/__init__.py
  168. 0 utility/utils/__init__.py
  169. +19 −0 utility/utils/save_metadata.py
3 changes: 3 additions & 0 deletions .env
@@ -0,0 +1,3 @@
INDEX_ROOT=""
INDEX_NAME=""
PORT="8893"
16 changes: 12 additions & 4 deletions .gitignore
@@ -1,7 +1,10 @@
experiments/
checkpoints/
data/
logs/
/experiments/
/checkpoints/
/data/
/logs/
/mlruns/
/profiler/
/logs/

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -10,6 +13,11 @@ __pycache__/

# Jupyter Notebook
.ipynb_checkpoints
# notebooks/

# mac
.DS_Store

# Other
.vscode
*.tsv
94 changes: 94 additions & 0 deletions LoTTE.md
@@ -0,0 +1,94 @@
## LoTTE dataset

The <b>Lo</b>ng-<b>T</b>ail <b>T</b>opic-stratified <b>E</b>valuation (LoTTE) benchmark includes 12 domain-specific datasets derived from StackExchange questions and answers. The datasets span topics including writing, recreation, science, technology, and lifestyle. LoTTE includes two sets of queries: the first consists of search-based queries from the GooAQ dataset, while the second consists of forum-based queries taken directly from StackExchange.

The dataset can be downloaded from this link: [https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz](https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz)

The dataset is organized as follows:
```
|-- lotte
    |-- writing
        |-- dev
            |-- collection.tsv
            |-- metadata.jsonl
            |-- questions.search.tsv
            |-- qas.search.jsonl
            |-- questions.forum.tsv
            |-- qas.forum.jsonl
        |-- test
            |-- collection.tsv
            |-- ...
    |-- recreation
        |-- ...
    |-- ...
```
Here is a description of each file's contents:
- `collection.tsv`: A list of passages, where each line is of the form `[pid]\t[text]`
- `metadata.jsonl`: A list of JSON dictionaries for each question where each line is of the form:
```
{
    "dataset": dataset,
    "question_id": question_id,
    "post_ids": [post_id_1, post_id_2, ..., post_id_n],
    "scores": [score_1, score_2, ..., score_n],
    "post_urls": [url_1, url_2, ..., url_n],
    "post_authors": [author_1, author_2, ..., author_n],
    "post_author_urls": [url_1, url_2, ..., url_n],
    "question_author": question_author,
    "question_author_url": question_author_url
}
```
- `questions.search.tsv`: A list of search-based questions of the form `[qid]\t[text]`
- `qas.search.jsonl`: A list of JSON dictionaries for each search-based question's answer data of the form:

```
{
    "qid": qid,
    "query": query,
    "answer_pids": answer_pids
}
```
- `questions.forum.tsv`: A list of forum-based questions
- `qas.forum.jsonl`: A list of JSON dictionaries for each forum-based question's answer data

We also include a script to evaluate LoTTE rankings: `evaluate_lotte_rankings.py`. Each rankings file must be in a tsv format with each line of the form `[qid]\t[pid]\t[rank]\t[score]`. Note that `qid`s must be in sequential order starting from 0, and `rank`s must be in sequential order starting from 1. The rankings directory must have the following structure:
```
|-- rankings
    |-- dev
        |-- writing.search.ranking.tsv
        |-- writing.forum.ranking.tsv
        |-- recreation.search.ranking.tsv
        |-- recreation.forum.ranking.tsv
        |-- science.search.ranking.tsv
        |-- science.forum.ranking.tsv
        |-- technology.search.ranking.tsv
        |-- technology.forum.ranking.tsv
        |-- lifestyle.search.ranking.tsv
        |-- lifestyle.forum.ranking.tsv
        |-- pooled.search.ranking.tsv
        |-- pooled.forum.ranking.tsv
    |-- test
        |-- writing.search.ranking.tsv
        |-- ...
```
Note that the file names must match exactly, though if some files are missing the script will print partial results. An example usage of the script is as follows:
```
python evaluate_lotte_rankings.py --k 5 --split test --data_path /path/to/lotte --rankings_path /path/to/rankings
```
This will produce the following output (numbers taken from the ColBERTv2 evaluation):
```
[query_type=search, dataset=writing] Success@5: 80.1
[query_type=search, dataset=recreation] Success@5: 72.3
[query_type=search, dataset=science] Success@5: 56.7
[query_type=search, dataset=technology] Success@5: 66.1
[query_type=search, dataset=lifestyle] Success@5: 84.7
[query_type=search, dataset=pooled] Success@5: 71.6
[query_type=forum, dataset=writing] Success@5: 76.3
[query_type=forum, dataset=recreation] Success@5: 70.8
[query_type=forum, dataset=science] Success@5: 46.1
[query_type=forum, dataset=technology] Success@5: 53.6
[query_type=forum, dataset=lifestyle] Success@5: 76.9
[query_type=forum, dataset=pooled] Success@5: 63.4
```
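
For reference, here is a minimal sketch of how a Success@k number like those above can be computed from a single rankings file and its matching `qas.*.jsonl` file. This is an illustration only, not a substitute for `evaluate_lotte_rankings.py`, and the paths are placeholders.

```
import json
from collections import defaultdict

def success_at_k(rankings_path, qas_path, k=5):
    # Gold answer pids per qid, from qas.search.jsonl or qas.forum.jsonl.
    answers = {}
    with open(qas_path) as f:
        for line in f:
            qas = json.loads(line)
            answers[qas["qid"]] = set(qas["answer_pids"])

    # Top-k retrieved pids per qid, from a [qid]\t[pid]\t[rank]\t[score] file.
    topk = defaultdict(set)
    with open(rankings_path) as f:
        for line in f:
            qid, pid, rank, _score = line.strip().split("\t")
            if int(rank) <= k:
                topk[int(qid)].add(int(pid))

    # Fraction of queries with at least one gold pid among the top-k results.
    hits = sum(1 for qid, gold in answers.items() if topk[qid] & gold)
    return 100.0 * hits / len(answers)

print(success_at_k("rankings/test/writing.search.ranking.tsv",
                   "lotte/writing/test/qas.search.jsonl"))
```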

189 changes: 107 additions & 82 deletions README.md
@@ -1,16 +1,12 @@
----
----

**Update: The branch [`new_api`](https://github.com/stanford-futuredata/ColBERT/tree/new_api) contains a new simpler API plus the code for the new [ColBERTv2](https://arxiv.org/abs/2112.01488) model, including a public checkpoint as well as a public release of our LoTTE benchmark.**

----
## 🚨 **Announcements**

----
* (1/29/23) We have merged a new index updater feature and support for additional Hugging Face models! These are in beta so please give us feedback as you try them out.
* (1/24/23) If you're looking for the **DSP** framework for composing ColBERTv2 and LLMs, it's at: https://github.com/stanfordnlp/dsp

# ColBERT (v2)

# ColBERT
### ColBERT is a _fast_ and _accurate_ retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.


<p align="center">
<img align="center" src="docs/images/ColBERT-Framework-MaxSim-W370px.png" />
@@ -25,144 +21,173 @@ These rich interactions allow ColBERT to surpass the quality of _single-vector_

* [**ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT**](https://arxiv.org/abs/2004.12832) (SIGIR'20).
* [**Relevance-guided Supervision for OpenQA with ColBERT**](https://arxiv.org/abs/2007.00814) (TACL'21).
* [**ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction**](https://arxiv.org/abs/2112.01488) (to appear at NAACL'22).
* [**Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval**](https://arxiv.org/abs/2101.00436) (NeurIPS'21).
* [**ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction**](https://arxiv.org/abs/2112.01488) (NAACL'22).
* [**PLAID: An Efficient Engine for Late Interaction Retrieval**](https://arxiv.org/abs/2205.09707) (CIKM'22).

----

## ColBERTv1

The ColBERTv1 code from the SIGIR'20 paper is in the [`colbertv1` branch](https://github.com/stanford-futuredata/ColBERT/tree/colbertv1). See [here](#branches) for more information on other branches.


## Installation

ColBERT (currently: [v0.2.0](#releases)) requires Python 3.7+ and Pytorch 1.6+ and uses the [HuggingFace Transformers](https://github.com/huggingface/transformers) library.
ColBERT requires Python 3.7+ and PyTorch 1.9+ and uses the [Hugging Face Transformers](https://github.com/huggingface/transformers) library.

We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official [conda installation guide](https://docs.anaconda.com/anaconda/install/linux/#installation).)

We strongly recommend creating a conda environment using:
We have also included a new environment file for CPU-only environments (`conda_env_cpu.yml`). If you are testing CPU execution on a machine that includes GPUs, you may need to set `CUDA_VISIBLE_DEVICES=""` as part of your command. Note that a GPU is required for training and indexing.

```
conda env create -f conda_env.yml
conda activate colbert-v0.2
conda env create -f conda_env[_cpu].yml
conda activate colbert
```

If you face any problems, please [open a new issue](https://github.com/stanford-futuredata/ColBERT/issues) and we'll help you promptly!



## Overview

Using ColBERT on a dataset typically involves the following steps.

**Step 0: Preprocess your collection.** At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., `collection.tsv`) will contain all passages and another (e.g., `queries.tsv`) will contain a set of queries for searching the collection.

**Step 1: Train a ColBERT model.** You can [train your own ColBERT model](#training) and [validate performance](#validation) on a suitable development set.
**Step 1: Download the [pre-trained ColBERTv2 checkpoint](https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz).** This checkpoint has been trained on the MS MARCO Passage Ranking task. You can also _optionally_ [train your own ColBERT model](#training).

**Step 2: Index your collection.** Once you're happy with your ColBERT model, you need to [index your collection](#indexing) to permit fast retrieval. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
**Step 2: Index your collection.** Once you have a trained ColBERT model, you need to [index your collection](#indexing) to permit fast retrieval. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.

**Step 3: Search the collection with your queries.** Given your model and index, you can [issue queries over the collection](#retrieval) to retrieve the top-k passages for each query.
**Step 3: Search the collection with your queries.** Given the model and index, you can [issue queries over the collection](#retrieval) to retrieve the top-k passages for each query.

Below, we illustrate these steps via an example run on the MS MARCO Passage Ranking task.


## API Usage Notebook

The Jupyter notebook **[docs/intro.ipynb](docs/intro.ipynb)** illustrates how to use the key features of ColBERT with the new Python API.

It covers how to download the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking and how to download our new LoTTE benchmark.


## Data

This repository works directly with a simple **tab-separated file** format to store queries, passages, and top-k ranked lists.


* Queries: each line is `qid \t query text`.
* Collection: each line is `pid \t passage text`.
* Top-k Ranking: each line is `qid \t pid \t rank`.

This works directly with the data format of the [MS MARCO Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) dataset. You will need the training triples (`triples.train.small.tar.gz`), the official top-1000 ranked lists for the dev set queries (`top1000.dev`), and the dev set relevant passages (`qrels.dev.small.tsv`). For indexing the full collection, you will also need the list of passages (`collection.tar.gz`).
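
As a rough illustration of these formats, the sketch below loads a queries TSV and a collection TSV into plain Python dictionaries. The paths are placeholders, and the repository's own `Queries` and `Collection` classes in `colbert.data` are the intended way to handle these files.

```
def load_tsv(path):
    # Each line: id \t text. Returns {id: text}.
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, text = line.rstrip("\n").split("\t", 1)
            mapping[int(idx)] = text
    return mapping

queries = load_tsv("/path/to/MSMARCO/queries.dev.small.tsv")   # qid -> query text
collection = load_tsv("/path/to/MSMARCO/collection.tsv")       # pid -> passage text
print(len(queries), "queries,", len(collection), "passages")
```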


## Indexing

## Training

Training requires a list of _<query, positive passage, negative passage>_ tab-separated triples.

You can supply **full-text** triples, where each line is `query text \t positive passage text \t negative passage text`. Alternatively, you can supply the query and passage **IDs** as a JSONL file `[qid, pid+, pid-]` per line, in which case you should specify `--collection path/to/collection.tsv` and `--queries path/to/queries.train.tsv`.
For fast retrieval, indexing precomputes the ColBERT representations of passages.

Example usage:

```
CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.train --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 \
--triples /path/to/MSMARCO/triples.train.small.tsv \
--root /root/to/experiments/ --experiment MSMARCO-psg --similarity l2 --run msmarco.psg.l2
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__=='__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            nbits=2,
            root="/path/to/experiments",
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")
```

You can use one or more GPUs by modifying `CUDA_VISIBLE_DEVICES` and `--nproc_per_node`.


## Validation

Before indexing into ColBERT, you can compare a few checkpoints by re-ranking a top-k set of documents per query. This will use ColBERT _on-the-fly_: it will compute document representations _during_ query evaluation.

This script requires the top-k list per query, provided as a tab-separated file whose every line contains a tuple `queryID \t passageID \t rank`, where rank is {1, 2, 3, ...} for each query. The script also accepts the format of MS MARCO's `top1000.dev` and `top1000.eval` and you can optionally supply relevance judgements (qrels) for evaluation. This is a tab-separated file whose every line has a quadruple _<query ID, 0, passage ID, 1>_, like `qrels.dev.small.tsv`.
## Retrieval

Example command:
We typically recommend that you use ColBERT for **end-to-end** retrieval, where it directly finds its top-k passages from the full collection:

```
python -m colbert.test --amp --doc_maxlen 180 --mask-punctuation \
--collection /path/to/MSMARCO/collection.tsv \
--queries /path/to/MSMARCO/queries.dev.small.tsv \
--topk /path/to/MSMARCO/top1000.dev \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--root /root/to/experiments/ --experiment MSMARCO-psg [--qrels path/to/qrels.dev.small.tsv]
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__=='__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")
```

You can optionally specify the `ncells`, `centroid_score_threshold`, and `ndocs` search hyperparameters to trade off between speed and result quality. Defaults for different values of `k` are listed in colbert/searcher.py.
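
As a sketch of what such an override could look like, assuming these hyperparameters are passed through `ColBERTConfig` like the other settings in this README (the exact field names and defaults should be checked against colbert/searcher.py and colbert/infra/config/settings.py; the values below are made up):

```
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__ == '__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        config = ColBERTConfig(
            root="/path/to/experiments",
            ncells=4,                       # candidate generation: probe more centroids
            centroid_score_threshold=0.45,  # prune fewer candidate embeddings
            ndocs=1024,                     # re-rank a larger candidate set
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        results = searcher.search("what is late interaction?", k=100)
```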

## Indexing

For fast retrieval, indexing precomputes the ColBERT representations of passages.

Example command:
We can evaluate the MSMARCO rankings using the following command:

```
CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=6 \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.index --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--collection /path/to/MSMARCO/collection.tsv \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--root /root/to/experiments/ --experiment MSMARCO-psg
python -m utility.evaluate.msmarco_passages --ranking "/path/to/msmarco.nbits=2.ranking.tsv" --qrels "/path/to/MSMARCO/qrels.dev.small.tsv"
```

The index created here allows you to re-rank the top-k passages retrieved by another method (e.g., BM25).

We typically recommend that you use ColBERT for **end-to-end** retrieval, where it directly finds its top-k passages from the full collection. For this, you need FAISS indexing.
## Training

We provide a [pre-trained model checkpoint](https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz), but we also detail how to train from scratch here.
Note that this example demonstrates the ColBERTv1 style of training, but the provided checkpoint was trained with ColBERTv2.

#### FAISS Indexing for end-to-end retrieval
Training requires a JSONL triples file with a `[qid, pid+, pid-]` list per line. The query IDs and passage IDs correspond to the specified `queries.tsv` and `collection.tsv` files respectively.

For end-to-end retrieval, you should index the document representations into [FAISS](https://github.com/facebookresearch/faiss).
Example usage (training on 4 GPUs):

```
python -m colbert.index_faiss \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--partitions 32768 --sample 0.3 \
--root /root/to/experiments/ --experiment MSMARCO-psg
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

if __name__=='__main__':
    with Run().context(RunConfig(nranks=4, experiment="msmarco")):

        config = ColBERTConfig(
            bsize=32,
            root="/path/to/experiments",
        )
        trainer = Trainer(
            triples="/path/to/MSMARCO/triples.train.small.tsv",
            queries="/path/to/MSMARCO/queries.train.small.tsv",
            collection="/path/to/MSMARCO/collection.tsv",
            config=config,
        )

        checkpoint_path = trainer.train()

        print(f"Saved checkpoint to {checkpoint_path}...")
```

## Running a lightweight ColBERTv2 server
We provide a script to run a lightweight server that serves the top-k (up to 100) results in ranked order for a given search query, in JSON format. This script can be used to power DSP programs.

## Retrieval

In the simplest case, you want to retrieve from the full collection:

To run the server, update the environment variables `INDEX_ROOT` and `INDEX_NAME` in the `.env` file to point to the appropriate ColBERT index. Then run the following command:
```
python -m colbert.retrieve \
--amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--queries /path/to/MSMARCO/queries.dev.small.tsv \
--nprobe 32 --partitions 32768 --faiss_depth 1024 \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--root /root/to/experiments/ --experiment MSMARCO-psg
python server.py
```

You may also want to re-rank a top-k set that you've retrieved before with ColBERT or with another model. For this, use `colbert.rerank` similarly and additionally pass `--topk`.

If you have a large set of queries (or want to reduce memory usage), use **batch-mode** retrieval and/or re-ranking. This can be done by passing `--batch --retrieve_only` to `colbert.retrieve` and passing `--batch --log-scores` to colbert.rerank alongside `--topk` with the `unordered.tsv` output of this retrieval run.

Some use cases (e.g., building a user-facing search engine) require more control over retrieval. For those, you typically don't want to use the command line for retrieval. Instead, you want to import our retrieval API from Python and work with it directly (e.g., to build a simple REST API). Instructions for this are coming soon, but you will just need to adapt/modify the retrieval loop in [`colbert/ranking/retrieval.py#L33`](colbert/ranking/retrieval.py#L33).
A sample query:
```
http://localhost:8893/api/search?query=Who won the 2022 FIFA world cup&k=25
```
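
For programmatic access, a client-side sketch along these lines should work; it assumes only that the endpoint returns JSON, and the exact response fields depend on `server.py`:

```
import requests

response = requests.get(
    "http://localhost:8893/api/search",
    params={"query": "Who won the 2022 FIFA world cup", "k": 25},
)
response.raise_for_status()
print(response.json())  # ranked results, as returned by server.py
```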

## Branches

## Releases
### Supported branches

* v0.2.0: Sep 2020
* v0.1.0: June 2020
* [`main`](https://github.com/stanford-futuredata/ColBERT/tree/main): Stable branch with ColBERTv2 + PLAID.
* [`colbertv1`](https://github.com/stanford-futuredata/ColBERT/tree/colbertv1): Legacy branch for ColBERTv1.

### Deprecated branches
* [`new_api`](https://github.com/stanford-futuredata/ColBERT/tree/new_api): Base ColBERTv2 implementation.
* [`cpu_inference`](https://github.com/stanford-futuredata/ColBERT/tree/cpu_inference): ColBERTv2 implementation with CPU search support.
* [`fast_search`](https://github.com/stanford-futuredata/ColBERT/tree/fast_search): ColBERTv2 implementation with PLAID.
* [`binarization`](https://github.com/stanford-futuredata/ColBERT/tree/binarization): ColBERT with a baseline binarization-based compression strategy (as opposed to ColBERTv2's residual compression, which we found to be more robust).