This guide contains instructions for running BM25 baselines on the MS MARCO passage ranking task, which is nearly identical to a similar guide in Anserini, except that everything is in Python here (no Java). Note that there is a separate guide for the MS MARCO document ranking task.
Setup Note: If you're instantiating an Ubuntu VM on your system or in the cloud (e.g., AWS or GCP), try to provision enough resources up front: more than 6 GB of RAM and roughly 100 GB of SSD storage (70-80 GB can also suffice for this task). This avoids having to go back and fix the machine configuration repeatedly. A configuration that works for Anserini on this task will work with Pyserini as well.
The guide requires the development installation for additional resources that are not shipped with the Python module; for the (more limited) runs that work directly from the Python module installed via pip, see this guide.
We're going to use collections/msmarco-passage/ as the working directory.
First, we need to download and extract the MS MARCO passage dataset:
mkdir collections/msmarco-passage
wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage
# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
To confirm, collectionandqueries.tar.gz should have an MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
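Since everything in this guide is in Python, the checksum can also be verified without external tools. The following is a minimal sketch using only the standard library; the helper name md5_of_file is hypothetical, and the path assumes the download location used above.

```python
import hashlib
from pathlib import Path

# Checksum quoted in this guide for collectionandqueries.tar.gz.
EXPECTED_MD5 = "31644046b18952c1386cd4564ba2ae69"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so the large archive never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("collections/msmarco-passage/collectionandqueries.tar.gz")
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an interrupted or corrupted download, in which case fetching the archive again is the simplest fix.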
Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).
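To make the conversion step concrete, here is a rough sketch of what such a script might do. It is not the actual convert_collection_to_jsonl.py: the function name, the shard file naming, and the assumption that each json object has an "id" field and a "contents" field are illustrative.

```python
import json
import os


def convert_collection(collection_path, output_folder, max_docs_per_file=1_000_000):
    """Split a (docid, text) TSV into JSONL shards of at most max_docs_per_file lines.

    Returns the number of shard files written.
    """
    os.makedirs(output_folder, exist_ok=True)
    out = None
    shard = 0
    with open(collection_path, encoding="utf-8") as tsv:
        for i, line in enumerate(tsv):
            doc_id, text = line.rstrip("\n").split("\t", 1)
            if i % max_docs_per_file == 0:
                # Start a new shard every max_docs_per_file documents.
                if out is not None:
                    out.close()
                # Hypothetical shard naming; the real script may differ.
                name = os.path.join(output_folder, f"docs{shard:02d}.json")
                out = open(name, "w", encoding="utf-8")
                shard += 1
            # Assumed schema: one json object per line with "id" and "contents".
            out.write(json.dumps({"id": doc_id, "contents": text}) + "\n")
    if out is not None:
        out.close()
    return shard
```

With 8,841,823 passages and 1M documents per shard, this scheme yields 8 full files plus one final file of 841,823 lines, which matches the counts stated above.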
We can now index these docs as a JsonCollection using Anserini:
python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 9 -input collections/msmarco-passage/collection_jsonl \
-index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw
Note that the indexing program simply dispatches command-line arguments to an underlying Java program, and so we use the Java single dash convention, e.g., -index and not --index.
Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.
The 6980 queries in the development set are already stored in the repo. Let's take a peek:
$ head tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
1048585 what is paula deen's brother
2 Androgen receptor define
524332 treating tension headaches without medication
1048642 what is paranoid sc
524447 treatment of varicose veins in legs
786674 what is prime rate in canada
1048876 who plays young dr mallard on ncis
1048917 what is operating system misconfiguration
786786 what is priority pass
524699 tricare service number
$ wc tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
6980 48335 290193 tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
Each line contains a tab-delimited (query id, query) pair. Conveniently, Pyserini already knows how to load and iterate through these pairs. We can now perform retrieval using these queries:
python -m pyserini.search --topics msmarco-passage-dev-subset \
--index indexes/lucene-index-msmarco-passage \
--output runs/run.msmarco-passage.bm25tuned.txt \
--bm25 --output-format msmarco --hits 1000 --k1 0.82 --b 0.68
Here, we set the BM25 parameters to k1=0.82, b=0.68 (tuned by grid search).
The option --output-format msmarco says to generate output in the MS MARCO output format.
The option --hits specifies the number of documents to return per query.
Thus, the output file should have approximately 6980 × 1000 = 6.98M lines.
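To give some intuition for what k1 and b control, here is a small sketch of the textbook BM25 score for a single query term. This is not the scoring code used by the underlying Lucene index, which may differ in details such as constant factors and how document lengths are stored; the function name and arguments are illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.82, b=0.68):
    """Textbook BM25 contribution of one query term to one document's score.

    tf: term frequency in the document
    doc_len / avg_doc_len: document length and average length in the collection
    n_docs: number of documents in the collection
    doc_freq: number of documents containing the term
    """
    # Rare terms get a larger inverse document frequency weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly longer-than-average documents are penalized.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

A document's score for a query is the sum of these contributions over the query terms. Setting b to 0 switches off length normalization entirely, while a larger k1 lets additional occurrences of a term keep increasing the score for longer.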
Retrieval speed will vary by hardware:
On a reasonably modern CPU with an SSD, we might get around 13 qps (queries per second), and so the entire run should finish in under ten minutes (using a single thread).
We can perform multi-threaded retrieval by using the --threads and --batch-size arguments.
For example, with --threads 16 --batch-size 64 on a CPU with sufficient cores, the entire run will finish in a couple of minutes.
After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:
$ python tools/scripts/msmarco/msmarco_passage_eval.py \
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25tuned.txt
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
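For readers unfamiliar with the metric, the following sketch shows how MRR@10 can be computed from a run in the MS MARCO output format (tab-separated query id, passage id, rank). It is not the official evaluation script; the function names are hypothetical, and averaging over all judged queries is an assumption about how the official score is normalized.

```python
from collections import defaultdict


def parse_msmarco_run(lines):
    """Parse 'qid<TAB>docid<TAB>rank' lines into {qid: [docid, ...]} ordered by rank."""
    by_query = defaultdict(list)
    for line in lines:
        qid, docid, rank = line.split()
        by_query[qid].append((int(rank), docid))
    return {qid: [d for _, d in sorted(pairs)] for qid, pairs in by_query.items()}


def mrr_at_k(run, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    run: {qid: [docid, ...]} in rank order
    relevant: {qid: set of relevant docids}
    Queries with no relevant passage in the top k contribute 0.
    """
    total = 0.0
    for qid, rel_docs in relevant.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in rel_docs:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    # Assumption: average over all judged queries, including ones missing from the run.
    return total / len(relevant) if relevant else 0.0
```

Because only the first relevant hit within the top 10 matters, MRR@10 rewards putting one correct passage near the top and ignores everything below rank 10.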
We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10.
For that we first need to convert the run file into TREC format:
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
--input runs/run.msmarco-passage.bm25tuned.txt --output runs/run.msmarco-passage.bm25tuned.trec
$ python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
--input tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt --output collections/msmarco-passage/qrels.dev.small.trec
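The run conversion is conceptually simple. A TREC run line has six whitespace-separated fields: query id, the literal Q0, document id, rank, score, and a run tag. The sketch below illustrates the idea; it is not the actual pyserini.eval.convert_msmarco_run_to_trec_run module, and both the 1/rank placeholder score and the default tag are assumptions made for illustration.

```python
def msmarco_run_to_trec(lines, tag="Anserini"):
    """Yield TREC-format run lines from 'qid<TAB>docid<TAB>rank' lines.

    The MS MARCO format carries no scores, so a placeholder is needed.
    Assumption: 1/rank is used because it decreases as rank increases,
    which preserves the original ordering when results are sorted by score.
    """
    for line in lines:
        qid, docid, rank = line.strip().split("\t")
        rank = int(rank)
        yield f"{qid} Q0 {docid} {rank} {1 / rank} {tag}"
```

The score column matters because trec_eval orders results by score rather than by the rank column, so any placeholder must be strictly decreasing in rank.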
And then run the trec_eval tool:
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec
map all 0.1957
recall_1000 all 0.8573
Average precision or AP (also called mean average precision, MAP) and recall@1000 (recall at rank 1000) are the two metrics we care about the most. AP captures aspects of both precision and recall in a single metric, and is the most common metric used by information retrieval researchers. On the other hand, recall@1000 provides an upper bound on the effectiveness of downstream reranking modules (i.e., rerankers are useless if there isn't a relevant document in the results).
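The two metrics can be sketched for a single query as follows. This is an illustrative implementation rather than the trec_eval code, and the function names are hypothetical; the values trec_eval reports are the means of these per-query numbers over all queries.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant docids."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def recall_at_k(ranking, relevant, k=1000):
    """Fraction of the relevant docids that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)
```

Note that reordering documents within the top k changes AP but leaves recall@k unchanged, which is why recall@1000 bounds what a reranker working on those 1000 hits can achieve.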
Reproduction Log*
- Results reproduced by @JeffreyCA on 2020-09-14 (commit 49fd7cb)
- Results reproduced by @jhuang265 on 2020-09-14 (commit 2ed2acc)
- Results reproduced by @Dahlia-Chehata on 2020-11-11 (commit 8172015)
- Results reproduced by @rakeeb123 on 2020-12-07 (commit 3bcd4e5)
- Results reproduced by @jrzhang12 on 2021-01-03 (commit 7caedfc)
- Results reproduced by @HEC2018 on 2021-01-04 (commit 46a6d47)
- Results reproduced by @KaiSun314 on 2021-01-08 (commit aeec31f)
- Results reproduced by @yemiliey on 2021-01-18 (commit 98f3236)
- Results reproduced by @larryli1999 on 2021-01-22 (commit 74a87e4)
- Results reproduced by @ArthurChen189 on 2021-04-08 (commit 7261223)
- Results reproduced by @printfCalvin on 2021-04-12 (commit 0801f7f)
- Results reproduced by @saileshnankani on 2021-04-26 (commit 6d48609)
- Results reproduced by @andrewyguo on 2021-04-30 (commit ecfed61)
- Results reproduced by @mayankanand007 on 2021-05-04 (commit a9d6f66)
- Results reproduced by @rootofallevii on 2021-05-14 (commit e764797)
- Results reproduced by @jpark621 on 2021-06-13 (commit f614111)
- Results reproduced by @nimasadri11 on 2021-06-28 (commit d31e2e6)
- Results reproduced by @mzzchy on 2021-07-05 (commit 45083f5)
- Results reproduced by @d1shs0ap on 2021-07-16 (commit a6b6545)
- Results reproduced by @apokali on 2021-08-19 (commit 45a2fb4)
- Results reproduced by @leungjch on 2021-09-12 (commit c71a69e)