Anserini: BM25 Baselines for MS MARCO Document Ranking

This page contains instructions for running BM25 baselines on the MS MARCO document ranking task. Note that there is a separate MS MARCO passage ranking task.

Data Prep

We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset:

mkdir collections/msmarco-doc

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc

# Alternative mirror:
# wget https://www.dropbox.com/s/w6caao3sfx9nluo/msmarco-docs.trec.gz -P collections/msmarco-doc

To confirm the download, msmarco-docs.trec.gz should have an MD5 checksum of d4863e4f342982b51b9a8fc668b2d0c0.
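The checksum comparison can be scripted rather than read off by eye. The following is a minimal sketch, assuming GNU coreutils `md5sum` is available (macOS ships `md5 -q` instead); `verify_md5` is a hypothetical helper name and not part of Anserini.

```shell
# Hypothetical helper (not part of Anserini): compare a file's MD5 digest
# against an expected hex string. Assumes GNU coreutils `md5sum` is on the PATH.
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(md5sum "$file" | cut -d ' ' -f 1)
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file has $actual, expected $expected" >&2
    return 1
  fi
}

# Only check the archive if it has actually been downloaded.
archive=collections/msmarco-doc/msmarco-docs.trec.gz
if [ -f "$archive" ]; then
  verify_md5 "$archive" d4863e4f342982b51b9a8fc668b2d0c0
fi
```

A non-zero exit status from `verify_md5` signals a corrupted or incomplete download, so the call can gate the indexing step in a script.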

There's no need to uncompress the file, as Anserini can directly index gzipped files. Build the index with the following command:

nohup sh target/appassembler/bin/IndexCollection -collection CleanTrecCollection \
 -generator DefaultLuceneDocumentGenerator -threads 1 -input collections/msmarco-doc \
 -index indexes/msmarco-doc/lucene-index.msmarco-doc.pos+docvectors+rawdocs \
 -storePositions -storeDocvectors -storeRaw >& logs/log.msmarco-doc.pos+docvectors+rawdocs &

On a modern desktop with an SSD, indexing takes around 40 minutes. There should be a total of 3,213,835 documents indexed.
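If the indexed document count looks off, one way to cross-check is to count documents in the compressed collection directly. This is only a sketch: `count_trec_docs` is a hypothetical helper, and it assumes each document carries exactly one `<DOCNO>` tag at the start of a line, which should be verified against the actual file.

```shell
# Hypothetical helper (not part of Anserini): count documents in a gzipped
# TREC-format file by counting lines that begin with a <DOCNO> tag.
# Assumption: one <DOCNO> tag per document, at the start of a line.
count_trec_docs() {
  gzip -dc "$1" | grep -c '^<DOCNO>'
}

# Only scan the archive if it is present; this reads the whole file,
# so it is a full decompression pass.
archive=collections/msmarco-doc/msmarco-docs.trec.gz
if [ -f "$archive" ]; then
  count_trec_docs "$archive"
fi
```

Under the stated assumption, the printed count should match the 3,213,835 documents reported by the indexer.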

Performing Retrieval on the Dev Queries

After indexing finishes, we can do a retrieval run. The dev queries are already stored in our repo:

target/appassembler/bin/SearchCollection -topicreader TsvInt \
 -index indexes/msmarco-doc/lucene-index.msmarco-doc.pos+docvectors+rawdocs \
 -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
 -output runs/run.msmarco-doc.dev.bm25.txt -bm25

On a modern desktop with an SSD, the run takes around 12 minutes.
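Before evaluating, it can help to sanity-check the run file. The sketch below assumes the output is in the standard six-column TREC run format (`qid Q0 docid rank score tag`); `summarize_run` is a hypothetical helper, not an Anserini tool.

```shell
# Hypothetical helper (not part of Anserini): report how many distinct topics
# a run file covers and the largest number of hits returned for any one topic.
# Assumes six whitespace-separated columns: qid Q0 docid rank score tag.
summarize_run() {
  awk '{ hits[$1]++ }
       END {
         max = 0; n = 0
         for (q in hits) { n++; if (hits[q] > max) max = hits[q] }
         printf "topics=%d max_hits_per_topic=%d\n", n, max
       }' "$1"
}

# Only summarize the run if it has been produced.
run=runs/run.msmarco-doc.dev.bm25.txt
if [ -f "$run" ]; then
  summarize_run "$run"
fi
```

A topic count lower than the number of lines in the topics file would suggest that some queries returned no hits or that the run was interrupted.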

Evaluating the Results

After the run completes, we can evaluate with trec_eval:

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2310
recall_1000           	all	0.8856

Let's compare to the baselines provided by Microsoft. First, download:

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P runs
gunzip runs/msmarco-docdev-top100.gz

Then, run trec_eval to compare. Note that to be fair, we restrict evaluation to top 100 hits per topic (which is what Microsoft provides):

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
map                   	all	0.2219

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map                   	all	0.2303

We see that "out of the box" Anserini is already better!
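To see what the `-M 100` cutoff does to the Anserini run, the run file can be truncated by hand. This is a rough sketch rather than a reimplementation of trec_eval: it assumes the six-column TREC run format with the rank in the fourth column and that ranks agree with score order. `top_k_per_topic` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Anserini or trec_eval): keep only lines
# whose rank (column 4) is at most k.
# Assumption: ranks start at 1 per topic and are consistent with the scores.
top_k_per_topic() {
  k="$1"
  run="$2"
  awk -v k="$k" '$4 <= k' "$run"
}

# Only truncate the run if it has been produced.
run=runs/run.msmarco-doc.dev.bm25.txt
if [ -f "$run" ]; then
  top_k_per_topic 100 "$run" > runs/run.msmarco-doc.dev.bm25.top100.txt
fi
```

The output file name here is illustrative. Evaluating the truncated file without `-M 100` should give the same AP as evaluating the full run with it, provided the rank assumption holds.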

BM25 Tuning

It is well known that BM25 parameter tuning is important. The above instructions use the Anserini (system-wide) default of k1=0.9, b=0.4.

Let's try to do better! We tuned BM25 using the queries found here: these are five different sets of 10k queries each, sampled from the training queries with the shuf command. For each set, we ran a grid search over k1 and b in increments of 0.1; we then averaged the best parameter values across all five sets, which has a regularizing effect. Here, we optimized for average precision (AP). The tuned parameters using this approach are k1=3.44, b=0.87.
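The sampling and averaging steps of this procedure can be sketched as follows. This is an illustration only: the helper names, file names, and the per-set values piped in at the end are all hypothetical, and the grid search itself is omitted. It assumes GNU `shuf` is available.

```shell
# Hypothetical helper: draw five random samples of n lines each from a query
# file, writing them to <prefix>.1.tsv ... <prefix>.5.tsv.
sample_sets() {
  queries="$1"
  n="$2"
  prefix="$3"
  for i in 1 2 3 4 5; do
    shuf -n "$n" "$queries" > "$prefix.$i.tsv"
  done
}

# Hypothetical helper: read "k1 b" pairs (one per tuning set) on stdin and
# print the mean of each parameter to two decimal places.
average_params() {
  awk '{ k1 += $1; b += $2; n++ }
       END { if (n > 0) printf "k1=%.2f b=%.2f\n", k1 / n, b / n }'
}

# Made-up per-set optima, purely to show the averaging step.
printf '1.0 0.5\n2.0 0.6\n3.0 0.7\n' | average_params
```

Because each per-set optimum lies on a grid of tenths, averaging is what produces values with two decimal places such as the tuned parameters reported above.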

To perform a run with these parameters, issue the following command:

target/appassembler/bin/SearchCollection -topicreader TsvInt \
 -index indexes/msmarco-doc/lucene-index.msmarco-doc.pos+docvectors+rawdocs \
 -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
 -output runs/run.msmarco-doc.dev.bm25.tuned.txt -bm25 -bm25.k1 3.44 -bm25.b 0.87

Here's the comparison between the Anserini default and tuned parameters:

| Setting                 | AP     | Recall@1000 |
|-------------------------|--------|-------------|
| Default (k1=0.9, b=0.4) | 0.2310 | 0.8856      |
| Tuned (k1=3.44, b=0.87) | 0.2788 | 0.9326      |

As expected, BM25 tuning makes a big difference!
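To put a number on the difference, the relative improvement can be computed from the figures in the comparison. This is a small sketch using awk for the arithmetic; `relative_gain` is a hypothetical helper name.

```shell
# Hypothetical helper: print the relative improvement of a tuned score over a
# baseline score, as a percentage with one decimal place.
relative_gain() {
  awk -v base="$1" -v tuned="$2" \
    'BEGIN { printf "%.1f\n", (tuned - base) / base * 100 }'
}

relative_gain 0.2310 0.2788   # AP: prints 20.7
relative_gain 0.8856 0.9326   # Recall@1000: prints 5.3
```

In other words, tuning improves AP by roughly 21% and Recall@1000 by roughly 5% relative to the defaults.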

Replication Log