Add regression for MS MARCO doc with per-doc docTTTTTquery expansions (castorini#1408)

Also fixed terminology: Retrieval -> Ranking
lintool authored Nov 12, 2020
1 parent 22c0ad3 commit 9a8e8b4
Showing 10 changed files with 257 additions and 89 deletions.
13 changes: 7 additions & 6 deletions README.md
@@ -64,10 +64,11 @@ For the most part, these runs are based on [_default_ parameter settings](https:
+ [Regressions for Complex Answer Retrieval v1.5 (CAR17)](docs/regressions-car17v1.5.md)
+ [Regressions for Complex Answer Retrieval v2.0 (CAR17)](docs/regressions-car17v2.0.md)
+ [Regressions for Complex Answer Retrieval v2.0 (CAR17) with doc2query expansion](docs/regressions-car17v2.0-doc2query.md)
+ [Regressions for the MS MARCO Passage Retrieval Task](docs/regressions-msmarco-passage.md)
+ [Regressions for the MS MARCO Passage Retrieval Task with doc2query expansion](docs/regressions-msmarco-passage-doc2query.md)
+ [Regressions for the MS MARCO Passage Retrieval Task with docTTTTTquery expansion](docs/regressions-msmarco-passage-docTTTTTquery.md)
+ [Regressions for the MS MARCO Document Retrieval](docs/regressions-msmarco-doc.md)
+ [Regressions for MS MARCO Passage Ranking](docs/regressions-msmarco-passage.md)
+ [Regressions for MS MARCO Passage Ranking with doc2query expansion](docs/regressions-msmarco-passage-doc2query.md)
+ [Regressions for MS MARCO Passage Ranking with docTTTTTquery expansion](docs/regressions-msmarco-passage-docTTTTTquery.md)
+ [Regressions for MS MARCO Document Ranking](docs/regressions-msmarco-doc.md)
+ [Regressions for MS MARCO Document Ranking with per-doc docTTTTTquery expansion](docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md)
+ [Regressions for the TREC 2019 Deep Learning Track (Passage Ranking Task)](docs/regressions-dl19-passage.md)
+ [Regressions for the TREC 2019 Deep Learning Track (Document Ranking Task)](docs/regressions-dl19-doc.md)
+ [Regressions for the TREC 2018 News Track (Background Linking Task)](docs/regressions-backgroundlinking18.md)
@@ -90,8 +91,8 @@ For the most part, manual copying and pasting of commands into a shell is requir
+ [Baselines for the TREC-COVID Challenge using doc2query](docs/experiments-covid-doc2query.md)
+ [Working with the 20 Newsgroups Dataset](docs/experiments-20newsgroups.md)
+ [Replicating "Neural Hype" Experiments](docs/experiments-forum2018.md)
+ [Guide to BM25 baselines for the MS MARCO Passage Retrieval Task](docs/experiments-msmarco-passage.md)
+ [Guide to BM25 baselines for the MS MARCO Document Retrieval Task](docs/experiments-msmarco-doc.md)
+ [Guide to BM25 baselines for the MS MARCO Passage Ranking Task](docs/experiments-msmarco-passage.md)
+ [Guide to BM25 baselines for the MS MARCO Document Ranking Task](docs/experiments-msmarco-doc.md)
+ [Guide to BM25 baselines for the FEVER Fact Verification Task](docs/experiments-fever.md)
+ [Guide to replicating doc2query results](docs/experiments-doc2query.md) (MS MARCO passage ranking and TREC-CAR)
+ [Guide to replicating docTTTTTquery results](docs/experiments-docTTTTTquery.md) (MS MARCO passage and document ranking)
2 changes: 1 addition & 1 deletion docs/experiments-msmarco-doc.md
@@ -1,4 +1,4 @@
# Anserini: BM25 Baselines for MS MARCO Doc Retrieval
# Anserini: BM25 Baselines for MS MARCO Document Ranking

This page contains instructions for running BM25 baselines on the [MS MARCO *document* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *passage* ranking task](experiments-msmarco-passage.md).
2 changes: 1 addition & 1 deletion docs/experiments-msmarco-passage.md
@@ -1,4 +1,4 @@
# Anserini: BM25 Baselines for MS MARCO Passage Retrieval
# Anserini: BM25 Baselines for MS MARCO Passage Ranking

This page contains instructions for running BM25 baselines on the [MS MARCO *passage* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *document* ranking task](experiments-msmarco-doc.md).
60 changes: 60 additions & 0 deletions docs/regressions-msmarco-doc-docTTTTTquery-per-doc.md
@@ -0,0 +1,60 @@
# Anserini: Regressions for MS MARCO Document Ranking

This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking) with per-document docTTTTTquery document expansion, which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to [this page](https://github.com/castorini/docTTTTTquery#Replicating-MS-MARCO-Document-Ranking-Results-with-Anserini).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery-per-doc.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-doc.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.

## Indexing

Typical indexing command:

```
nohup sh target/appassembler/bin/IndexCollection -collection JsonCollection \
-input /path/to/msmarco-doc-docTTTTTquery-per-doc \
-index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc.pos+docvectors+raw \
-generator DefaultLuceneDocumentGenerator \
-threads 1 -storePositions -storeDocvectors -storeRaw \
>& logs/log.msmarco-doc-docTTTTTquery-per-doc &
```

The directory `/path/to/msmarco-doc-docTTTTTquery-per-doc/` should be a directory containing the official document collection (a single file), in TREC format.

For additional details, see explanation of [common indexing options](common-indexing-options.md).

## Retrieval

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/).
The regression experiments here evaluate on the 5193 dev set questions.

After indexing has completed, you should be able to perform retrieval as follows:

```
nohup target/appassembler/bin/SearchCollection -index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-doc.pos+docvectors+raw \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.msmarco-doc.dev.txt \
-bm25 &
```

Evaluation can be performed using `trec_eval`:

```
tools/eval/trec_eval.9.0.4/trec_eval -m map -c -m recall.1000 -c src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc-docTTTTTquery-per-doc.bm25-default.topics.msmarco-doc.dev.txt
```
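For reference (not part of this commit), `trec_eval` emits whitespace-delimited rows of the form `metric  qid  value`, with the aggregate row under the pseudo-query-id `all`. A minimal sketch for pulling the summary metrics out of captured output; the sample string below is hypothetical:

```python
# Parse trec_eval output into a {metric: value} dict,
# keeping only the aggregate ("all") rows.
def parse_trec_eval(output: str) -> dict:
    metrics = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "all":
            metric, _, value = parts
            metrics[metric] = float(value)
    return metrics

# Hypothetical sample of trec_eval output for the two metrics used above:
sample = "map all 0.2886\nrecall_1000 all 0.9259"
print(parse_trec_eval(sample))  # → {'map': 0.2886, 'recall_1000': 0.9259}
```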

## Effectiveness

With the above commands, you should be able to replicate the following results:

MAP | BM25 (Default)|
:---------------------------------------|-----------|
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2886 |


R@1000 | BM25 (Default)|
:---------------------------------------|-----------|
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.9259 |

See [this page](https://github.com/castorini/docTTTTTquery#Replicating-MS-MARCO-Document-Ranking-Results-with-Anserini) for more details.
Note that here we are using `trec_eval` to evaluate the top 1000 hits for each query; beware, the runs provided by MS MARCO organizers for reranking have only 100 hits per query.
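To compare a full-depth run against the organizers' 100-hit reranking runs, the 1000-hit run can be truncated per query. A minimal sketch (not from this commit; file names hypothetical) over the standard six-column TREC run format:

```python
from collections import defaultdict

def truncate_run(in_path: str, out_path: str, k: int = 100) -> None:
    """Keep only the top-k hits per query in a TREC run file.

    TREC run lines: qid Q0 docid rank score tag
    Assumes hits for each query appear in rank order, as
    SearchCollection writes them.
    """
    kept = defaultdict(int)
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            qid = line.split()[0]
            if kept[qid] < k:
                fout.write(line)
                kept[qid] += 1

# Hypothetical usage:
# truncate_run("runs/run.full.txt", "runs/run.top100.txt", k=100)
```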
10 changes: 5 additions & 5 deletions docs/regressions-msmarco-doc.md
@@ -1,6 +1,6 @@
# Anserini: Regressions for [MS MARCO (Document)](https://github.com/microsoft/TREC-2019-Deep-Learning)
# Anserini: Regressions for MS MARCO Document Ranking

This page documents regression experiments for the MS MARCO Document Ranking Task, which is integrated into Anserini's regression testing framework.
This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework.
For more complete instructions on how to run end-to-end experiments, refer to [this page](experiments-msmarco-doc.md).

The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc.yaml).
@@ -26,7 +26,7 @@ For additional details, see explanation of [common indexing options](common-inde
## Retrieval

Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/).
The regression experiments here evaluate on the 5193 dev set questions; see [this page](experiments-msmarco-doc.md) for more details.
The regression experiments here evaluate on the 5193 dev set questions.

After indexing has completed, you should be able to perform retrieval as follows:

@@ -98,12 +98,12 @@ With the above commands, you should be able to replicate the following results:

MAP | BM25 (Default)| +RM3 | +Ax | +PRF | BM25 (Tuned)| +RM3 | +Ax | +PRF |
:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/TREC-2019-Deep-Learning)| 0.2310 | 0.1632 | 0.1147 | 0.1357 | 0.2788 | 0.2289 | 0.1895 | 0.1559 |
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.2310 | 0.1632 | 0.1147 | 0.1357 | 0.2788 | 0.2289 | 0.1895 | 0.1559 |


R@1000 | BM25 (Default)| +RM3 | +Ax | +PRF | BM25 (Tuned)| +RM3 | +Ax | +PRF |
:---------------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/TREC-2019-Deep-Learning)| 0.8856 | 0.8785 | 0.8369 | 0.8471 | 0.9326 | 0.9320 | 0.9264 | 0.8758 |
[MS MARCO Document Ranking: Dev Queries](https://github.com/microsoft/MSMARCO-Document-Ranking)| 0.8856 | 0.8785 | 0.8369 | 0.8471 | 0.9326 | 0.9320 | 0.9264 | 0.8758 |

The setting "default" refers to the default BM25 settings of `k1=0.9`, `b=0.4`, while "tuned" refers to the tuned setting of `k1=3.44`, `b=0.87`.
See [this page](experiments-msmarco-doc.md) for more details.
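The `k1` and `b` parameters above control term-frequency saturation and document-length normalization in BM25 scoring. As an illustration (not from this commit), a minimal sketch of the per-term score in the Lucene-style formulation that Anserini builds on:

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=0.9, b=0.4):
    """Per-term BM25 contribution, Lucene-style idf.

    tf: term frequency in the document; df: document frequency of the term;
    doc_len / avg_doc_len: this document's length vs. the collection average.
    """
    # idf as used by Lucene's BM25Similarity
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # k1 caps tf saturation; b scales length normalization
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)
```

With the tuned setting (`k1=3.44`, `b=0.87`), term frequency saturates more slowly and length normalization is stronger than under the defaults.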
