Linking to Anserini "FakeWords" Issue #4

JMMackenzie · 2023-03-22T04:54:55Z

Hi all,

I just saw your paper this morning, great work! I had a quick look here at the repo and noticed that you have looked into a method to better deal with weighted documents than the Anserini "Fake Words" method.

They have an issue open on this: castorini/anserini#1890

It would be awesome if you could make a PR there, as this is a pain point for indexing huge collections.

We also ran into a super weird corner case bug in the past relating to the jsonvector method, see: castorini/anserini#1843

Anyway, just wanted to point it out because it would be nice to contribute it back (I realise you probably intended to do this anyway, but I may as well mention it while it's on my mind).

Cheers!

thongnt99 · 2023-03-22T09:07:35Z

Hi @JMMackenzie,
We plan to merge this back into Anserini. The implementation is ready in this repo; however, some tests fail because of the new changes. I will create a pull request soon and discuss how to fix them or create new tests.

JMMackenzie · 2023-03-23T07:04:03Z

Awesome, glad to hear it! Did you happen to capture how much time overhead the fakewords method was adding?

thongnt99 · 2023-03-23T10:22:43Z

I don't have a systematic comparison across all the models, but we observed that indexing time (e.g., for EPIC with topk=400) dropped from 1 hour to under 15 minutes on our machine.
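
For readers unfamiliar with the overhead being discussed, here is a minimal sketch of the general idea, assuming the "fake words" approach encodes a quantized term weight by repeating the term that many times so that a standard indexer counts it as term frequency. The function names and the sample document are hypothetical, and this is not Anserini's actual (Java) code, only an illustration of why the expanded input grows with the sum of the weights while a direct approach touches each term once.

```python
from typing import Dict, List, Tuple


def fake_words_document(weights: Dict[str, int]) -> str:
    """Hypothetical helper: expand quantized weights into a pseudo-document.

    Each term is repeated `weight` times, so the text the indexer has to
    tokenize grows with the sum of all weights.
    """
    tokens: List[str] = []
    for term, weight in weights.items():
        tokens.extend([term] * weight)
    return " ".join(tokens)


def direct_postings(weights: Dict[str, int]) -> List[Tuple[str, int]]:
    """Hypothetical helper: keep one (term, weight) pair per term.

    This stands in for any method that stores the weight directly
    instead of materializing repeated tokens.
    """
    return sorted(weights.items())


# Made-up example document with already-quantized integer weights.
doc = {"anserini": 3, "index": 2, "sparse": 1}

expanded = fake_words_document(doc)
postings = direct_postings(doc)

print(len(expanded.split()), "tokens after expansion")
print(len(postings), "entries when weights are stored directly")
```

With realistic quantized weights (often in the hundreds per term) the expanded text can be far larger than the number of distinct terms, which is consistent with the kind of speed-up reported above, though the exact cause in Anserini is an assumption here.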

thongnt99 · 2023-03-23T14:46:07Z

I created a pull request here: castorini/anserini#2080

thongnt99 added the "question" label (Further information is requested) · 2023-04-24
