Linking to Anserini "FakeWords" Issue #4

Open
JMMackenzie opened this issue Mar 22, 2023 · 4 comments
Labels
question Further information is requested

Comments

@JMMackenzie

Hi all,

I just saw your paper this morning, great work! I had a quick look here at the repo and noticed that you have looked into a method to better deal with weighted documents than the Anserini "Fake Words" method.

They have an issue open on this: castorini/anserini#1890

It would be awesome if you could make a PR there, as this is a pain point for indexing huge collections.

We also ran into a super weird corner case bug in the past relating to the jsonvector method, see: castorini/anserini#1843

Anyway, just wanted to point it out because it would be nice to contribute it back (I realise you probably intended to do this anyway, but I may as well mention it while it's on my mind).

Cheers!

@thongnt99
Owner

Hi @JMMackenzie,
We plan to merge this back into Anserini. The implementation is ready in this repo; however, some tests fail because of the new changes. I will create a pull request soon and discuss how to fix the existing tests or create new ones.

@JMMackenzie
Author

Awesome, glad to hear it! Did you happen to capture how much time overhead the fakewords method was adding?

@thongnt99
Owner

I don't have a systematic comparison for all the models, but we observed that the indexing time (e.g., for EPIC with topk=400) dropped from 1 hour to under 15 minutes on our machine.

@thongnt99
Owner

I created a pull request here: castorini/anserini#2080

@thongnt99 added the "question" label (Further information is requested) on Apr 24, 2023