Bulk indexing giving intermittent 400s (due to 100ms timeout?) #788
Comments
I appreciate the very detailed report. I'm almost certain this goes back to when I originally wrote the bulk insert: I struggled with how to bubble errors back up to the top and return them with the request while not leaving a ton of open channels and threads floating around from a bad request. I'm almost certain an error is happening during the channel passing. I noticed immediately that there are unescaped newlines in your JSON. Can you humor me and try escaping the input so I can rule that out? As for the slow post times, those definitely seem wrong too. Is the index particularly large when you commit? Or is this an empty index?
I don't have time to keep tinkering with this right now; I was using spare time on a Friday night just to check it out. The index was empty, and it got locked up pretty soon after I started bulk inserting. Based on what you're saying, it seems like a bug with bulk insert. I'll let you know if I get around to poking at this again.
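A minimal sketch of the error-bubbling pattern the first comment describes: the spawned indexing task reports its outcome back to the request handler over a oneshot channel, so failures reach the HTTP response instead of getting lost with the task. This is illustrative tokio Rust, not Toshi's actual code; `index_batch` and the status mapping are hypothetical:

```rust
// Illustrative only: propagate a spawned task's Result back to the
// handler instead of dropping it on the floor.
use tokio::sync::oneshot;

// Hypothetical stand-in for the real indexing work.
async fn index_batch(docs: Vec<String>) -> Result<(), String> {
    if docs.is_empty() {
        return Err("empty batch".to_string());
    }
    Ok(())
}

#[tokio::main]
async fn main() {
    let (tx, rx) = oneshot::channel::<Result<(), String>>();

    tokio::spawn(async move {
        let result = index_batch(vec![r#"{"title": "a"}"#.to_string()]).await;
        // If the receiver was dropped there is nowhere to report to;
        // silently ignoring that send is exactly how errors can vanish.
        let _ = tx.send(result);
    });

    // The handler awaits the outcome and can return a real error body
    // instead of an opaque 400.
    match rx.await {
        Ok(Ok(())) => println!("201 Created"),
        Ok(Err(e)) => println!("400 Bad Request: {e}"),
        Err(_) => println!("500: indexing task dropped its sender"),
    }
}
```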
Describe the bug
I'm trying to use the _bulk endpoint. I read the tests in the code and understand that it wants line-by-line JSON as the request body. I got it all working, and I'm trying to index Wikipedia articles that look like this:
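Roughly, newline-delimited documents of this shape, one complete JSON object per line (the field names here are illustrative, and any newlines inside string values have to be escaped as `\n`, which is relevant to the first comment above):

```
{"title": "Albert Einstein", "body": "Albert Einstein was a theoretical physicist who developed the theory of relativity..."}
{"title": "Alan Turing", "body": "Alan Turing was an English mathematician and computer scientist..."}
```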
I regularly (but not always) get 400s with no helpful response body, and I don't see a panic or any logs in Toshi's stdout. From looking at the bulk_insert handler, the failure probably comes from the index_documents call at the bottom. It seems that within index_document there is some sort of 100ms timeout. I've seen the error happen right after a slight hiccup in my script's output, so I'm wondering if a delay within Toshi is causing the 400s.
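To make the suspected failure mode concrete, here is a minimal sketch, assuming documents are handed to the indexer over a bounded channel and each send gets a roughly 100ms budget. This is a guess at the mechanism, not Toshi's actual code, and the crossbeam-channel usage is illustrative:

```rust
// Illustrative only: a consumer hiccup plus a short send timeout on a
// bounded channel can become an error the handler might map straight
// to an opaque 400. Names and numbers are assumptions, not Toshi internals.
use std::thread;
use std::time::Duration;

use crossbeam_channel::bounded;

fn main() {
    // Small buffer: once it fills, senders must wait on the consumer.
    let (tx, rx) = bounded::<String>(2);

    // A consumer that stalls briefly, e.g. while committing.
    thread::spawn(move || {
        for _doc in rx.iter() {
            thread::sleep(Duration::from_millis(250)); // slow indexing
        }
    });

    for i in 0..10 {
        // A 100ms budget, like the timeout suspected in index_document:
        // any consumer delay longer than that becomes a send error.
        match tx.send_timeout(format!("{{\"id\": {i}}}"), Duration::from_millis(100)) {
            Ok(()) => println!("doc {i}: queued"),
            Err(e) => println!("doc {i}: would surface as 400 ({e})"),
        }
    }
}
```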
It seems that with a small batch size, e.g. 5 or 10 records, the timeout is less likely, but I'm trying to insert 5 million documents, so I want to use a batch size of 100 or 1,000 and flush at the end (see the sketch under "To Reproduce" below).
Any ideas?
Thanks for sharing this project; it's really, really cool!
To Reproduce
Steps to reproduce the behavior:
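A hedged reconstruction of the client side, based on the description above: batch documents into NDJSON and POST each batch to the _bulk endpoint, watching for intermittent 400s as batch size grows. In this sketch the index name `wikipedia`, port 8080, and the document fields are assumptions:

```rust
// Hypothetical reproduction client (requires reqwest with the
// "blocking" feature). Endpoint, index name, and fields are guesses.
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let docs: Vec<String> = (0..1_000)
        .map(|i| format!("{{\"title\": \"article {i}\", \"body\": \"...\"}}"))
        .collect();

    // Batch size of 100: per the report, 5-10 records mostly work,
    // while batches of 100-1,000 trigger intermittent 400s.
    for batch in docs.chunks(100) {
        let body = batch.join("\n");
        let start = Instant::now();
        let resp = client
            .post("http://localhost:8080/wikipedia/_bulk")
            .body(body)
            .send()?;
        println!("{} in {:?}", resp.status(), start.elapsed());
    }
    Ok(())
}
```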
Expected behavior
201 Created