Increasing memory usage when using replace_all_objects #557

Closed
AugPro opened this issue Jun 7, 2023 · 1 comment

AugPro commented Jun 7, 2023

Hello, I have a memory issue when using replace_all_objects. When using this function with a large number of documents (5 million), I pass an iterator to minimize memory consumption.
I expect memory usage to stay flat during the operation; however, it keeps increasing (see the image below).
[image: memory usage graph, climbing steadily over the course of the operation]
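
For context, my call looks roughly like the following (the index name and data source are placeholders; in reality the documents are streamed from a database):

from algoliasearch.search_client import SearchClient

client = SearchClient.create("APP_ID", "API_KEY")
index = client.init_index("my_index")

def generate_objects():
    # Placeholder for a streaming source (e.g. a DB cursor): only one
    # object is materialized at a time.
    for i in range(5_000_000):
        yield {"objectID": str(i), "title": "Document %d" % i}

# Passing a generator rather than a list keeps the *input* out of memory;
# the growth shown above comes from the accumulated responses.
index.replace_all_objects(generate_objects())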

Upon investigation, it looks like the cause of this growth is the method SearchIndex._chunk, and more specifically the raw_responses list, which stores the response of every request sent:

def _chunk(self, action, objects, request_options, validate_object_id=True):
    # type: (str, Union[List[dict], Iterator[dict]], Optional[Union[dict, RequestOptions]], bool) -> IndexingResponse  # noqa: E501
    raw_responses = []
    batch = []
    batch_size = self._config.batch_size
    for obj in objects:
        batch.append(obj)
        if len(batch) == batch_size:
            if validate_object_id:
                assert_object_id(batch)
            requests = build_raw_response_batch(action, batch)
            raw_responses.append(self._raw_batch(requests, request_options))
            batch = []

    if len(batch):
        if validate_object_id:
            assert_object_id(batch)
        requests = build_raw_response_batch(action, batch)
        raw_responses.append(self._raw_batch(requests, request_options))

    return IndexingResponse(self, raw_responses)

This is a problem because the response of /1/indexes/{indexName}/batch contains the full list of objectIDs:

{
  "taskID": 792,
  "objectIDs": ["6891", "6892"]
}

With 5M documents, each with an objectID of ~15 characters (64 bytes per string in CPython, including object overhead), this accounts for roughly 300 MB:

>>> sys.getsizeof("123456789012345") * 5_000_000 / (1024**2)
305.17578125

Is there a request_option for the API not to return objectIDs, or for the code not to store them in raw_responses?
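
If not, a stopgap I'm considering (purely illustrative, relying on v2 internals, and assuming IndexingResponse.wait() only reads the taskID of each stored response) would be to wrap _raw_batch so that only the taskID is retained:

from algoliasearch.search_index import SearchIndex

_original_raw_batch = SearchIndex._raw_batch

def _slim_raw_batch(self, requests, request_options):
    response = _original_raw_batch(self, requests, request_options)
    # Keep only the taskID (needed to wait on the task) and drop the
    # objectIDs list, which dominates memory at 5M documents.
    return {"taskID": response["taskID"]}

SearchIndex._raw_batch = _slim_raw_batch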

Thank you 🙏

shortcuts (Member) commented

Hey there, we completely rewrote the implementation in v4. If by any chance you still use the client, could you let us know whether this issue still exists? Thanks, and feel free to re-open the issue!
