Increasing memory usage when using replace_all_objects #557

Closed
AugPro opened this issue Jun 7, 2023 · 1 comment

AugPro commented Jun 7, 2023

Hello, I have a memory issue when using replace_all_objects. When using this function with a large number of documents (5 million), I pass an iterator to minimize memory consumption.
I expect memory usage to stay flat during the operation; however, it keeps increasing (see the image below).
[image: memory usage graph, climbing steadily over the course of the operation]
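
For context, my call looks roughly like the following (the index name and data source are placeholders; in reality the documents are streamed from a database):

from algoliasearch.search_client import SearchClient

client = SearchClient.create("APP_ID", "API_KEY")
index = client.init_index("my_index")

def generate_objects():
    # Placeholder for a streaming source (e.g. a DB cursor): only one
    # object is materialized at a time.
    for i in range(5_000_000):
        yield {"objectID": str(i), "title": "Document %d" % i}

# Passing a generator rather than a list keeps the *input* out of memory;
# the growth shown above comes from the accumulated responses.
index.replace_all_objects(generate_objects())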

Upon investigation, it looks like the cause of this growth is the method SearchIndex._chunk, and more specifically the raw_responses list, which stores the response of every request sent:

def _chunk(self, action, objects, request_options, validate_object_id=True):
    # type: (str, Union[List[dict], Iterator[dict]], Optional[Union[dict, RequestOptions]], bool) -> IndexingResponse  # noqa: E501
    raw_responses = []
    batch = []
    batch_size = self._config.batch_size
    for obj in objects:
        batch.append(obj)
        if len(batch) == batch_size:
            if validate_object_id:
                assert_object_id(batch)
            requests = build_raw_response_batch(action, batch)
            raw_responses.append(self._raw_batch(requests, request_options))
            batch = []

    if len(batch):
        if validate_object_id:
            assert_object_id(batch)
        requests = build_raw_response_batch(action, batch)
        raw_responses.append(self._raw_batch(requests, request_options))

    return IndexingResponse(self, raw_responses)

This is a problem because the response of /1/indexes/{indexName}/batch contains the full list of objectIDs:

{
  "taskID": 792,
  "objectIDs": ["6891", "6892"]
}

With 5M documents, each with an objectID of ~15 characters (64 bytes per string in CPython, including object overhead), this accounts for roughly 300 MB:

>>> sys.getsizeof("123456789012345") * 5_000_000 / (1024**2)
305.17578125

Is there a request_option for the API not to return objectIDs, or for the code not to store them in raw_responses?
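
If not, a stopgap I'm considering (purely illustrative, relying on v2 internals, and assuming IndexingResponse.wait() only reads the taskID of each stored response) would be to wrap _raw_batch so that only the taskID is retained:

from algoliasearch.search_index import SearchIndex

_original_raw_batch = SearchIndex._raw_batch

def _slim_raw_batch(self, requests, request_options):
    response = _original_raw_batch(self, requests, request_options)
    # Keep only the taskID (needed to wait on the task) and drop the
    # objectIDs list, which dominates memory at 5M documents.
    return {"taskID": response["taskID"]}

SearchIndex._raw_batch = _slim_raw_batch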

Thank you 🙏

shortcuts (Member) commented

Hey there, we completely rewrote the implementation in v4. If by any chance you still use the client, could you let us know whether this issue still exists? Thanks, and feel free to re-open the issue!
