[BUG]: Poor performance when uploading small batches to Milvus in VDB Upload example #1667

Open
2 tasks done
mdemoret-nv opened this issue Apr 24, 2024 · 0 comments · May be fixed by #1694
Labels
bug Something isn't working
Version

24.03

Which installation method(s) does this occur on?

Docker, Conda, Source

Describe the bug.

When using the Milvus service to write to a vector database, performance drops with small batch sizes or infrequent writes. This is because the service tries to reindex the database after each message, or after a set time has elapsed (hard-coded to 3 seconds). This is inefficient for two reasons:

  • When uploading infrequently, if the reindex takes ~3 seconds, you can get into a loop: add 1 message -> reindex (3 sec) -> add 1 message -> reindex (3 sec). Messages back up and the service cannot keep up.
  • When uploading frequently with large batch sizes, reindexing can be triggered by the number of rows. This causes the same problem, because building the index can take longer than the interval at which new data is generated.
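
To make the first case concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which every insert blocks for one full reindex; the function name `backlog_after` and the model itself are illustrative only, not taken from the Morpheus code. The 100 ms arrival interval mirrors the `time.sleep(0.1)` in the reproducer, and the 3000 ms figure is the reindex time quoted above.

```python
def backlog_after(duration_ms, arrival_interval_ms, reindex_ms):
    """Estimate messages still waiting after `duration_ms`.

    Simplified model (an assumption): one message arrives every
    `arrival_interval_ms`, and each processed message costs one
    full reindex of `reindex_ms`.
    """
    arrived = duration_ms // arrival_interval_ms
    processed = duration_ms // reindex_ms
    return max(0, arrived - processed)


# One minute of messages arriving every 100 ms, with a 3 s reindex per message
print(backlog_after(60_000, 100, 3_000))  # 580
```

Under this model the backlog grows without bound whenever the reindex time exceeds the arrival interval.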

Ideally, we would update the index using something similar to a debounce, so that reindexing only occurs after some set time during which no messages have been added.
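
A minimal sketch of the debounce idea, using only the Python standard library. The class `DebouncedTrigger` and its method names are hypothetical and do not exist in Morpheus; the callback stands in for whatever call actually rebuilds the index.

```python
import threading


class DebouncedTrigger:
    """Run `callback` once, `quiet_period_s` seconds after the latest `notify()`."""

    def __init__(self, callback, quiet_period_s):
        self._callback = callback
        self._quiet_period_s = quiet_period_s
        self._lock = threading.Lock()
        self._timer = None

    def notify(self):
        # Each new message cancels any pending timer and starts a fresh one,
        # so the callback only fires once the input has been quiet long enough.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._quiet_period_s, self._callback)
            self._timer.daemon = True
            self._timer.start()

    def cancel(self):
        # Drop a pending callback, for example on shutdown.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

In this sketch, the insert path would call `notify()` after every write, and the reindex would run only after the configured quiet period. A real implementation would probably also want a maximum delay, so that a continuous stream of writes cannot postpone reindexing forever.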

Minimum reproducible example

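import random
import time

import cudf

# Import path as of Morpheus 24.03 (assumed; adjust if it differs in your version)
from morpheus.service.vdb.milvus_vector_db_service import MilvusVectorDBService

# Placeholder values (assumptions): adjust to match your environment
milvus_server_uri = "http://localhost:19530"
collection_name = "test_collection"
num_input_rows = 5

milvus_service = MilvusVectorDBService(uri=milvus_server_uri)

# Create the collection
...

# Make a small dataframe with 5 rows
df = cudf.DataFrame({
    "id": list(range(num_input_rows)),
    "age": [random.randint(20, 40) for _ in range(num_input_rows)],
    "embedding": [[random.random() for _ in range(3)] for _ in range(num_input_rows)]
})

# Add the rows to the collection in a loop
for _ in range(10000):
    milvus_service.insert_dataframe(collection_name, df)

    # Sleep some amount to allow the data to be inserted (this may need to be tweaked to trigger the bug)
    time.sleep(0.1)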
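
One way to see whether the sleep value is triggering the slowdown is to record how long each insert takes. The helper below is a generic sketch; `time_calls` is a hypothetical name, not part of Morpheus, and it works with any zero-argument callable.

```python
import time


def time_calls(fn, repeats, pause_s=0.0):
    """Call `fn()` `repeats` times; return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
        # Pause between calls, mirroring the sleep in the loop above
        time.sleep(pause_s)
    return durations
```

For example, `time_calls(lambda: milvus_service.insert_dataframe(collection_name, df), 100, pause_s=0.1)` would collect 100 insert latencies. Durations that cluster around the reindex time would be consistent with the behavior described above.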

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@mdemoret-nv added the bug (Something isn't working) label on Apr 24, 2024
@cwharris linked a pull request on May 9, 2024 that will close this issue
Projects
Status: Todo