
Refactored vertical_slash_index.cu for performance improvement #72

Open · wants to merge 1 commit into base: main
Conversation

@alvi75 commented Sep 9, 2024

[Attached screenshot: profiling report, 2024-09-09]

What does this PR do?

This PR refactors the CUDA kernel in vertical_slash_index.cu to improve the performance of vertical slash indexing for large context sizes in MInference. The changes focus on making block and column indexing operations more efficient during inference.

The refactoring led to a roughly 6-second improvement in execution time, as shown in the attached profiling reports. The optimization was achieved by reusing calculations within the CUDA kernel and improving its memory access patterns.
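To illustrate the kind of change involved, here is a minimal sketch (hypothetical kernel and variable names, not the actual diff from vertical_slash_index.cu) of reusing a per-thread value instead of recomputing it, and reading global memory through a register:

```cuda
// Sketch only: hypothetical kernel, not the PR's actual code.
// Each thread loads col_idx[i] into a register once and reuses the
// derived block number for both outputs, instead of recomputing the
// division and re-reading global memory on every use.
__global__ void build_block_index(const int* __restrict__ col_idx,
                                  int* __restrict__ block_idx,
                                  int n_cols, int block_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid; i < n_cols; i += stride) {
        int c = col_idx[i];       // one global load, kept in a register
        int b = c / block_size;   // computed once, reused twice below
        block_idx[2 * i]     = b;                   // block id
        block_idx[2 * i + 1] = c - b * block_size;  // offset within block
    }
}
```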

We tested the refactored code with 100K tokens on NVIDIA A30 GPUs (24 GB of memory each), which limited the scale of our tests compared with the original 1M-token results.

The profiling results (before and after refactor) are attached, showing a reduction in total inference time from 268.128 seconds to 262.092 seconds.

Performance Results

Below are the benchmark results comparing the original and refactored versions of MInference on NVIDIA A30 GPUs for 10K, 50K, and 100K tokens:

| GPU | Version | 10K Tokens (s) | 50K Tokens (s) | 100K Tokens (s) |
|--------------|--------------------------|-------|-------|-------|
| A30 (2 GPUs) | MInference | 3.86 | 12.82 | 26.02 |
| A30 (2 GPUs) | MInference (Refactored) | 3.796 | 12.88 | 25.98 |

We ask the maintainers to rebuild and test the refactored version on an A100 GPU with the original settings, especially for larger token counts such as 1M, since we were limited by the memory capacity of the A30 GPUs.


Fixes #(issue number, if applicable)

Motivation

  • Performance improvement: The motivation behind this refactor is to optimize the CUDA kernel in vertical_slash_index.cu for better inference performance. Our changes focus on minimizing redundant calculations and improving memory access patterns during inference (see the coalescing sketch after this list).

  • Context size handling: The refactor improves the handling of large context sizes, particularly when working with limited GPU memory on A30 devices. This may benefit applications that need to process large-scale input data efficiently.
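
As a generic illustration of the memory-access point above (a standard coalescing example, not taken from the PR's kernel):

```cuda
// Sketch (hypothetical example): coalesced global memory access.
// Each thread sums one column of a row-major matrix. Within a loop
// iteration, consecutive threads read adjacent addresses, so a warp's
// loads combine into a single memory transaction. Assigning threads to
// rows instead would make consecutive threads read addresses `cols`
// floats apart (strided, uncoalesced).
__global__ void column_sums(const float* __restrict__ in,
                            float* __restrict__ out,
                            int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;

    float acc = 0.0f;
    for (int row = 0; row < rows; ++row)
        acc += in[row * cols + col];  // adjacent threads -> adjacent floats
    out[col] = acc;
}
```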


Dependencies

  • No new dependencies were introduced by this PR.

Before submitting:

  • This PR fixes a performance bottleneck in the inference pipeline.
  • This was discussed/approved via a GitHub issue.
  • The documentation was updated where necessary.
  • Tests were performed using 100K tokens on NVIDIA A30 GPUs.
  • Additional tests are recommended for larger context sizes (e.g., 1M tokens) and on different hardware (e.g., A100 GPUs).

Who can review?

Tagging relevant maintainers for this PR review:


Attached Images:

  • Profiling 1: Before refactor (268.128 seconds)
  • Profiling 2: After refactor (262.092 seconds)


Additional Notes

While the refactor has improved performance, we observed that numerous cudaStreamSynchronize calls remain in the current pipeline. These synchronization points could be analyzed and potentially reduced to avoid unnecessary host-side blocking and improve overall throughput. We recommend reviewing them to explore further optimizations; a sketch of one possible direction follows.
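One possible direction, sketched under the assumption that results are only consumed at the end of a sequence of launches (hypothetical kernel `step`, not part of this PR):

```cuda
#include <cuda_runtime.h>

// Hypothetical pipeline step, for illustration only.
__global__ void step(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Instead of calling cudaStreamSynchronize after every launch, enqueue
// all work plus an async copy, record one event, and wait only at the
// point of use. Assumes h_out is pinned (cudaMallocHost) so the
// device-to-host copy can actually overlap with host work.
void run_pipeline(float* d_data, float* h_out, int n) {
    cudaStream_t stream;
    cudaEvent_t done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    // Enqueue several steps back-to-back; no host sync between them.
    for (int k = 0; k < 4; ++k)
        step<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    cudaMemcpyAsync(h_out, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    // Host can do unrelated work here instead of blocking immediately.

    cudaEventSynchronize(done);  // wait once, when the result is needed

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
}
```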

@alvi75 (Author) commented Sep 9, 2024

@microsoft-github-policy-service agree company="William & Mary"

@iofu728 added the feature label on Sep 9, 2024
@Starmys (Contributor) commented Oct 9, 2024

Thank you for your contribution! Could you provide your CUDA version so we can check whether this is related to compiler optimization?
