Refactored vertical_slash_index.cu for performance improvement #72
What does this PR do?
This PR refactors the CUDA kernel in vertical_slash_index.cu to optimize the performance of vertical slash indexing for large context sizes in MInference. The changes focus on improving the efficiency of block and column indexing operations during inference.

The refactoring reduced total execution time by about 6 seconds, as shown in the attached profiling report images. The optimization was achieved by reusing calculations within the CUDA kernel and improving memory access patterns.
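To illustrate the two techniques at a high level, here is a minimal, self-contained sketch (all kernel and variable names are hypothetical and not taken from vertical_slash_index.cu): a per-row base offset is computed once instead of being re-derived on every loop iteration, and consecutive threads stride over contiguous columns so that global-memory loads coalesce within a warp:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: counts, per row, how many column indices exceed a
// threshold. Launch with one block per row and 256 threads per block, e.g.
//   count_columns_above_threshold<<<num_rows, 256>>>(col_index, out, num_cols, t);
__global__ void count_columns_above_threshold(const int* __restrict__ col_index,
                                              int* __restrict__ out_counts,
                                              int num_cols, int threshold) {
    // Reused calculation: the row base offset is computed once per block
    // instead of recomputing row * num_cols on every loop iteration.
    const int row_base = blockIdx.x * num_cols;

    int local_count = 0;
    // Consecutive threads read consecutive addresses, so loads from
    // global memory are coalesced within each warp.
    for (int c = threadIdx.x; c < num_cols; c += blockDim.x) {
        if (col_index[row_base + c] >= threshold) {
            ++local_count;
        }
    }

    // Standard shared-memory tree reduction across the 256-thread block.
    __shared__ int partial[256];
    partial[threadIdx.x] = local_count;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        out_counts[blockIdx.x] = partial[0];
    }
}
```

The actual kernel changes in this PR apply the same two ideas to the vertical and slash index computations; the sketch above is only meant to show the pattern, not the real code.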
We tested the refactored code with 100K tokens on NVIDIA A30 GPUs (24GB memory each), which limited the scale of our tests compared with the original 1M-token results. We encourage the project maintainers to rebuild and validate the changes with larger token counts and on other GPU models such as the A100.
The profiling results (before and after the refactor) are attached, showing a reduction in total inference time from 268.128 seconds to 262.092 seconds.
Performance Results
Below are the benchmark results comparing the original and refactored versions of MInference on NVIDIA A30 GPUs for 10K, 50K, and 100K tokens (see the attached image):

We request that you test the refactored version on an A100 GPU using your original settings, especially for larger token counts such as 1M tokens, since we were limited by the memory capacity of the A30 GPUs.
Fixes #(issue number, if applicable)
Motivation
Performance improvement: The motivation behind this refactor is to optimize the CUDA kernel in vertical_slash_index.cu for better inference performance. Our changes focus on minimizing redundant calculations and improving memory access patterns during inference.

Context size handling: The refactor also improves the handling of large context sizes, particularly when working with the limited GPU memory of A30 devices. This may benefit applications that need to process large-scale input data efficiently.
Dependencies
Before submitting:
Who can review?
Tagging relevant maintainers for this PR review:
Attached Image:
The attached image shows the performance benchmarks discussed above; we reiterate our request for testing on A100 GPUs with larger contexts.
Additional Notes
While the refactor has improved performance, we observed that numerous cudaStreamSynchronize points remain in the current pipeline. These synchronization points could be analyzed further and potentially optimized to reduce unnecessary blocking and improve the overall throughput of the system. We recommend reviewing these points to explore further optimizations, along the lines sketched below.
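As one possible direction (a hedged sketch only; the stage kernels below are illustrative placeholders, not real MInference kernels): stages that are currently separated by a blocking synchronize could, where the data dependencies allow it, be enqueued on the same CUDA stream. Work within a single stream executes in launch order, so the host needs to block only once at the end:

```cuda
#include <cuda_runtime.h>

// Illustrative placeholder stages, not actual MInference kernels.
__global__ void stage_a(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

__global__ void stage_b(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void run_pipeline(float* d_buf, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Kernels on the same stream execute in launch order, so no
    // cudaStreamSynchronize is needed between the two stages.
    stage_a<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    stage_b<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);

    // A single blocking point replaces a synchronize after every stage.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```

Whether each existing synchronize can actually be removed depends on host-side consumers of the intermediate results, so this would need to be validated case by case against the real pipeline.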