
Conversation

@a-hirota

Description

This PR implements a range-based locking mechanism for kvikio to enable truly parallel non-overlapping writes to files. This significantly
improves performance for multi-GPU workloads.

Key Changes

  • Added RangeLockManager class for managing non-overlapping range locks
  • Added FileHandleWithRangeLock extending FileHandle with range lock support
  • Comprehensive test coverage in both C++ and Python

Performance Benefits

  • Non-overlapping writes can execute in parallel
  • Reduces contention for multi-GPU file I/O operations
  • Maintains data integrity by serializing only overlapping ranges
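The serialization rule can be sketched in Python. This is a simplified, hypothetical model of the behavior the PR describes (the actual RangeLockManager is C++ and its API may differ); the overlap test for half-open ranges [s, e) is the standard one:

```python
import threading
from contextlib import contextmanager

class RangeLockManager:
    """Serialize only overlapping [start, end) ranges; disjoint ranges proceed in parallel."""

    def __init__(self):
        self._held = set()  # currently locked (start, end) ranges
        self._cv = threading.Condition()

    @staticmethod
    def _overlaps(a, b):
        # Half-open ranges [s, e) overlap iff each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    @contextmanager
    def lock_range(self, start, end):
        rng = (start, end)
        with self._cv:
            # Block until no currently held range overlaps the requested one.
            while any(self._overlaps(rng, h) for h in self._held):
                self._cv.wait()
            self._held.add(rng)
        try:
            yield
        finally:
            with self._cv:
                self._held.discard(rng)
                self._cv.notify_all()
```

With this model, two writers locking [0, n) and [n, 2n) hold their locks simultaneously, while a writer overlapping either range waits until the conflicting lock is released.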

Usage Example

C++ Usage

#include <kvikio/file_handle_rangelock.hpp>

// Create a file handle with range lock support
kvikio::FileHandleWithRangeLock file("output.bin", "w+");

// Multiple threads/GPUs can write to non-overlapping regions in parallel
std::vector<std::thread> threads;

// GPU 0 writes to first half of file
threads.emplace_back([&]() {
    void* gpu0_data = ...;  // GPU 0 data
    auto future = file.pwrite_rangelock(gpu0_data, size, 0);
    future.get();
});

// GPU 1 writes to second half - executes in parallel!
threads.emplace_back([&]() {
    void* gpu1_data = ...;  // GPU 1 data
    auto future = file.pwrite_rangelock(gpu1_data, size, size);
    future.get();
});

for (auto& t : threads) {
    t.join();
}

Python Usage (Future API)

  import kvikio
  import cupy as cp
  from concurrent.futures import ThreadPoolExecutor

  # When Python bindings are added:
  def write_partition(gpu_id, file_handle, offset, data):
      with cp.cuda.Device(gpu_id):
          # This would execute in parallel for non-overlapping regions
          file_handle.pwrite_rangelock(data, file_offset=offset)

  # Multiple GPUs writing to different file regions
  with kvikio.FileHandleWithRangeLock("output.bin", "w+") as f:
      with ThreadPoolExecutor() as executor:
          futures = []
          for gpu_id in range(num_gpus):
              offset = gpu_id * partition_size
              futures.append(
                  executor.submit(write_partition, gpu_id, f, offset, data[gpu_id])
              )
          # All non-overlapping writes execute in parallel
          for future in futures:
              future.result()

Testing

  • Added C++ tests in cpp/tests/test_range_lock.cpp
  • Added Python tests in python/kvikio/tests/test_range_lock.py
  • Tests cover:
    • Non-overlapping parallel writes
    • Overlapping range serialization
    • Move semantics
    • Performance benchmarks

Performance Impact

In our tests with 2 GPUs writing to non-overlapping regions:

  • Without range lock: Serial execution (one GPU waits for the other)
  • With range lock: Parallel execution (both GPUs write simultaneously)
  • Expected speedup: Near-linear with number of GPUs for non-overlapping writes
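The claimed effect can be illustrated with a toy Python benchmark that simulates I/O with sleeps: a whole-file mutex serializes the two writers, while one lock per region lets disjoint regions proceed concurrently. Names and timings here are illustrative, not kvikio's API.

```python
import threading
import time

def simulated_write(duration=0.2):
    """Stand-in for a GPU-to-file write of one partition."""
    time.sleep(duration)

def run_with(lock_for_region):
    """Two writers target disjoint regions; each holds lock_for_region(region)."""
    def writer(region):
        with lock_for_region(region):
            simulated_write()
    threads = [threading.Thread(target=writer, args=(r,))
               for r in [(0, 100), (100, 200)]]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

# Whole-file lock: every write serializes (~0.4 s for two writers).
whole_file = threading.Lock()
serial = run_with(lambda region: whole_file)

# One lock per region: disjoint writers never contend (~0.2 s).
region_locks = {(0, 100): threading.Lock(), (100, 200): threading.Lock()}
parallel = run_with(lambda region: region_locks[region])

print(f"whole-file lock: {serial:.2f}s, per-region locks: {parallel:.2f}s")
```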

Implement range-based locking mechanism to enable truly parallel
non-overlapping writes to files. This improves performance for
multi-GPU workloads by allowing concurrent writes to different
file regions while serializing only overlapping operations.

Key changes:
- Add RangeLockManager class for managing non-overlapping range locks
- Add FileHandleWithRangeLock extending FileHandle with range lock support
- Add comprehensive C++ and Python tests for range lock functionality
- Support move semantics for efficient lock transfer

Performance benefits:
- Non-overlapping writes can execute in parallel
- Reduces contention for multi-GPU file I/O operations
- Maintains data integrity by serializing overlapping ranges
Copilot AI review requested due to automatic review settings September 30, 2025 12:36
@a-hirota a-hirota requested review from a team as code owners September 30, 2025 12:36
@copy-pr-bot

copy-pr-bot bot commented Sep 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copilot AI left a comment

Pull Request Overview

This PR introduces a range-based locking mechanism for kvikio to enable parallel file I/O operations on non-overlapping file regions. The implementation allows multiple threads or GPUs to write to different sections of a file simultaneously while serializing only overlapping operations.

Key Changes

  • Added RangeLockManager class for managing overlapping range detection and locking
  • Added FileHandleWithRangeLock extending FileHandle with range-aware parallel I/O operations
  • Comprehensive test coverage in both C++ and Python to validate parallel execution and data integrity

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • cpp/include/kvikio/range_lock.hpp: Core range lock manager with overlap detection and RAII lock semantics
  • cpp/include/kvikio/file_handle_rangelock.hpp: Extended file handle with range-locked read/write operations
  • cpp/tests/test_range_lock.cpp: C++ unit tests for range locking functionality and performance
  • python/kvikio/tests/test_range_lock.py: Python integration tests for concurrent file operations
  • cpp/tests/CMakeLists.txt: Build configuration for range lock tests


Comment on lines +41 to +46
return std::async(std::launch::deferred, [future = std::move(future),
lock = std::move(range_lock)]() mutable {
auto result = future.get();
// Lock will be automatically released when this lambda exits
return result;
});
Copilot AI Sep 30, 2025

Using std::launch::deferred prevents the operation from running in parallel. The lambda will only execute when .get() is called, defeating the purpose of parallel I/O. Consider using std::launch::async or a different approach to maintain parallelism while ensuring proper lock cleanup.
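The distinction the review draws can be illustrated in Python, using concurrent.futures as a rough analogue (an analogy, not kvikio code): a task submitted to a ThreadPoolExecutor starts running as soon as it is submitted, like std::launch::async, whereas a bare callable — the analogue of std::launch::deferred — does no work until it is explicitly invoked.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

started = threading.Event()

def work():
    started.set()
    return 42

# "Deferred" style: wrapping the work in a callable runs nothing yet.
deferred = lambda: work()
assert not started.is_set()  # nothing has executed

# "Async" style: submission alone starts execution on a pool thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(work)
    assert started.wait(timeout=5)  # work ran without anyone calling result()

assert future.result() == 42
assert deferred() == 42  # the deferred version runs only now, on invocation
```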

Comment on lines +65 to +66
return std::async(std::launch::deferred, [future = std::move(future),
lock = std::move(range_lock)]() mutable {
Copilot AI Sep 30, 2025

Same deferred execution issue as in pwrite_rangelock. This prevents the read operation from running in parallel, contradicting the parallel I/O goals of the feature.

Suggested change

  - return std::async(std::launch::deferred, [future = std::move(future),
  -                                           lock = std::move(range_lock)]() mutable {
  + return std::async(std::launch::async, [future = std::move(future),
  +                                        lock = std::move(range_lock)]() mutable {

@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Sep 30, 2025
@madsbk
Member

madsbk commented Sep 30, 2025

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Sep 30, 2025

/ok to test

@madsbk, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@madsbk
Member

madsbk commented Sep 30, 2025

/ok to test 68ff040

bool sync_default_stream = true) {

// Acquire range lock for this write
auto range_lock = range_lock_manager_.lock_range(file_offset, file_offset + size);
Contributor

KvikIO has transitioned from a header-only library to a shared library. So all the implementations should go to corresponding .cpp files.

* Copyright (c) 2025, NVIDIA CORPORATION.
*
* Modified FileHandle with range-based locking support
*/
Contributor

*
* This version acquires a lock only for the specific range being written,
* allowing non-overlapping writes to proceed in parallel.
*/
Contributor

Function parameters and return values should be commented in Doxygen format.


namespace kvikio {

class FileHandleWithRangeLock : public FileHandle {
Contributor

I think the class needs to be commented in detail too, to explain the purpose of this class and also key implementation details. For example, when multiple write requests contend on a common range, what would the behavior be?

@madsbk
Member

madsbk commented Sep 30, 2025

@a-hirota, thanks for contributing! When you say locking, do you mean locking via this API? We’re not doing any file-based locking here, right?

Comment on lines +41 to +45
return std::async(std::launch::deferred, [future = std::move(future),
lock = std::move(range_lock)]() mutable {
auto result = future.get();
// Lock will be automatically released when this lambda exits
return result;
Contributor

For parallel I/O, the step to wait for the chunked tasks is performed in the thread pool (https://github.com/rapidsai/kvikio/blob/branch-25.12/cpp/include/kvikio/parallel_operation.hpp#L175-L184). Would it be possible that this unlock step here follows suit and gets pushed to the thread pool's task queue as well?
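A sketch of the suggested shape in Python, with concurrent.futures standing in for kvikio's thread pool (hypothetical names; the real change would live in the C++ implementation): the wait-then-unlock continuation is itself submitted as a pool task, so the caller never has to call .get() just to release the lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
lock = threading.Lock()  # stand-in for one acquired range lock

def io_task():
    """Stand-in for the chunked write performed by the pool."""
    return 123

def pwrite_with_pooled_unlock():
    lock.acquire()                # range lock held before the I/O is queued
    inner = pool.submit(io_task)  # the I/O itself runs on the pool

    def wait_and_unlock():
        try:
            return inner.result()  # wait for the I/O on a pool thread...
        finally:
            lock.release()         # ...and release the lock there too

    # The unlock step is a pool task, so the write completes and the
    # lock is released even if the caller never waits on the result.
    return pool.submit(wait_and_unlock)

fut = pwrite_with_pooled_unlock()
assert fut.result() == 123
assert not lock.locked()  # released by the pool, not the caller
pool.shutdown(wait=True)
```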

@kingcrimsontianyu
Contributor

Thanks for submitting the PR for review.

For completely new features, it is usually good practice to file an issue first explaining the need, motivation, and objective, before diving into implementation details. I think it would be of great help if you could elaborate on the use case of this range locking feature. Are you developing a database system based on KvikIO? You mentioned that this PR significantly improves performance for multi-GPU workloads. Are there any benchmark results to share?
