Improve documentation of cuda::barrier (#440)
Co-authored-by: gonzalobg <[email protected]>
ahendriksen and gonzalobg authored Sep 18, 2023
1 parent 6be81eb commit d066a57
Showing 1 changed file with 20 additions and 19 deletions.
39 changes: 20 additions & 19 deletions libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
@@ -16,14 +16,29 @@ class cuda::barrier;

The class template `cuda::barrier` is an extended form of [`cuda::std::barrier`]
that takes an additional [`cuda::thread_scope`] argument.

## Barrier Operations

If `!(scope == thread_scope_block && __isShared(this))`, then the semantics are
the same as [`cuda::std::barrier`]; otherwise, see below.

The `cuda::barrier` class template extends `cuda::std::barrier` with the following additional operations:

| [`cuda::barrier::init`] | Initialize a `cuda::barrier`. `(friend function)` |
| [`cuda::device::barrier_native_handle`] | Get the native handle to a `cuda::barrier`. `(function)` |
| [`cuda::device::barrier_arrive_tx`] | Arrive on a `cuda::barrier<cuda::thread_scope_block>` with transaction count update. `(function)` |
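
As a brief, hedged usage sketch of the first two operations (the kernel name is illustrative and not taken from this page): a block-scoped barrier placed in shared memory is initialized by a single thread with the friend function `init` before any thread uses it, and `cuda::device::barrier_native_handle` exposes the underlying PTX mbarrier object of such a barrier.

```cuda
#include <cuda/barrier>

__global__ void example_kernel() {
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x); // cuda::barrier::init (friend function)
  }
  __syncthreads(); // make the initialized barrier visible to the whole block

  // Pointer to the underlying PTX mbarrier object; only meaningful for a
  // block-scoped barrier residing in shared memory.
  auto handle = cuda::device::barrier_native_handle(bar);
  (void)handle; // e.g. pass to inline-PTX mbarrier instructions

  bar.arrive_and_wait(); // interface inherited from cuda::std::barrier
}
```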


If `scope == thread_scope_block && __isShared(this)`, then the
semantics of [[thread.barrier.class]](http://eel.is/c++draft/thread.barrier.class) of ISO/IEC
IS 14882 (the C++ Standard) are modified as follows:

> A barrier is a thread coordination mechanism whose lifetime consists of a sequence of barrier phases, where each phase allows at most an expected number of threads to block until the expected number of threads **and the expected number of transaction-based asynchronous operations** arrive at the barrier.
> Each _barrier phase_ consists of the following steps:
>
> 1. The _expected count_ is decremented by each call to `arrive`, `arrive_and_drop`**, or `cuda::device::barrier_arrive_tx`**.
> 2. **The _transaction count_ is incremented by each call to `cuda::device::barrier_arrive_tx` and decremented by the completion of transaction-based asynchronous operations such as `cuda::memcpy_async_tx`.**
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, **`cuda::device::barrier_arrive_tx`**, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, and the next phase starts.
>
> Concurrent invocations of the member functions of barrier **and the non-member barrier APIs in `cuda::device`**, other than its destructor, do not introduce data races. The member functions `arrive` and `arrive_and_drop`, **and the non-member function `cuda::device::barrier_arrive_tx`**, execute atomically.
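
To illustrate the modified phase-completion rule, here is a hedged sketch, assuming a device of compute capability 9.0 or higher and a CUDA toolkit that exposes the `_tx` APIs; the kernel name, buffer sizes, and the global source array are illustrative and not part of this commit. One thread issues a transaction-based copy and arrives with the number of bytes it expects, the remaining threads arrive normally, and the phase completes only once both the expected count and the transaction count have reached zero.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility> // cuda::std::move

// Illustrative global source buffer (16-byte aligned for the copy below).
__device__ alignas(16) int gmem_x[1024];

__global__ void example_kernel() {
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;

  // The destination buffer and the barrier both live in shared memory.
  __shared__ alignas(16) int smem_x[1024];
  __shared__ barrier_t bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
  }
  __syncthreads();

  barrier_t::arrival_token token;
  if (threadIdx.x == 0) {
    // Completion of this copy decrements the transaction count by the
    // number of bytes delivered.
    cuda::device::memcpy_async_tx(smem_x, gmem_x,
                                  cuda::aligned_size_t<16>(sizeof(smem_x)), bar);
    // Arrive once and raise the expected transaction count by sizeof(smem_x).
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem_x));
  } else {
    token = bar.arrive();
  }
  // Unblocks only after every thread has arrived and the copy has delivered
  // sizeof(smem_x) bytes, i.e. both counts of the current phase reached zero.
  bar.wait(cuda::std::move(token));

  smem_x[threadIdx.x] += 1; // smem_x now holds the contents of gmem_x
}
```

Note that `cuda::device::barrier_arrive_tx` and transaction-based copies are only supported on devices of compute capability 9.0 and higher.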

## NVCC `__shared__` Initialization Warnings

@@ -82,20 +97,6 @@ asynchronous operations, the tx-count of the `cuda::barrier` will be updated and
thus progress the `cuda::barrier` towards the completion of the current phase.
This may complete the current phase.

## Implementation-Defined Behavior

For each [`cuda::thread_scope`] `S` and `CompletionFunction` `F`, the value of
