Is your feature request related to a problem? Please describe.
Decoupled look-back is the foundation of most CUB algorithms. Its tile state consists of a message and a flag. Because the message is treated as opaque, we have to double the size of the tile state just to store a few flag bits: a U32 message is stored in a U64 tile state, and a U64 message is stored in a `ulonglong2`. When the message exceeds 64 bits, we can no longer combine the message and the flag into a single architectural word and have to allocate the flags in separate memory, which forces us to use memory fences and significantly degrades performance.
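To make the size doubling concrete, here is a minimal host-side sketch of the U32 case, where the message and the flag share one 64-bit word. The names `TileStatus`, `pack_tile`, `tile_message`, and `tile_status` are hypothetical and chosen for illustration; they are not CUB's actual identifiers or layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status values for illustration; not CUB's actual enum.
enum class TileStatus : std::uint32_t { Invalid = 0, Partial = 1, Inclusive = 2 };

// Pack a 32-bit message (low half) and its status (high half) into one
// 64-bit word, so both can be published with a single store.
inline std::uint64_t pack_tile(std::uint32_t message, TileStatus status) {
  return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(status)) << 32) | message;
}

// Recover the message from the low 32 bits.
inline std::uint32_t tile_message(std::uint64_t word) {
  return static_cast<std::uint32_t>(word);
}

// Recover the status from the high 32 bits.
inline TileStatus tile_status(std::uint64_t word) {
  return static_cast<TileStatus>(static_cast<std::uint32_t>(word >> 32));
}
```

Only a couple of the upper 32 bits carry information here, which is the overhead this request is about.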
User types often contain unused bits. For instance, if the user type is a pair of U8 and U64, it has padding bits that we could reuse for storing the flag. Similarly, the high bits of 64-bit integers are frequently unused, for example when these integers represent sizes or offsets.
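The padding in the U8/U64 pair can be checked with a short sketch. `KeyAndOffset` is a hypothetical type used only for illustration, and the exact amount of padding depends on the ABI (on common 64-bit platforms the struct occupies 16 bytes, leaving 7 bytes unused).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical user type: a pair of U8 and U64.
struct KeyAndOffset {
  std::uint8_t key;
  std::uint64_t offset;
};

// Bytes that actually carry user data.
constexpr std::size_t payload_bytes = sizeof(std::uint8_t) + sizeof(std::uint64_t);

// Bytes the compiler inserts for alignment; these are candidates for a flag.
constexpr std::size_t padding_bytes() { return sizeof(KeyAndOffset) - payload_bytes; }

static_assert(padding_bytes() > 0, "expected alignment padding in KeyAndOffset");
```

Reusing those bytes in practice would presumably require the layout to be made explicit, since plain padding is not guaranteed to survive copies.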
Describe the solution you'd like
We should implement an intrusive version of decoupled look-back. Its API would let us store the decoupled look-back flag inside the message itself. CUB needs this API to optimize algorithms like partition and select for 64-bit problem sizes, but it would also be useful to users.
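As a rough illustration of what "intrusive" could mean for 64-bit offsets, the host-side sketch below keeps the flag in the top two bits of the value. Everything here is an assumption made for illustration: the names `IntrusiveTileState`, `publish`, and `load`, the choice of two flag bits, and the use of `std::atomic` in place of device-side loads and stores. It is not a proposed CUB API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical status values; zero means "nothing published yet".
enum class TileStatus : std::uint64_t { Invalid = 0, Partial = 1, Inclusive = 2 };

// Assumption: offsets fit in 62 bits, so the top 2 bits are free for the flag.
constexpr int status_shift = 62;
constexpr std::uint64_t value_mask = (std::uint64_t{1} << status_shift) - 1;

struct IntrusiveTileState {
  std::atomic<std::uint64_t> word{0};

  // Publish value and status together with one 64-bit store.
  void publish(std::uint64_t value, TileStatus status) {
    assert(value <= value_mask);  // the value must not overlap the flag bits
    word.store((static_cast<std::uint64_t>(status) << status_shift) | value,
               std::memory_order_relaxed);
  }

  // Read value and status together with one 64-bit load.
  TileStatus load(std::uint64_t& value) const {
    const std::uint64_t w = word.load(std::memory_order_relaxed);
    value = w & value_mask;
    return static_cast<TileStatus>(w >> status_shift);
  }
};
```

Because the flag and the value travel in the same word, a reader can never observe one without the other, which is why this sketch needs no separate flag array and no fence between two stores.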
Area
CUB
Describe alternatives you've considered
No response
Additional context
No response