Description
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request?
Medium
Please provide a clear description of the problem this feature solves
Some algorithms may access the same tile more than once within a CUDA block without knowing it is the same tile. The same tile should not be loaded twice, nor stored twice. An optional mechanism with a bounded, dedicated smem cache size could help with these redundancy issues.
For example, if I'm developing an open-world video game where a player looks around and sees the world, it needs the tiles around the player (assuming a 2D world map). When computing things for the player, tile accesses could be optimized either by the developer caching actively or automatically by cuTile. Because, why not? If it's multiplayer, then 8 players could be in the same cluster and use multicasting too (assuming cloud gaming on a B200 GPU).
Feature Description
Read caching, write caching, and maybe automatic cluster-based multicasting.
Describe your ideal solution
LRU, LFU, direct-mapped, or even multiple layers (block-level L1 -> cluster-level L2 -> TMA); anything with an eviction policy works.
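To illustrate the requested semantics, here is a minimal host-side Python sketch of an LRU tile cache with a hit/miss counter. The `TileCache` class, its `load_tile` callback, and the tile keys are all hypothetical names for illustration only; this is not cuTile API, and a real implementation would hold tiles in shared memory under a fixed byte budget rather than in an `OrderedDict`.

```python
from collections import OrderedDict

class TileCache:
    """Illustrative LRU tile cache (hypothetical; not cuTile API)."""

    def __init__(self, capacity, load_tile):
        self.capacity = capacity      # max number of cached tiles
        self.load_tile = load_tile    # fallback loader (e.g. the actual TMA copy)
        self.entries = OrderedDict()  # tile_index -> tile, ordered by recency
        self.hits = 0
        self.misses = 0

    def get(self, tile_index):
        if tile_index in self.entries:
            # Hit: mark as most recently used, skip the redundant load.
            self.entries.move_to_end(tile_index)
            self.hits += 1
            return self.entries[tile_index]
        # Miss: load once, then evict the least recently used entry if full.
        self.misses += 1
        tile = self.load_tile(tile_index)
        self.entries[tile_index] = tile
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return tile

# Repeated accesses to tile 0 trigger only one load; tile 1 is evicted
# once capacity (2) is exceeded by tile 2.
cache = TileCache(2, lambda i: ("tile", i))
cache.get(0); cache.get(1); cache.get(0); cache.get(2)
```

The same counter-based structure would make it easy to compare eviction policies (LRU vs. LFU vs. direct-mapped) on a given kernel's tile-access trace before committing smem to any one of them.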
Describe any alternatives you have considered
I searched Google for "cuda TMA cache" but found no relevant results.
Additional context
Maybe the Blackwell architecture's tensor memory could be used as a scratchpad for this instead of shared memory?
Contributing Guidelines
- I agree to follow cuTile Python's contributing guidelines
- I have searched the open feature requests and have found no duplicates for this feature request