Modifying the tiled DGEMM kernel code in gh-146 as shown below can lead to a segfault. While I realize the C++ Kokkos docs do advise checking the size of the shared memory caches before allocating them, this isn't really a Pythonic experience, so we may need some kind of (arguably default-on) mode for auto-querying the size of the relevant cache (e.g., L1) and refusing to compile a kernel whose scratch request exceeds it.
The argument in favor of default-on is similar to that for Cython--you need to explicitly opt out of helpful guardrails like bounds checking to get the full-blown performance (i.e., you develop with the guardrails on, then deploy to production/releases with, e.g., decorators that disable them).
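To make the idea a bit more concrete, here is a rough sketch of what a default-on guardrail with an explicit opt-out might look like; everything in it (`query_scratch_budget`, `scratch_guardrail`, and the hard-coded budget) is hypothetical illustration rather than existing pykokkos API:

```python
import functools

def query_scratch_budget(level: int = 0) -> int:
    # Hypothetical: a real implementation would ask the active backend
    # (e.g., via something like Kokkos' scratch_size_max) instead of
    # hard-coding an illustrative 48 KiB figure here.
    return 48 * 1024

def scratch_guardrail(requested_bytes: int, enabled: bool = True):
    """Default-on check; pass enabled=False in a release build to opt out,
    analogous to disabling bounds checking in Cython."""
    def decorator(launch_kernel):
        @functools.wraps(launch_kernel)
        def wrapper(*args, **kwargs):
            budget = query_scratch_budget()
            if enabled and requested_bytes > budget:
                raise MemoryError(
                    f"scratch request of {requested_bytes} B exceeds the "
                    f"{budget} B budget; refusing to compile/launch"
                )
            return launch_kernel(*args, **kwargs)
        return wrapper
    return decorator
```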
--- a/pykokkos/linalg/workunits.py
+++ b/pykokkos/linalg/workunits.py
@@ -46,7 +46,7 @@ def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
     global_tid: int = team_member.league_rank() * team_member.team_size() + team_member.team_rank()
     # TODO: I have no idea how to get 2D scratch memory views?
-    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
+    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size * 100000)
     scratch_mem_b: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
     # in a 4 x 4 matrix with 2 x 2 tiling the leagues
     # and teams have matching row/col assignment approaches
I wonder if the CI segfault we see over in the matching PR is related to some kind of prohibition on using L1 cache in the virtual machine or something??
Yeah, L1 is on-chip memory, so there will not be much of it. Using it has to be coordinated with how many things you launch in parallel, the hardware, etc., and if you use it you will basically reduce the number of registers available per thread.
Unfortunately it will not work to just put an arbitrary tile size in the ScratchView. This is probably the deepest layer of tweaking available in Kokkos.
Nevertheless, we could try to enforce a maximum via:
static int scratch_size_max(int level);
Returns: the maximum total scratch size in bytes, for the given level. Note: If a kernel performs team-level reductions or scan operations, not all of this memory will be available for dynamic user requests. Some of that maximal scratch size is being used for internal operations. The actual size of these internal allocations depends on the value type used in the reduction or scan.
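As a rough back-of-the-envelope illustration on the Python side (the 48 KiB figure and the `fits_in_scratch` helper are assumptions for the sketch; in practice the limit would come from the `scratch_size_max` query quoted above):

```python
# Illustrative check for the kernel in the diff above, which allocates
# two level-0 ScratchView1D[float] buffers. All numbers here are assumed.

DOUBLE_BYTES = 8                # each float element maps to a C double
SCRATCH_SIZE_MAX = 48 * 1024    # assumed result of scratch_size_max(0); query it in practice

def fits_in_scratch(tile_size: int, inflate: int = 1) -> bool:
    # scratch_mem_a is inflated by the patch (tile_size * 100000); scratch_mem_b is not
    requested = (tile_size * inflate + tile_size) * DOUBLE_BYTES
    return requested <= SCRATCH_SIZE_MAX

print(fits_in_scratch(tile_size=2))                   # True: 32 B is fine
print(fits_in_scratch(tile_size=2, inflate=100000))   # False: ~1.6 MB, far past any plausible on-chip limit
```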