Contributor (Author)
Ok, I see now you've used grid_constant in the later kernels! :)
Owner
Yeah, this is a very initial prototype. Maybe we can skip this one; later we can use up to 3D TMA. 5D makes it easy to copy/paste for anything in the future. I'm fine with removing the __syncthreads, but this is mostly a basic prototype.
A few modifications:
The __syncthreads after the mbarrier wait is superfluous (it doesn't really affect performance, but it's unnecessary bloat). Also, importantly, I think the way the tensor map is currently passed is incorrect (and it's likely one of those bugs that shows up 0.01% of the time).
The docs describe 3 ways to pass a tensor map to the kernel; see this chapter. You've currently chosen to copy the tensor map to global memory (using cudaMemcpy). From the docs, verbatim:
"Finally, it is possible to copy the tensor map to global memory. Using a pointer to a tensor map in global device memory requires a fence in each thread block before any thread in the block uses the updated tensor map. Further uses of the tensor map by that thread block do not need to be fenced unless the tensor map is modified again. Note that this mechanism may be slower than the two mechanisms described above."
But the current code does not include this fence.
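For reference, a minimal sketch of what that fence looks like, following the pattern in the CUDA docs (the helper name is mine; the PTX instruction and the 128-byte CUtensorMap size are from the documentation):

```cuda
#include <cuda.h>  // CUtensorMap

// Hypothetical helper: acquire-fence on a tensor map residing in global
// memory. Every thread block must execute this before any of its threads
// uses the tensor map written via cudaMemcpy; further uses need no fence
// unless the tensor map is modified again.
__device__ inline void tensormap_fence_acquire(const CUtensorMap* tensor_map) {
    // A CUtensorMap is 128 bytes; the fence makes writes done through the
    // generic proxy visible to the tensormap (TMA) proxy.
    asm volatile(
        "fence.proxy.tensormap::generic.acquire.gpu [%0], 128;"
        :
        : "l"(tensor_map)
        : "memory");
}
```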
I tested this approach and indeed it slows down the kernel by ~5 TFLOP/s. I also tested the grid_constant approach (which is the recommended one) and it's pretty much on par in terms of speed, but I didn't want to push the change until I see whether you're even accepting PRs. :P Also, I might be misinterpreting something here.
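For comparison, the grid_constant approach amounts to passing the tensor map by value as a kernel parameter, so there is no global-memory copy and no per-block fence at all. A sketch (kernel name and body are illustrative):

```cuda
#include <cuda.h>  // CUtensorMap

// Recommended alternative: the tensor map is passed by value and annotated
// __grid_constant__, so the driver places it where TMA can consume it
// directly; &tensor_map can then be handed to the TMA bulk-copy operations.
__global__ void tma_kernel(const __grid_constant__ CUtensorMap tensor_map) {
    // ... issue cp.async.bulk.tensor copies using &tensor_map ...
}
```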
thanks Pranjal!