I have wanted to improve `nvrtcc` for quite some time. In particular, I wanted to add support for linking device libraries so that we can test CUDA device runtime functionality.

I started by improving the current implementation, but then I had an idea: could we just use `nvcc` and hijack the compilation at the right moment to invoke `nvrtc` instead of `nvcc`'s device compilation?

It turns out this works! When compiling CUDA device code, `nvcc` uses the host compiler to preprocess the source files and passes the result to `cicc`, which compiles the CUDA code and generates PTX output.

The idea is to give `nvcc` a custom `cicc` binary that overwrites the generated PTX file with the PTX compiled by `nvrtc`.

This approach has many advantages, because we don't need to rewrite the whole compilation pipeline. On top of that, we only overwrite the PTX and still use all of the other functionality provided by `cicc`.

So we can simply call all of the kernels from host code (!!!), because `nvcc` does all of the necessary linking and other magic for us. That means we can use `nvrtcc` with ordinary `.cu` files (with proper guards for host code) and start testing configurations other than just 1 thread and 1 block with `nvrtc`.

The implementation is not complete yet and I'm still missing some options, but I'm curious what you think about this!
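
To make the "custom `cicc`" step more concrete, here is a minimal sketch of the two pieces such a wrapper would need: locating the PTX output path in its own argument list and replacing that file's contents. This is not the code from this PR. The helper names `find_output_path` and `overwrite_ptx` are hypothetical, and the sketch assumes the output file is passed as `-o <file>`, which is an assumption about the `cicc` command line rather than documented behavior. Producing the replacement PTX (for example with `nvrtcCompileProgram` and `nvrtcGetPTX`) is left out so the snippet stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: returns the argument that follows "-o", if present.
// Assumption: the wrapper receives the PTX output path as "-o <file>".
std::optional<std::string> find_output_path(const std::vector<std::string>& args) {
    for (std::size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] == "-o") {
            return args[i + 1];
        }
    }
    return std::nullopt;
}

// Hypothetical helper: truncates `path` and writes `ptx` into it.
// Returns false if the file could not be opened or written.
bool overwrite_ptx(const std::string& path, const std::string& ptx) {
    assert(!path.empty());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out << ptx;
    out.flush();
    return out.good();
}
```

A wrapper built on these helpers would presumably forward its arguments to the real `cicc` first, then call `find_output_path` on the same arguments and pass the result to `overwrite_ptx` together with the PTX obtained from `nvrtc`, so that the rest of the `nvcc` pipeline picks up the replaced file.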