I'm looking to build this as a Python package, rather than having Python launch the C/CUDA code in a subprocess.
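To make the difference concrete, here is a rough sketch of calling native code in-process instead of through a subprocess. It assumes a POSIX-like system. The standard C math library is only a stand-in for the compiled C/CUDA code, which isn't shown here. `ctypes` is just one possible binding route; a compiled extension module would be another. The helper name `load_native_cos` is hypothetical.

```python
import ctypes
import ctypes.util


def load_native_cos():
    """Return the C `cos` function, loaded into the current Python process."""
    # Assumption: libm stands in for the project's own shared library
    # (e.g. one built from the C/CUDA sources); the real library is not shown.
    path = ctypes.util.find_library("m")
    # On POSIX, CDLL(None) falls back to symbols already loaded in this process.
    lib = ctypes.CDLL(path)
    # Declare the C signature so ctypes converts arguments and results correctly.
    lib.cos.argtypes = [ctypes.c_double]
    lib.cos.restype = ctypes.c_double
    return lib.cos


native_cos = load_native_cos()
# Direct in-process call: no child process is started and no output is parsed.
result = native_cos(0.0)
```

The call hands back a Python float directly, whereas the subprocess route would have to pass data through arguments, files, or pipes and then parse whatever the child process prints.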