diff --git a/docs/data/understand/hipgraph/hip_graph.drawio b/docs/data/how-to/hipgraph/hip_graph.drawio similarity index 100% rename from docs/data/understand/hipgraph/hip_graph.drawio rename to docs/data/how-to/hipgraph/hip_graph.drawio diff --git a/docs/data/understand/hipgraph/hip_graph.svg b/docs/data/how-to/hipgraph/hip_graph.svg similarity index 100% rename from docs/data/understand/hipgraph/hip_graph.svg rename to docs/data/how-to/hipgraph/hip_graph.svg diff --git a/docs/data/understand/hipgraph/hip_graph_speedup.drawio b/docs/data/how-to/hipgraph/hip_graph_speedup.drawio similarity index 100% rename from docs/data/understand/hipgraph/hip_graph_speedup.drawio rename to docs/data/how-to/hipgraph/hip_graph_speedup.drawio diff --git a/docs/data/understand/hipgraph/hip_graph_speedup.svg b/docs/data/how-to/hipgraph/hip_graph_speedup.svg similarity index 100% rename from docs/data/understand/hipgraph/hip_graph_speedup.svg rename to docs/data/how-to/hipgraph/hip_graph_speedup.svg diff --git a/docs/how-to/hipgraph.rst b/docs/how-to/hipgraph.rst index 46293f6aeb..cc8ce147e3 100644 --- a/docs/how-to/hipgraph.rst +++ b/docs/how-to/hipgraph.rst @@ -1,210 +1,424 @@ .. meta:: - :description: This chapter describes how to use HIP graphs. + :description: This chapter describes how to use HIP graphs and highlights their use cases. :keywords: ROCm, HIP, graph, stream .. _how_to_HIP_graph: ******************************************************************************** -Using HIP graphs +HIP graphs ******************************************************************************** -This chapter explains how to create and use HIP graphs. To get a better understanding of -HIP graphs see :ref:`the understand-chapter about HIP graphs`. - -There are two different ways of creating graphs: Capturing kernel launches from a stream, or explicitly creating graphs. +.. note:: + The HIP graph API is currently in Beta. Some features can change and might + have outstanding issues. Not all features supported by CUDA graphs are yet + supported. For a list of all currently supported functions see the + :doc:`HIP graph API documentation<../doxygen/html/group___graph>`. + +HIP graphs are an alternative way of executing tasks on a GPU that can provide +performance benefits over launching kernels using the standard +method via streams. A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent +the operations performed, while the edges mark dependencies between those +operations. + +The nodes can be one of the following: + +- empty nodes +- nested graphs +- kernel launches +- host-side function calls +- HIP memory functions (copy, memset, ...) +- HIP events +- signalling or waiting on external semaphores + +.. note:: + The available node types are specified by ``hipGraphNodeType``. + +The following figure visualizes the concept of graphs, compared to using streams. + +.. figure:: ../data/how-to/hipgraph/hip_graph.svg + :alt: Diagram depicting the difference between using streams to execute + kernels with dependencies, resolved by explicitly calling + hipDeviceSynchronize, or using graphs, where the edges denote the + dependencies. + +The standard method of launching kernels incurs a small overhead +for each iteration of the operation involved. For kernels that perform large +operations during an iteration this overhead is usually negligible. 
However +in many workloads, such as scientific simulations and AI, a kernel might perform a +small operation over a great number of iterations, and so the overhead of repeatedly +launching kernels can have a significant impact on performance. + +HIP graphs are designed to address this issue, by predefining the HIP API calls +and their dependencies with a graph, and performing most of the initialization +beforehand. Launching a graph only requires a single call, after which the +driver takes care of executing the operations within the graph. +Graphs can provide additional performance benefits, by enabling optimizations +that are only possible when knowing the dependencies between the operations. + +.. figure:: ../data/how-to/hipgraph/hip_graph_speedup.svg + :alt: Diagram depicting the speed up achievable with HIP graphs compared to + HIP streams when launching many short-running kernels. + + Qualitative presentation of the execution time of many short-running kernels + when launched using HIP stream versus HIP graph. This does not include the + time needed to set up the graph. -Either way ends up with a ``hipGraph_t``, which is a template for a graph. -In order to actually launch a graph, the template needs to be instantiated using ``hipGraphInstantiate``, -which results in an executable graph of type ``hipGraphExec_t``. -This executable graph can then be launched with ``hipGraphLaunch``, -replaying the operations within the graph. -Stream capture -================================= +******************************************************************************** +Using HIP graphs +******************************************************************************** -The easy way to integrate graphs into already existing code is to use stream capture. +There are two different ways of creating graphs: Capturing kernel launches from +a stream, or explicitly creating graphs. The difference between the two +approaches is explained later in this chapter. + +The general flow for using HIP graphs includes the following steps. + +#. Create a ``hipGraph_t`` graph template using one of the two approaches described in this chapter +#. Create a ``hipGraphExec_t`` executable instance of the graph template using ``hipGraphInstantiate`` +#. Use ``hipGraphLaunch`` to launch the executable graph to a stream +#. After execution completes free and destroy graph resources + +The first two steps are the initial setup and only need to be executed once. First +step is the definition of the operations (nodes) and the dependencies (edges) +between them. The second step is the instantiation of the graph. This takes care +of validating and initializing the graph, to reduce the overhead when executing +the graph. The third step is the execution of the graph, which takes care of +launching all the kernels and executing the operations while respecting their +dependencies and necessary synchronizations as specified. + +Because HIP graphs require some setup and initialization overhead before their +first execution, graphs only provide a benefit for workloads that require +many iterations to complete. + +In both methods the ``hipGraph_t`` template for a graph is used to define the graph. +In order to actually launch a graph, the template needs to be instantiated using +``hipGraphInstantiate``, which results in an actually executable graph of type ``hipGraphExec_t``. +This executable graph can then be launched with ``hipGraphLaunch``, replaying the +operations within the graph. 
Note that launching graphs is fundamentally no different from executing other HIP
+functions on a stream, except that scheduling the operations within the graph
+incurs less overhead and can enable some optimizations; the graph still needs to
+be associated with a stream for execution.
+
+Memory management
+-----------------
+
+Memory that is used by operations in graphs can either be pre-allocated or
+managed within the graph. Graphs can contain nodes that take care of allocating
+memory on the device or copying memory between the host and the device.
+Whether you want to pre-allocate the memory or manage it within the graph
+depends on the use case. If the graph is executed in a tight loop, the
+performance is usually better when the memory is pre-allocated, so that it
+does not need to be reallocated in every iteration.
+
+The same rules as for normal memory allocations apply to memory allocated and
+freed by nodes, meaning that the nodes that access memory allocated in a graph
+must be ordered after the allocation and before the free.
+
+Unlike the normal memory management API, which is controlled by host-side execution,
+this enables the driver to take care of memory reuse and optimizations.
+The lifetime of memory allocated in a graph begins when the execution reaches the
+node allocating the memory, and ends when either reaching the corresponding
+free node within the graph, or after graph execution when a corresponding
+:cpp:func:`hipFreeAsync` or :cpp:func:`hipFree` call is reached.
+The memory can also be freed with a free node in a different graph that is
+associated with the same memory address.
+
+These memory allocations can also be set up to allow access from multiple GPUs,
+just like normal allocations. HIP then takes care of allocating and mapping the
+memory to the GPUs. When capturing a graph from a stream, the node sets the
+accessibility according to ``hipMemPoolSetAccess`` at the time of capturing.
+
+
+Capture graphs from a stream
+================================================================================
+
+The easiest way to integrate HIP graphs into existing code is to use
+``hipStreamBeginCapture`` and ``hipStreamEndCapture`` to obtain a ``hipGraph_t``
+graph template that includes the captured operations.
 
 When starting to capture operations for a graph using ``hipStreamBeginCapture``,
 the operations assigned to the stream are captured into a graph instead of being
-executed. That graph is returned when calling ``hipStreamEndCapture``, which
+executed. The associated graph is returned when calling ``hipStreamEndCapture``, which
 also stops capturing operations.
+In order to capture to an already existing graph, use ``hipStreamBeginCaptureToGraph``.
+
+The functions assigned to the capturing stream are not executed, but instead are
+captured and defined as nodes in the graph, to be run when the graph is
+instantiated and launched.
 
-The following code is an example of how to use the HIP graph API to capture a graph from a stream
+Only functions that are launched to the stream are captured, meaning that
+host-side instructions that the GPU operations depend on need to be scheduled
+using ``hipLaunchHostFunc``, as shown in the following code example.
+Host functions that are not enqueued using ``hipLaunchHostFunc`` are executed as
+encountered and not captured.
+
+Synchronous HIP API calls that are implicitly assigned to the default stream are
+not permitted while capturing a stream, and will return an error. This is
+because they implicitly synchronize and cause a dependency that can not be
+captured within the stream. This includes functions like ``hipMalloc``,
+``hipMemcpy`` and ``hipFree``. In order to capture these operations to the stream,
+replace them with the corresponding asynchronous calls like ``hipMallocAsync``,
+``hipMemcpyAsync`` or ``hipFreeAsync``.
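+
+For example, a host-to-device copy that is meant to be captured must use the
+asynchronous, stream-ordered call. The following is a minimal sketch, not part of
+the full example below; the ``stream``, ``d_array``, ``h_array`` and ``size``
+variables as well as the ``HIP_CHECK`` helper are assumed to exist:
+
+.. code-block:: cpp
+
+    HIP_CHECK(hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal));
+
+    // Not captured: hipMemcpy is synchronous and implicitly uses the default
+    // stream, so calling it here would return an error during capture.
+    // HIP_CHECK(hipMemcpy(d_array, h_array, size, hipMemcpyHostToDevice));
+
+    // Captured: the asynchronous copy is assigned to the capturing stream and
+    // becomes a memcpy node in the resulting graph.
+    HIP_CHECK(hipMemcpyAsync(d_array, h_array, size, hipMemcpyHostToDevice, stream));
+
+    hipGraph_t graph;
+    HIP_CHECK(hipStreamEndCapture(stream, &graph));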
+
+The general flow for using stream capture in order to get a graph template is:
+
+#. Create a stream from which to capture the operations
+
+#. Call ``hipStreamBeginCapture`` before the first operation to be captured
+
+#. Call ``hipStreamEndCapture`` after the last operation to be captured
+
+   #. Define a ``hipGraph_t`` graph template to which ``hipStreamEndCapture``
+      passes the captured graph
+
+The following code is an example of how to use the HIP graph API to capture a
+graph from a stream.
 
 .. code-block:: cpp
 
     #include <hip/hip_runtime.h>
     #include <vector>
 
-    #define HIP_CHECK(c){if(c != hipSuccess) return -1;}
-
     __global__ void kernelA(double* arrayA, size_t size);
     __global__ void kernelB(int* arrayB, size_t size);
-    __global__ void kernelC(double* arrayA, int* arrayB, size_t size);
+    __global__ void kernelC(double* arrayA, const int* arrayB, size_t size);
+
+    struct set_vector_args{
+        std::vector<double>& h_array;
+        double value;
+    };
 
-    int main(){
+    void set_vector(void* args){
+        set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
 
-        size_t array_size = 1U << 20;
-        int numOfBlocks = 1024;
-        int threadsPerBlock = 1024;
+        std::vector<double>& vec{h_args.h_array};
+        vec.assign(vec.size(), h_args.value);
+    }
+
+    void stream_capture_example(){
+        constexpr int numOfBlocks = 1024;
+        constexpr int threadsPerBlock = 1024;
+        constexpr size_t arraySize = 1U << 20;
+
+        // This example assumes that kernelA operates on data that needs to be initialized on
+        // and copied from the host, while kernelB initializes the array that is passed to it.
+        // Both arrays are then used as input to kernelC, where arrayA is also used as
+        // output that is copied back to the host, while arrayB is only read from and not modified.
        double* d_arrayA;
         int* d_arrayB;
-        std::vector<double> h_array(array_size);
-
-        HIP_CHECK(hipMalloc(&d_arrayA, array_size * sizeof(*d_arrayA)));
-        HIP_CHECK(hipMalloc(&d_arrayB, array_size * sizeof(*d_arrayB)));
+        std::vector<double> h_array(arraySize);
+        constexpr double initValue = 2.0;
 
         hipStream_t captureStream;
         HIP_CHECK(hipStreamCreate(&captureStream));
 
-        // Start capturing the operations
+        // Start capturing the operations assigned to the stream
         HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));
 
-        HIP_CHECK(hipMemcpy(d_arrayA, &h_array, array_size * sizeof(*d_arrayA), hipMemcpyHostToDevice));
+        // hipMallocAsync and hipMemcpyAsync are needed to be able to assign the operations to a stream
+        HIP_CHECK(hipMallocAsync(&d_arrayA, arraySize*sizeof(double), captureStream));
+        HIP_CHECK(hipMallocAsync(&d_arrayB, arraySize*sizeof(int), captureStream));
+
+        // Assign host function to the stream
+        // Needs a custom struct to pass the arguments
+        set_vector_args args{h_array, initValue};
+        HIP_CHECK(hipLaunchHostFunc(captureStream, set_vector, &args));
 
-        kernelA<<<numOfBlocks, threadsPerBlock>>>(d_arrayA, array_size);
-        kernelB<<<numOfBlocks, threadsPerBlock>>>(d_arrayB, array_size);
+        HIP_CHECK(hipMemcpyAsync(d_arrayA, h_array.data(), arraySize*sizeof(double), hipMemcpyHostToDevice, captureStream));
 
-        HIP_CHECK(hipDeviceSynchronize());
+        kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, arraySize);
+        kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, arraySize);
+        kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, arraySize);
 
-        kernelC<<<numOfBlocks, threadsPerBlock>>>(d_arrayA, d_arrayB, array_size);
+        HIP_CHECK(hipMemcpyAsync(h_array.data(), d_arrayA, arraySize*sizeof(*d_arrayA), hipMemcpyDeviceToHost, captureStream));
 
-        HIP_CHECK(hipMemcpy(&h_array, d_arrayA, array_size * sizeof(*d_arrayA), hipMemcpyDeviceToHost));
+        HIP_CHECK(hipFreeAsync(d_arrayA, captureStream));
+        HIP_CHECK(hipFreeAsync(d_arrayB, captureStream));
 
+        // Stop capturing
         hipGraph_t graph;
         HIP_CHECK(hipStreamEndCapture(captureStream, &graph));
 
-        // Create an executable graph from the captured graph.
+        // Create an executable graph from the captured graph
         hipGraphExec_t graphExec;
         HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
 
+        // The graph template can be deleted after the instantiation if it's not needed for later use
+        HIP_CHECK(hipGraphDestroy(graph));
+
         // Actually launch the graph. The stream does not have
         // to be the same as the one used for capturing.
         HIP_CHECK(hipGraphLaunch(graphExec, captureStream));
 
+        // Free graph and stream resources after usage
         HIP_CHECK(hipGraphExecDestroy(graphExec));
-        HIP_CHECK(hipFree(d_arrayA));
-        HIP_CHECK(hipFree(d_arrayB));
-        HIP_CHECK(hipGraphDestroy(graph));
         HIP_CHECK(hipStreamDestroy(captureStream));
     }
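+
+Once instantiated, the same executable graph can be launched as often as needed.
+The following minimal sketch (assuming an already instantiated ``graphExec``, a
+``stream`` to launch it on, and a hypothetical ``iterations`` count) illustrates
+the pattern that amortizes the one-time capture and instantiation cost over many
+iterations:
+
+.. code-block:: cpp
+
+    for(int i = 0; i < iterations; ++i){
+        // Replays all captured operations, respecting their dependencies,
+        // with a single launch call per iteration.
+        HIP_CHECK(hipGraphLaunch(graphExec, stream));
+        HIP_CHECK(hipStreamSynchronize(stream));
+    }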
 
-Direct graph creation
-=================================
-Graphs can also be created directly using the HIP graph API, giving more fine-grained control over the graph.
-The nodes are represented by ``hipGraphNode_t``, and the specific parameters
-have a separate type each, e.g. ``hipKernelNodeParams``. Depending on the
-operation, the function to call for adding the node varies. For kernel nodes
-it is ``hipGraphAddKernelNode``, or for memory copies it is ``hipGraphAddMemcpyNode``.
-For a full list see the :doc:`HIP graph API documentation<../doxygen/html/group___graph>`.
+Explicit graph creation
+================================================================================
+
+Graphs can also be created directly using the HIP graph API, giving more
+fine-grained control over the graph. In this case, the graph nodes are created
+explicitly, together with their parameters and dependencies, which specify the
+edges of the graph, thereby forming the graph structure.
+
+The nodes are represented by the generic ``hipGraphNode_t`` type. The actual
+node type is implicitly defined by the specific function used to add the node to
+the graph, for example ``hipGraphAddKernelNode``. See the
+:doc:`HIP graph API documentation<../doxygen/html/group___graph>` for the
+available functions; they follow the naming scheme ``hipGraphAdd{Type}Node``. Each type of
+node also has a predefined set of parameters depending on the operation, for
+example ``hipKernelNodeParams`` for a kernel launch. See the
+:doc:`documentation for the general hipGraphNodeParams type<../doxygen/html/structhip_graph_node_params>`
+for a list of available parameter types and their members.
+
+The general flow for explicitly creating a graph is usually:
+
+#. Create a graph ``hipGraph_t``
+
+#. Create the nodes with their parameters and add them to the graph
+
+   #. Define a ``hipGraphNode_t``
+   #. Define the parameter struct for the desired operation, by explicitly setting the appropriate struct's members
+   #. Use the appropriate ``hipGraphAdd{Type}Node`` function to add the node to the graph
+   #. Define the dependencies when adding the node to the graph, or afterwards by using ``hipGraphAddDependencies``
+
+The following code example demonstrates how to explicitly create nodes in order to create a graph.
 
 .. code-block:: cpp
 
     #include <hip/hip_runtime.h>
     #include <vector>
 
-    #define HIP_CHECK(c){if(c != hipSuccess) return -1;}
-
     __global__ void kernelA(double* arrayA, size_t size);
     __global__ void kernelB(int* arrayB, size_t size);
-    __global__ void kernelC(double* arrayA, int* arrayB, size_t size);
+    __global__ void kernelC(double* arrayA, const int* arrayB, size_t size);
 
-    int main(){
+    struct set_vector_args{
+        std::vector<double>& h_array;
+        double value;
+    };
 
-        size_t array_size = 1U << 20;
-        int numberOfBlocks = 1024;
-        int threadsPerBlock = 1024;
+    void set_vector(void* args){
+        set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
 
-        double* d_arrayA;
-        int* d_arrayB;
-        std::vector<double> h_array(array_size);
+        std::vector<double>& vec{h_args.h_array};
+        vec.assign(vec.size(), h_args.value);
+    }
 
-        HIP_CHECK(hipMalloc(&d_arrayA, array_size * sizeof(*d_arrayA)));
-        HIP_CHECK(hipMalloc(&d_arrayB, array_size * sizeof(*d_arrayB)));
+    void explicit_graph_example(){
+        constexpr int numOfBlocks = 1024;
+        constexpr int threadsPerBlock = 1024;
+        size_t arraySize = 1U << 20;
 
-        // Set up parameters for kernel and copy nodes
-        hipKernelNodeParams kernelAParams, kernelBParams, kernelCParams;
-        hipMemcpy3DParms cpyToDevParams, cpyToHostParams;
+        // The pointers to the device memory don't need to be declared here,
+        // they are contained within the hipMemAllocNodeParams as the dptr member
+        std::vector<double> h_array(arraySize);
+        constexpr double initValue = 2.0;
 
-        void* kernelAArgs[] = {static_cast<void*>(&d_arrayA), static_cast<void*>(&array_size)};
+        // Create an empty graph
+        hipGraph_t graph;
+        HIP_CHECK(hipGraphCreate(&graph, 0));
+
+        // Parameters to allocate the arrays
+        hipMemAllocNodeParams allocArrayAParams{};
+        allocArrayAParams.poolProps.allocType = hipMemAllocationTypePinned;
+        allocArrayAParams.poolProps.location.type = hipMemLocationTypeDevice;
+        allocArrayAParams.poolProps.location.id = 0; // GPU on which the memory resides
+        allocArrayAParams.bytesize = arraySize * sizeof(double);
+
+        hipMemAllocNodeParams allocArrayBParams{};
+        allocArrayBParams.poolProps.allocType = hipMemAllocationTypePinned;
+        allocArrayBParams.poolProps.location.type = hipMemLocationTypeDevice;
+        allocArrayBParams.poolProps.location.id = 0; // GPU on which the memory resides
+        allocArrayBParams.bytesize = arraySize * sizeof(int);
+
+        // Add the allocation nodes to the graph. They don't have any dependencies
+        hipGraphNode_t allocNodeA, allocNodeB;
+        HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeA, graph, nullptr, 0, &allocArrayAParams));
+        HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeB, graph, nullptr, 0, &allocArrayBParams));
+
+        // Parameters for the host function
+        // Needs a custom struct to pass the arguments
+        set_vector_args args{h_array, initValue};
+        hipHostNodeParams hostParams{};
+        hostParams.fn = set_vector;
+        hostParams.userData = static_cast<void*>(&args);
+
+        // Add the host node that initializes the host array. It also doesn't have any dependencies
+        hipGraphNode_t hostNode;
+        HIP_CHECK(hipGraphAddHostNode(&hostNode, graph, nullptr, 0, &hostParams));
+
+        // Add the memory copy node that copies the initialized host array to the device.
+        // It has to wait for the host array to be initialized and the device memory to be allocated
+        hipGraphNode_t cpyNodeDependencies[] = {allocNodeA, hostNode};
+        hipGraphNode_t cpyToDevNode;
+        HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToDevNode, graph, cpyNodeDependencies, 2, allocArrayAParams.dptr, h_array.data(), arraySize * sizeof(double), hipMemcpyHostToDevice));
+
+        // Parameters for kernelA
+        hipKernelNodeParams kernelAParams;
+        void* kernelAArgs[] = {&allocArrayAParams.dptr, static_cast<void*>(&arraySize)};
         kernelAParams.func = reinterpret_cast<void*>(kernelA);
-        kernelAParams.gridDim = numberOfBlocks;
+        kernelAParams.gridDim = numOfBlocks;
         kernelAParams.blockDim = threadsPerBlock;
         kernelAParams.sharedMemBytes = 0;
         kernelAParams.kernelParams = kernelAArgs;
         kernelAParams.extra = nullptr;
 
-        void* kernelBArgs[] = {static_cast<void*>(&d_arrayB), static_cast<void*>(&array_size)};
-        kernelBParams.func = reinterpret_cast<void*>(kernelB);
-        kernelAParams.gridDim = numberOfBlocks;
-        kernelAParams.blockDim = threadsPerBlock;
-        kernelAParams.sharedMemBytes = 0;
-        kernelAParams.kernelParams = kernelBArgs;
-        kernelAParams.extra = nullptr;
+        // Add the node for kernelA. It has to wait for the memory copy to finish, as it depends on the values from the host array.
+        hipGraphNode_t kernelANode;
+        HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, &cpyToDevNode, 1, &kernelAParams));
 
-        void* kernelCArgs[] = {static_cast<void*>(&d_arrayA), static_cast<void*>(&d_arrayB), static_cast<void*>(&array_size)};
+        // Parameters for kernelB
+        hipKernelNodeParams kernelBParams;
+        void* kernelBArgs[] = {&allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
+        kernelBParams.func = reinterpret_cast<void*>(kernelB);
+        kernelBParams.gridDim = numOfBlocks;
+        kernelBParams.blockDim = threadsPerBlock;
+        kernelBParams.sharedMemBytes = 0;
+        kernelBParams.kernelParams = kernelBArgs;
+        kernelBParams.extra = nullptr;
+
+        // Add the node for kernelB. It only has to wait for the memory to be allocated, as it initializes the array.
+        hipGraphNode_t kernelBNode;
+        HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, &allocNodeB, 1, &kernelBParams));
+
+        // Parameters for kernelC
+        hipKernelNodeParams kernelCParams;
+        void* kernelCArgs[] = {&allocArrayAParams.dptr, &allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
         kernelCParams.func = reinterpret_cast<void*>(kernelC);
-        kernelAParams.gridDim = numberOfBlocks;
-        kernelAParams.blockDim = threadsPerBlock;
-        kernelAParams.sharedMemBytes = 0;
-        kernelAParams.kernelParams = kernelCArgs;
-        kernelAParams.extra = nullptr;
-
-        cpyToDevParams.srcArray = nullptr;
-        cpyToDevParams.srcPos = make_hipPos(0, 0, 0);
-        cpyToDevParams.srcPtr = make_hipPitchedPtr(h_array.data(), array_size * sizeof(h_array[0]), array_size, 1);
-        cpyToDevParams.dstArray = nullptr;
-        cpyToDevParams.dstPos = make_hipPos(0, 0, 0);
-        cpyToDevParams.dstPtr = make_hipPitchedPtr(d_arrayA, array_size * sizeof(*d_arrayA), array_size, 1);
-        cpyToDevParams.extent = make_hipExtent(array_size * sizeof(*d_arrayA), 1, 1);
-        cpyToDevParams.kind = hipMemcpyHostToDevice;
-
-        cpyToHostParams.srcArray = nullptr;
-        cpyToHostParams.srcPos = make_hipPos(0, 0, 0);
-        cpyToHostParams.srcPtr = make_hipPitchedPtr(d_arrayA, array_size * sizeof(*d_arrayA), array_size, 1);
-        cpyToHostParams.dstArray = nullptr;
-        cpyToHostParams.dstPos = make_hipPos(0, 0, 0);
-        cpyToHostParams.dstPtr = make_hipPitchedPtr(h_array.data(), array_size * sizeof(h_array[0]), array_size, 1);
-        cpyToHostParams.extent = make_hipExtent(array_size * sizeof(*d_arrayA), 1, 1);
-        cpyToHostParams.kind = hipMemcpyDeviceToHost;
-
-        // Create graph and add nodes with their respective parameters
-        hipGraph_t graph;
-        hipGraphNode_t kernelANode, kernelBNode, kernelCNode, cpyToDevNode, cpyToHostNode;
-        HIP_CHECK(hipGraphCreate(&graph, 0));
-
-        // Add copy operations
-        HIP_CHECK(hipGraphAddMemcpyNode(&cpyToDevNode, graph, nullptr, 0, &cpyToDevParams));
-        HIP_CHECK(hipGraphAddMemcpyNode(&cpyToHostNode, graph, nullptr, 0, &cpyToHostParams));
-
-        // Add kernels to graph
-        HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, nullptr, 0, &kernelAParams));
-        HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, nullptr, 0, &kernelBParams));
-        HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, nullptr, 0, &kernelCParams));
-
-        // Add dependencies between nodes
-        // kernels A and B have to wait for the copy operation
-        HIP_CHECK(hipGraphAddDependencies(graph, &cpyToDevNode, &kernelANode, 1));
-        HIP_CHECK(hipGraphAddDependencies(graph, &cpyToDevNode, &kernelBNode, 1));
-        // kernel C is dependent on kernels A and B
-        HIP_CHECK(hipGraphAddDependencies(graph, &kernelANode, &kernelCNode, 1));
-        HIP_CHECK(hipGraphAddDependencies(graph, &kernelBNode, &kernelCNode, 1));
-        // The copy back to the host has to wait for kernel C to finish
-        HIP_CHECK(hipGraphAddDependencies(graph, &kernelCNode, &cpyToHostNode, 1));
-
-        // Instantiate graph the just created graph in order to execute it
+        kernelCParams.gridDim = numOfBlocks;
+        kernelCParams.blockDim = threadsPerBlock;
+        kernelCParams.sharedMemBytes = 0;
+        kernelCParams.kernelParams = kernelCArgs;
+        kernelCParams.extra = nullptr;
+
+        // Add the node for kernelC. It has to wait on both kernelA and kernelB to finish, as it depends on their results.
+        hipGraphNode_t kernelCNode;
+        hipGraphNode_t kernelCDependencies[] = {kernelANode, kernelBNode};
+        HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, kernelCDependencies, 2, &kernelCParams));
+
+        // Copy the results back to the host. Has to wait for kernelC to finish.
+ hipGraphNode_t cpyToHostNode; + HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToHostNode, graph, &kernelCNode, 1, h_array.data(), allocArrayAParams.dptr, arraySize * sizeof(double), hipMemcpyDeviceToHost)); + + // Free array of allocNodeA. It needs to wait for the copy to finish, as kernelC stores its results in it. + hipGraphNode_t freeNodeA; + HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeA, graph, &cpyToHostNode, 1, allocArrayAParams.dptr)); + // Free array of allocNodeB. It only needs to wait for kernelC to finish, as it is not written back to the host. + hipGraphNode_t freeNodeB; + HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeB, graph, &kernelCNode, 1, allocArrayBParams.dptr)); + + // Instantiate the graph in order to execute it hipGraphExec_t graphExec; HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0)); - // Launch the executable graph + // The graph can be freed after the instantiation if it's not needed for other purposes + HIP_CHECK(hipGraphDestroy(graph)); + + // Actually launch the graph hipStream_t graphStream; HIP_CHECK(hipStreamCreate(&graphStream)); HIP_CHECK(hipGraphLaunch(graphExec, graphStream)); HIP_CHECK(hipGraphExecDestroy(graphExec)); - HIP_CHECK(hipFree(d_arrayA)); - HIP_CHECK(hipFree(d_arrayB)); - HIP_CHECK(hipGraphDestroy(graph)); HIP_CHECK(hipStreamDestroy(graphStream)); } diff --git a/docs/how-to/programming_manual.md b/docs/how-to/programming_manual.md index 22847adaf9..bac20c9996 100644 --- a/docs/how-to/programming_manual.md +++ b/docs/how-to/programming_manual.md @@ -146,7 +146,7 @@ For Linux developers, the link [here](https://github.com/ROCm/hip-tests/blob/dev ## HIP Graph -HIP graphs are supported. For more details, refer to the [HIP API Guide](../doxygen/html/group___graph) or the [understand section for HIP graphs](../understand/hipgraph). +HIP graphs are supported. For more details, refer to the [HIP API Guide](../doxygen/html/group___graph) or the [how-to section for HIP graphs](../how-to/hipgraph). ## Device-Side Malloc diff --git a/docs/index.md b/docs/index.md index b9c2a788e6..fed497c24d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -31,7 +31,6 @@ On non-AMD platforms, like NVIDIA, HIP provides header files required to support * {doc}`./understand/programming_model` * {doc}`./understand/hardware_implementation` -* {doc}`./understand/hipgraph` * {doc}`./understand/amd_clr` * [Texture fetching](./understand/texture_fetching) diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 7a9d56480f..d59455f10b 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -17,7 +17,6 @@ subtrees: entries: - file: understand/programming_model - file: understand/hardware_implementation - - file: understand/hipgraph - file: understand/amd_clr - file: understand/texture_fetching title: Texture fetching @@ -37,7 +36,7 @@ subtrees: - file: how-to/virtual_memory title: Virtual memory - file: how-to/hipgraph - title: HIP graph + title: HIP graphs - file: how-to/faq - caption: Reference diff --git a/docs/understand/hipgraph.rst b/docs/understand/hipgraph.rst deleted file mode 100644 index 12652beb1c..0000000000 --- a/docs/understand/hipgraph.rst +++ /dev/null @@ -1,114 +0,0 @@ -.. meta:: - :description: This chapter provides an overview over the usage of HIP graph. - :keywords: ROCm, HIP, graph, stream - -.. _understand_HIP_graph: - -******************************************************************************** -HIP graph -******************************************************************************** - -.. 
note:: - The HIP graph API is currently in Beta. Some features can change and might - have outstanding issues. Not all features supported by CUDA graphs are yet - supported. For a list of all currently supported functions see the - :doc:`HIP graph API documentation<../doxygen/html/group___graph>`. - -A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent -the operations performed, while the edges mark dependencies between those -operations. - -The nodes can be one of the following: - -- empty nodes -- nested graphs -- kernel launches -- host-side function calls -- HIP memory functions (copy, memset, ...) -- HIP events -- signalling or waiting on external semaphores - -The following figure visualizes the concept of graphs, compared to using streams. - -.. figure:: ../data/understand/hipgraph/hip_graph.svg - :alt: Diagram depicting the difference between using streams to execute - kernels with dependencies, resolved by explicitly calling - hipDeviceSynchronize, or using graphs, where the edges denote the - dependencies. - -Node types --------------------------------------------------------------------------------- - -The available node types are specified by :cpp:enumerator:`hipGraphNodeType`. - -Memory management nodes -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Memory management nodes handle allocating and freeing of memory of a graph. -Memory management nodes can be created by using the explicit API functions, or -by capturing :cpp:func:`hipMallocAsync` or :cpp:func:`hipFreeAsync`. -Unlike the normal memory management API, which is controlled by host-side execution, -this enables HIP to take care of memory reuse and optimizations. -The lifetime of memory allocated in a graph begins when the execution reaches the -node allocating the memory, and ends when either reaching the corresponding -free node within the graph, or after graph execution with a corresponding -:cpp:func:`hipFreeAsync` call, or a corresponding :cpp:func:`hipFree` call. -The memory can also be freed with a free node in a different graph that is -associated with the same memory address. - -The same rules as for normal memory allocations apply for memory allocated and -freed by nodes, meaning that the nodes that access memory allocated in a graph -must be ordered after the allocation node and before the freeing node. - -These memory allocations can also be set up to allow access from multiple GPUs, -just like normal allocations. HIP then takes care of allocating and mapping the -memory to the GPUs. When capturing a graph from a stream, the node sets the -accessibility according to hipMemPoolSetAccess at the time of capturing. - -HIP graph advantages -================================================================================ - -The standard way of launching work on GPUs via streams incurs a small overhead -for each iteration of the operation involved. For kernels that perform large -operations during an iteration this overhead is usually negligible. However -in many workloads, such as scientific simulations and AI, a kernel performs a -small operation for many iterations, and so the overhead of launching kernels -can be a significant cost on performance. - -HIP graphs have been specifically designed to tackle this problem by only -requiring one launch from the host per iteration, and minimizing that overhead -by performing most of the initialization beforehand. 
Graphs can provide -additional performance benefits, by enabling optimizations that are only -possible when knowing the dependencies between the operations. - -.. figure:: ../data/understand/hipgraph/hip_graph_speedup.svg - :alt: Diagram depicting the speed up achievable with HIP graphs compared to - HIP streams when launching many short-running kernels. - - Qualitative presentation of the execution time of many short-running kernels - when launched using HIP stream versus HIP graph. This does not include the - time needed to set up the graph. - -HIP graph usage -================================================================================ - -Using HIP graphs to execute your work requires three different steps, where the -first two are the initial setup and only need to be executed once. First the -definition of the operations (nodes) and the dependencies (edges) between them. -The second step is the instantiation of the graph. This takes care of validating -and initializing the graph, to reduce the overhead when executing the graph. - -The third step is the actual execution of the graph, which then takes care of -launching all the kernels and executing the operations while respecting their -dependencies and necessary synchronizations as specified. - -As HIP graphs require some set up and initialization overhead before their first -execution, they only provide a benefit for workloads that require many iterations to complete. - -Setting up HIP graphs -================================================================================ - -HIP graphs can be created by explicitly defining them, or using stream capture. -For further information on how to use HIP graphs see :ref:`the how-to-chapter about HIP graphs`. -For the available functions see the -:doc:`HIP graph API documentation<../doxygen/html/group___graph>`.