Use MKL for verification #63

Open

wants to merge 58 commits into base: sycl-develop
Conversation


@AD2605 AD2605 commented May 15, 2024

Adds support to use oneMKL for verification and verifies example 14.

aacostadiaz and others added 30 commits March 21, 2024 16:01
Co-authored-by: Mehdi Goli <[email protected]>
Co-authored-by: Mehdi Goli <[email protected]>
* Migrate cute sgemm_nt_1 to sycl

* Revert "Migrate cute sgemm_nt_1 to sycl"

* Revert "Revert "Migrate cute sgemm_nt_1 to sycl""

* Add cutlass example

* Update examples/14_ampere_tf32_tensorop_gemm/CMakeLists.txt

Co-authored-by: Mehdi Goli <[email protected]>

* Update examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm_cute.cpp

Co-authored-by: Mehdi Goli <[email protected]>

* Update examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm_cute.cu

Co-authored-by: Mehdi Goli <[email protected]>

* Update examples/cute/tutorial/CMakeLists.txt

Co-authored-by: Mehdi Goli <[email protected]>

* Update include/cute/arch/copy_sm80.hpp

Co-authored-by: Mehdi Goli <[email protected]>

* Update include/cute/config.hpp

Co-authored-by: Mehdi Goli <[email protected]>

* Update include/cute/util/debug.hpp

Co-authored-by: Mehdi Goli <[email protected]>

* Update include/cute/util/debug.hpp

Co-authored-by: Mehdi Goli <[email protected]>

* Update include/cutlass/detail/helper_macros.hpp

Co-authored-by: Mehdi Goli <[email protected]>

* Address comments

---------

Co-authored-by: Mehdi Goli <[email protected]>
* Remove sgemm_nt_1 PoC

* Fix build issues

* Fix code style format

* Remove ENABLE_NVPTX flag

* Update include/cute/util/debug.hpp

Co-authored-by: Mehdi Goli <[email protected]>

* Cosmetic

---------

Co-authored-by: Mehdi Goli <[email protected]>
…tware#16)

* Updating README-sycl.md to capture the 3.5 modifications

* Update README-sycl.md

Co-authored-by: aacostadiaz <[email protected]>

* Remove the sgemm_nt_1_sycl PoC (codeplaysoftware#15)

* Remove sgemm_nt_1 PoC

* Fix build issues

* Fix code style format

* Remove ENABLE_NVPTX flag

* Update include/cute/util/debug.hpp

Co-authored-by: Mehdi Goli <[email protected]>

* Cosmetic

---------

Co-authored-by: Mehdi Goli <[email protected]>

* Applying the comments

---------

Co-authored-by: aacostadiaz <[email protected]>
* Adding CuTe example in SYCL

Co-authored-by: Alejandro Acosta <[email protected]>
Co-authored-by: Mehdi Goli <[email protected]>
* Add cmake configuration

* Update examples/cute/tutorial/CMakeLists.txt

Co-authored-by: Mehdi Goli <[email protected]>

---------

Co-authored-by: Mehdi Goli <[email protected]>
* Update README-sycl.md

Fixing CUDA version
Fix typo in Macro
Co-authored-by: Mehdi Goli <[email protected]>

* Cosmetic

---------

Co-authored-by: Mehdi Goli <[email protected]>

* Applying the comments

---------

Co-authored-by: aacostadiaz <[email protected]>

* Revert "Updating README-sycl.md to capture the 3.5 modifications (codeplaysoftware#16)" (codeplaysoftware#17)

This reverts commit a726bd3.

* fix typo in macro

---------

Co-authored-by: Mehdi Goli <[email protected]>
Co-authored-by: aacostadiaz <[email protected]>
@AD2605 AD2605 requested a review from aacostadiaz May 15, 2024 16:17
Resolved review threads: cmake/FindDPCPP.cmake, cmake/onemkl.cmake, examples/CMakeLists.txt, tools/util/include/cutlass/util/device_memory.h

@aacostadiaz aacostadiaz left a comment


LGTM

Review threads on CMakeLists.txt (resolved)
@@ -42,6 +42,10 @@
#include "cutlass/util/device_memory.h"
#include "helper.h"

#if defined(CUTLASS_ENABLE_MKL)
# include <oneapi/mkl.hpp>
Collaborator:

You are building the oneMKL interface for an Nvidia GPU, but you are linking it to the closed-source MKL for Intel even though it runs on Nvidia? What is your mechanism to get correct linking and to make sure that oneMKL does not run on the Intel CPU instead of calling cuBLAS behind the scenes to reach the GPU? If you are always testing on the Intel CPU, there is no need to have this:

```
  -DENABLE_MKLGPU_BACKEND=${ENABLE_MKLGPU_BACKEND}
  -DENABLE_CUBLAS_BACKEND=${ENABLE_CUBLAS_BACKEND}
```

Collaborator (Author):

What is your mechanism to have correct linking and make sure that oneMKL does not run on the Intel CPU?

From what I understand of the oneMKL interface, dispatch follows the queue: whether it is a GPU or a CPU queue, and, for a GPU queue, which backend is selected depends on the device. Since our queue will never be a CPU queue, it will never go down the CPU BLAS path.
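For context, a minimal sketch of what queue-based dispatch looks like with the oneMKL interface. This is a hypothetical helper, not code from this PR; the `gemm` call is the standard oneMKL interfaces USM API, but the function name and parameter choices are illustrative:

```cpp
// Sketch: the oneMKL interface picks its backend (cuBLAS, closed-source
// oneMKL, ...) from the device behind the queue, so a GPU queue never
// dispatches to the CPU BLAS implementation.
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>

void reference_gemm(sycl::queue &q, int m, int n, int k,
                    const float *A, const float *B, float *C) {
  using oneapi::mkl::transpose;
  // C = A * B, column-major, USM pointers allocated on q's device.
  oneapi::mkl::blas::column_major::gemm(
      q, transpose::nontrans, transpose::nontrans,
      m, n, k, 1.0f, A, m, B, k, 0.0f, C, m);
  q.wait();
}
```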

@mehdi-goli (Collaborator) commented May 28, 2024:

oneMKL does not have all the backends unless you provide them; for example, if you don't give the correct path to the backend, there is no GPU or Nvidia backend. The oneMKL interface is a wrapper over three closed-source implementations: the oneMKL product, cuBLAS, and rocBLAS. If a backend is not available, it does not link to the correct one. The closed-source oneMKL product is only for Intel CPU and GPU, so when running on an Nvidia GPU you don't need the oneMKL product backend at all; similarly, when running on an Intel GPU you don't need the cuBLAS backend at all.

Collaborator (Author):

If the backend is not available, it does not link to the correct one. The closed-source oneMKL product is only for Intel CPU and GPU, so when running on an Nvidia GPU you don't need the oneMKL product backend at all; similarly, when running on an Intel GPU you don't need the cuBLAS backend at all.

Only one backend is being built. The CPU backend is explicitly turned off, and either the cuBLAS or the MKLGPU (closed-source oneMKL) backend is built, here, according to the SYCL target being passed.
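As a sketch of that build logic (the option names follow the oneMKL interface project; the exact wiring in this PR's cmake/onemkl.cmake may differ):

```cmake
# Build exactly one GPU backend of the oneMKL interface, chosen by the SYCL
# target; the CPU backend stays off in all configurations.
set(ENABLE_MKLCPU_BACKEND OFF)
if(SYCL_NVIDIA_TARGET)
  set(ENABLE_CUBLAS_BACKEND ON)
  set(ENABLE_MKLGPU_BACKEND OFF)
else()
  set(ENABLE_CUBLAS_BACKEND OFF)
  set(ENABLE_MKLGPU_BACKEND ON)   # closed-source oneMKL, Intel GPU only
endif()
```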

#if defined(CUTLASS_ENABLE_MKL)
auto d_CRef = syclcompat::malloc<TC>(h_C.size());
auto d_hRef = std::vector<TC>(h_C.size());
auto queue = syclcompat::get_default_queue();
Collaborator:

What is the default queue here, and how can you make sure it is an Nvidia GPU queue? This test is specifically set up for an Ampere device and needs to be compiled only for that device, because the architecture is defined at compile time. How do you make sure you get the correct queue on all systems when you use the default here?

Collaborator (Author):

This test is specifically set up for an Ampere device and needs to be compiled only for that device, because the architecture is defined at compile time.

This example only builds if SYCL_NVIDIA_TARGET is defined (here).

how can you make sure the queue is an Nvidia GPU queue

Since we do not explicitly create a queue or pass any sort of device selection to syclcompat, I would expect the device selector to be set when invoking the binary, assuming multiple backends are present on the platform.

The reason I use the default_queue from syclcompat is that it is used throughout, for memory allocations etc.

Collaborator:

The selector should be passed somehow, probably through SYCLcompat; the SYCL default queue does not guarantee selection of any particular device, and you won't have any control when multiple GPUs are available.

Collaborator:

You can create a queue using the GPU device selector and then use syclcompat::set_default_queue. After that, you can add a check like the one in the CUDA example:

```c++
  cudaDeviceProp props;
  bool notSupported = false;

  cudaError_t error = cudaGetDeviceProperties(&props, 0);
  if (error != cudaSuccess) {
    std::cerr << "cudaGetDeviceProperties() returned an error: " << cudaGetErrorString(error) << std::endl;
    return -1;
  }

  if (!((props.major * 10 + props.minor) >= 80)) {
    std::cerr << "Ampere Tensor Core operations must be run on a machine with compute capability at least 80."
              << std::endl;
    notSupported = true;
  }

  if (notSupported) {
    // Returning zero so this test passes on older Toolkits. Its actions are no-op.
    return 0;
  }
```
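A minimal sketch of the SYCL side of that suggestion (it assumes the syclcompat::set_default_queue call mentioned above; the function name `select_gpu_queue_or_skip` and the syclcompat header path are illustrative and may vary across DPC++ versions):

```cpp
#include <iostream>
#include <sycl/sycl.hpp>
#include <syclcompat.hpp>  // header name may differ across DPC++ versions

bool select_gpu_queue_or_skip() {
  try {
    // Explicitly pick a GPU device instead of relying on the default selector.
    sycl::queue q{sycl::gpu_selector_v};
    syclcompat::set_default_queue(q);
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
    return true;
  } catch (const sycl::exception &) {
    // No GPU available; mirror the CUDA example's "not supported" no-op path.
    std::cerr << "No GPU device available; skipping.\n";
    return false;
  }
}
```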

Review thread on examples/CMakeLists.txt (resolved)
AD2605 and others added 4 commits May 28, 2024 08:18
Co-authored-by: Mehdi Goli <[email protected]>
Co-authored-by: Mehdi Goli <[email protected]>
Co-authored-by: Mehdi Goli <[email protected]>