Develop (#86)
* Jiaruifang/update readme (#14)

update readme
set PyTorch version as 1.4.0
use 0.2.0 as turbo version

* Jiaruifang/update readme (#16)

upgrade onnxrt to v1.2.0 in dev cpu docker
add how to use dockerhub to get a CPU version

* Jiaruifang/update readme (#17)

* update readme

* update readme

* add benchmark compared with gemmlowp

* return hidden_states from BertModel

* update p40 speedup fig

* add github action hook for branch develop

* revert back a matmul benchmark unit test

* Set PyTorch cpu version as 1.4.0

* fix a typo

* rm torchvision in docker_ci

* update readme and use 0.2.0 as version

* upgrade onnxrt to v1.2.0 in dev cpu docker
add how to use dockerhub to get a CPU version

* fix a typo

* delete turbotransformer. and add a blank line in readme (#20)

* remove duplicated license comments.
update readme, more accurately describing variable-length for onnxruntime.

* Jiaruifang/polish (#30)

remove duplicated license comments.
update readme, more accurately describing variable-length for onnxruntime.

* because hidden states were added in the bert layer, fix sequence classification accordingly (#36)

* Jiaruifang/amd blis (#69)

add blis support for AMD cpus.

* Jiaruifang/decoder gpu allocator (#85)

* Jiaruifang/multi head attn (#29)

Add a more functional multiheadedattention.
Add positionwise-feed-forward.
Add multiheadedattention.

* Jiaruifang/transformer decoder layer (#32)

add TransformerDecoderLayer

* Jiaruifang/transformer decoder layer (#33)

* add TransformerDecoderLayer
* check that multi-headed attn's max_relative_positions is 0

* Jiaruifang/transformer decoder (#35)

fix multi_headed_attention_test.py bug

* Jiaruifang/fixbug multiheadedattn (#40)

* add attn as return values for decoder

* check attns in decoder_transformer_decoder_layer_test

* fix multi_headed_attention_test.py bug

* add set_stderr_verbose_level python interface

* add profiling method for decoder_multi_headed_attn_test

* fix bugs in multiheadedattn caused by mask

* option of WITH_PROFILER in CMakeLists set as OFF

* fix bug for profiler

* Jiaruifang/weight trans ffn (#43)

* profile ffn. tuned weight transpose for intel 61xx

* finetuned multi_headed_attention layer

* fix some bugs.

* Jiaruifang/merge bert multiheaded attn (#49)

use multiheaded attn to do bert attention

* Jiaruifang/gpu decoder (#51)

add gpu transformer decoder implementation.
using cub::cachingallocator still has some bugs to be fixed.
performance to be tuned.

* add layernorm support for multi headed attn from_torch

* fix a bug in from_torch of MultiHeadedAttention

* fix bugs from attn masks in transformer decoder layer. (#64)

* fix bugs from attn masks in transformer decoder layer.

* polish code

* Jiaruifang/debug decoder layer mask (#68)

transformer decoder mask float -> bool
make multiheaded attn able to take layer_cache as an input parameter.
add layer_cache for self attn.

* softmax supports 3D mask (#72)

gpu softmax support 3D mask.

* Develop (#74)

Add blis support for AMD cpus.

* init best fit cuda allocator.

* fix a bug of GetInstance

* TODO remove temp tensor

* remove temp tensor.

* fix a bug

* add cuda allocator unitests.

* fix a bug in best fit cuda allocator.

* more unitests for cuda allocator.

* a wrong version, all gpu unit tests do not pass.

* add comments for best fit and upgrade release version.

* merge decoder and best fit cuda memory allocator.

* update readme

* Jiaruifang/cpu allocator (#88)

* Develop (#74)

Add blis support for AMD cpus.

* add cpu best fit allocator.

* Jiaruifang/debug decoder layer mask (#89)

* add cpu best fit allocator.
* fix a bug in allocator test.
* fix tgt_pad_mask bug
* update README
* revert back to cub allocator

* Jiaruifang/benchmark amd blas (#90)

* Develop (#74)

Add blis support for AMD cpus.

* Polish the benchmark code for BLAS on AMD CPU.

* add general GEMM benchmark.

* show blas type in matmul_benchmark

* Jiaruifang/gpu timer (#91)

* add gpu profiler.

* fix a bug caused by attn_score in bert attention.

* fix attn_score bug.

* Jiaruifang/gpu concat (#92)

* add gpu profiler.

* fix a bug caused by attn_score in bert attention.

* fix attn_score bug.

* accelerate GPU concat

* add loss file

* Jiaruifang/profiler kernels (#97)

* add gpu profiler.

* fix a bug caused by attn_score in bert attention.

* fix attn_score bug.

* accelerate GPU concat

* add loss file

* print profiling result in increasing order. Fix the best fit cuda allocator bug.

* move profiler into functions.

* Jiaruifang/fix bestfit bug (#98)

* Develop (#74)

Add blis support for AMD cpus.

* fix a bug in cpp mask (#95)

* Fix bestfit allocator bug.

* Update readme

* Jiaruifang/fix bestfit bug (#99)

* Develop (#74)

Add blis support for AMD cpus.

* fix a bug in cpp mask (#95)

* Fix bestfit allocator bug.

* Update readme

* add a missing file.

* update readme, and fix attn score bug in bert_attn (#100)

* update readme, and fix attn score bug in bert_attn

* fix shared ptr bug.

* fix cuda c++11 bug.

* Jiaruifang/decoder readme (#101)

* update readme, and fix attn score bug in bert_attn

* fix shared ptr bug.

* fix cuda c++11 bug.

* Update Readme

Co-authored-by: shicheng <[email protected]>
feifeibear and shicheng0829 authored Jun 28, 2020
1 parent af84878 commit 72097bf
Showing 74 changed files with 3,788 additions and 630 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/dockerimage.yml
@@ -5,10 +5,13 @@ on:
branches:
- master
- develop
- feature/decoder
pull_request:
branches:
- master
- develop
- feature/decoder


jobs:
build:
8 changes: 3 additions & 5 deletions CMakeLists.txt
@@ -21,9 +21,9 @@ set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_FLAGS "-Wall")
set(CMAKE_C_FLAGS "-Wall")

set(TURBO_TRANSFORMERS_VERSION 0.2.1)
set(TURBO_TRANSFORMERS_VERSION 0.3.0)

option(WITH_PROFILER "Compile with gperftools" OFF)
option(WITH_PROFILER "Compile with profiler" OFF)
option(WITH_GPU "Build with GPU" OFF)
option(WITH_MODULE_BENCHMAKR "Catch2 unitest with benchmarking" ON)

@@ -65,9 +65,7 @@ endif ()


if (WITH_PROFILER)
find_package(Gperftools REQUIRED)
include_directories(${GPERFTOOLS_INCLUDE_DIR})
add_definitions(-DWITH_GPERFTOOLS)
add_definitions(-DWITH_PERFTOOLS)
endif ()

IF (UNIX AND NOT APPLE)
7 changes: 4 additions & 3 deletions Dockerfile_ci
@@ -1,14 +1,15 @@
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04

RUN apt-get update && \
apt-get install -y curl git wget bzip2 build-essential ninja-build g++ && rm -rf /var/lib/apt/lists/*
apt-get install -y curl git wget bzip2 build-essential ninja-build g++ gfortran && rm -rf /var/lib/apt/lists/*

ENV PATH=/opt/miniconda3/bin:${PATH} CONDA_PREFIX=/opt/miniconda3
RUN curl -LO http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
bash Miniconda3-latest-Linux-x86_64.sh -p /opt/miniconda3 -b && \
rm Miniconda3-latest-Linux-x86_64.sh && \
conda update -y conda && \
conda install pytorch==1.4.0 cudatoolkit=10.0 && \
conda install pytorch==1.5.0 cudatoolkit=10.0 && \
pip install OpenNMT-py && \
conda install curl conda-verify conda-build mkl-include cmake -c anaconda && \
conda install git git-lfs docopt -c conda-forge && \
conda clean -afy
83 changes: 29 additions & 54 deletions README.md
@@ -1,12 +1,14 @@
## turbo_transformers: a fast and user-friendly tool for transformer inference on CPU and GPU
![logo](./images/logo.jpeg)

### **make transformers serving fast by adding a turbo to your inference engine!**
<center>**make transformers serving fast by adding a turbo to your inference engine!**</center>

### Background
Transformer is the most critical algorithm innovation in the NLP field in recent years. It brings higher model accuracy while introducing more computation. The efficient deployment of online Transformer-based services faces enormous challenges. In order to make the costly Transformer online service more efficient, WeChat AI open-sourced a Transformer inference acceleration tool called TurboTransformers, which has the following characteristics.
1. Excellent CPU / GPU performance. For Intel multi-core CPU and NVIDIA GPU hardware platforms, TurboTransformers can fully utilize all levels of computing power of the hardware. It has achieved better performance over pytorch / tensorflow and current mainstream optimization engines (such as onnxruntime-mkldnn / onnxruntime-gpu, torch JIT, NVIDIA faster transformers) on a variety of CPU and GPU hardware. See the detailed benchmark results below.
2. Tailored to the characteristics of NLP inference tasks. Unlike the CV task, the input dimensions of the NLP inference task always change. The traditional approach is zero padding or truncation to a fixed length, which introduces additional zero padding computational overhead. Besides, some frameworks such as onnxruntime, tensorRT, and torchlib need to preprocess the compuatation-graph according to the input size in advance for the best performance, which is not suitable for NLP tasks with varying sizes. TurboTransformers can support variable-length input sequence processing without preprocessing.
3. A simpler method of use. TurboTransformers supports python and C ++ interface for calling. It can be used as an acceleration plug-in for pytorch. In the Transformer task, the end-to-end acceleration effect obtained by adding a few lines of python code.
1. Supporting both Transformers Encoder and Decoder.
2. Excellent CPU / GPU performance. For Intel multi-core CPU and NVIDIA GPU hardware platforms, TurboTransformers can fully utilize all levels of computing power of the hardware. It achieves better performance than pytorch / tensorflow and current mainstream optimization engines (such as onnxruntime-mkldnn / onnxruntime-gpu, torch JIT, NVIDIA faster transformers) on a variety of CPU and GPU hardware. See the detailed benchmark results below.
3. Tailored to the characteristics of NLP inference tasks. Unlike CV tasks, the input dimensions of NLP inference tasks always change. The traditional approach is zero padding or truncation to a fixed length, which introduces additional zero-padding computational overhead. Besides, some frameworks such as onnxruntime, tensorRT, and torchlib need to preprocess the computation graph according to the input size in advance for best performance, which is not suitable for NLP tasks with varying sizes. TurboTransformers supports variable-length input sequence processing without preprocessing.
4. A simpler method of use. TurboTransformers provides python and C++ calling interfaces. It can be used as an acceleration plug-in for pytorch. In a Transformer task, the end-to-end acceleration effect can be obtained by adding a few lines of python code.

TurboTransformers has been applied to multiple online BERT service scenarios in Tencent. For example, It brings 1.88x acceleration to the WeChat FAQ service, 2.11x acceleration to the public cloud sentiment analysis service, and 13.6x acceleration to the QQ recommendation system.

@@ -45,7 +47,7 @@ sh tools/build_and_run_unittests.sh $PWD -DWITH_GPU=OFF
# set(BLAS_PROVIDER "mkl" CACHE STRING "Set the blas provider library, in [openblas, mkl, blis]")
```
Method 2I do not want to unitest
Method 2: I do not want to run unit tests
```
cd /workspace
mkdir -p build && cd build
@@ -67,7 +69,7 @@ sh tool/build_conda_package.sh

*We also prepared a docker image containing CPU version of TurboTransformers, as well as other related works, i.e. onnxrt v1.2.0 and pytorch-jit on dockerhub*
```
docker pull thufeifeibear/turbo_transformers:0.2.0-release-cpu-dev
docker pull thufeifeibear/turbo_transformers:0.3.0-cpu-dev
```
### Installation on GPU
```
@@ -77,8 +79,8 @@ git clone https://github.com/Tencent/TurboTransformers --recursive
```
# You can modify the environment variables in the script to specify the cuda version and operating system version
sh tools/build_docker_gpu.sh $PWD
docker run --gpus all --net=host --rm -it -v $PWD:/workspace -v /etc/passwd:/etc/passwd --name=your_container_name REPOSITORY:TAG
# for example: docker run --gpus all --net=host --rm -it -v $PWD:/workspace -v /etc/passwd:/etc/passwd --name=jiarui_gpu_env ccr.ccs.tencentyun.com/mmspr/turbo_transformers:0.1.1-cuda9.0-ubuntu16.04-gpu-dev
nvidia-docker run --gpus all --net=host --rm -it -v $PWD:/workspace -v /etc/passwd:/etc/passwd --name=your_container_name REPOSITORY:TAG
# for example: nvidia-docker run --gpus all --net=host --rm -it -v $PWD:/workspace -v /etc/passwd:/etc/passwd --name=jiarui_gpu_env ccr.ccs.tencentyun.com/mmspr/turbo_transformers:0.1.1-cuda9.0-ubuntu16.04-gpu-dev
```

2. Install pip package in docker and unitest test
@@ -92,68 +94,42 @@ sh tools/build_and_run_unittests.sh $PWD -DWITH_GPU=ON
cd benchmark
bash gpu_run_benchmark.sh
```
*We also prepared a docker image containing GPU version of TurboTransformers.
We also prepared a docker image containing the GPU version of TurboTransformers.
```
docker pull thufeifeibear/turbo_transformers:0.2.0-cuda10.0-cudnn7-devel-ubuntu18.04-gpu-release
docker pull thufeifeibear/turbo_transformers:0.3.0-cuda10.0-cudnn7-devel-ubuntu18.04-gpu-dev
```

### Usage
turbo_transformers provides C ++ / python API interfaces. we hope to do our best to adapt to a variety of online environments to reduce the difficulty of development for users.
TurboTransformers provides C++ / python API interfaces. We hope to adapt to a variety of online environments and reduce the difficulty of development for users.

The first step in using turbo is to load a pre-trained model. We provide a way to load pytorch and tensorflow pre-trained models from [huggingface/transformers](https://github.com/huggingface).
The specific conversion method is to use the corresponding script in ./tools to convert the pre-trained model into an npz format file; turbo then loads the npz model through its C++ or python interface.
In particular, since most pre-trained models are in pytorch format and used from python, we provide a shortcut for loading a pytorch saved model directly in python.

<img width="700" height="150" src="./images/pretrainmodelload.jpg" alt="加载预训练模型">
<img width="700" height="150" src="./images/pretrainmodelload.jpg" alt="pretrained">

#### python APIs
#### Bert Examples
##### python APIs
Refer to examples in [./example/python](./example/python "python").
Since users of BERT acceleration usually need a customized post-processing step for their task, we provide an example of how to write a sequence classification application.
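
As an illustration only (the shipped example is authoritative), a sequence classification application might keep the task-specific head in pytorch and let turbo accelerate the encoder. The sketch reuses `turbo_model` from the loading snippet above; the assumption that the model returns `(sequence_output, pooled_output)` follows the huggingface convention and may differ from the actual return values.
```
# Illustrative sketch: a custom classification head on top of the turbo encoder.
import torch

NUM_LABELS = 2  # assumption: a binary classification task
classifier = torch.nn.Linear(768, NUM_LABELS)  # 768 = bert-base hidden size

def classify(input_ids):
    # Assumption: turbo's BertModel mirrors huggingface and returns
    # (sequence_output, pooled_output).
    sequence_output, pooled_output = turbo_model(input_ids)
    return torch.softmax(classifier(pooled_output), dim=-1)

print(classify(torch.tensor([[101, 7592, 2088, 102]])))
```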
#### C++ APIs
##### C++ APIs
Refer to [./example/cpp](./example/cpp "C ++") for an example.
Our example provides the GPU and two CPU multi-thread calling methods. One is to do one BERT inference using multiple threads; the other is to do multiple BERT inferences, each using one thread.
Users can link turbo-transformers into their own code through add_subdirectory.

## Performance
### CPU
We tested the performance of TurboTransformers on three CPU hardware platforms.
We choose [pytorch](https://github.com/huggingface "pytorch"), [pytorch-jit](https://pytorch.org/docs/stable/_modules/torch/jit.html "pytorch-jit" ) and [onnxruntime-mkldnn](https://github.com/microsoft/onnxruntime "onnxruntime-mkldnn") and TensorRT implementation as a comparison. The performance test result is the average of 150 iterations. In order to avoid the phenomenon that the data of the last iteration is cached in the cache during multiple tests, each test uses random data and refreshes the cache data after calculation.
* Intel Xeon 61xx

<img width="900" height="300" src="./images/61xx_perf_thd48_0415.jpg" alt="61xx性能">
<img width="900" height="300" src="./images/61xx_speedup_thd48_0415.jpg" alt="61xx加速">

* Intel Xeon 6133
Compared to the 61xx model, Intel Xeon 6133 has a longer vectorized length of 512 bits, and it has a 30 MB shared L3 cache between cores.

<img width="900" height="300" src="./images/6133_perf_thd48_0415.jpg" alt="6133性能">
<img width="900" height="300" src="./images/6133_speedup_thd48_0415.jpg" alt="6133加速">

### GPU
We tested the performance of turbo_transformers on four GPU hardware platforms.
We choose [pytorch](https://github.com/huggingface "pytorch"), [NVIDIA Faster Transformers](https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer "FasterTransformer"), [onnxruntime-gpu](https://github.com/microsoft/onnxruntime "onnxrt-gpu") and [TensorRT](https://github.com/NVIDIA/TensorRT/tree/release/6.0/demo/BERT) implementation as a comparison. The performance test result is the average of 150 iterations.

* RTX 2060
<img width="900" height="300" src="./images/2060-perf.jpg" alt="2060性能">
<img width="900" height="300" src="./images/2060-speedup.jpg" alt="2060加速">
#### Decoder Examples
[TurboNLP/Translate-Demo](https://github.com/TurboNLP/Translate-Demo "translate") shows a demo of applying TurboTransformers to a translation task.

* Tesla V100

<img width="900" height="300" src="./images/v100-perf.jpg" alt="V100性能">
<img width="900" height="300" src="./images/V100-speedup.jpg" alt="V100加速">

* Tesla P40

<img width="900" height="300" src="./images/p40-perf.jpg" alt="P40性能">
<img width="900" height="300" src="./images/p40-speedup.jpg" alt="P40加速">
## Performance
[BERT Benchmark Results](./docs/bert.md)

* Tesla M40
[Transformer Decoder Results](./docs/decoder.md)

<img width="900" height="300" src="./images/M40-perf-0302.jpg" alt="M40性能">
<img width="900" height="300" src="./images/M40-speedup-0302.jpg" alt="M40加速">
[How to know hotspots of your code](./docs/profiler.md)

## TODO
Currently (April 2020), we only support a interface of the BERT encoder model using FP32. In the near futuer, we will add support for other models (GPT2, decoders, etc.) and low-precision floating point (CPU int8, GPU FP16).
Currently (June 2020), we plan to add support for other models (Albert [work in progress], GPT2) and low-precision floating point (CPU int8, GPU FP16) in the near future.
**Looking forwards to your contribution!**

## License
BSD 3-Clause License
@@ -162,11 +138,10 @@ BSD 3-Clause License
1. The results of Turbo Transformers may differ from the results of PyTorch after 2 digits behind the decimal point.
The diff mainly comes from the Bert Output Layer. We use an approximate GELU algorithm, which may differ from PyTorch's (see the sketch after this list).

2. On AuthenticAMD CPU, member function `from_torch` of class `BertModelWithPooler` and `BertModel` does not support PyTorch version as 1.5.0.
In our opinion, the tensor transpose API of PyTorch is not stable. We use the following way to transpose weight matrices.
```
weight = torch.clone(torch.t(pooler_params['dense.weight']))
```
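
For reference, the tanh-based GELU approximation below is a common choice; whether turbo uses exactly this form is an assumption, but it illustrates why the last decimal places can drift relative to PyTorch's exact erf-based GELU.
```
# Sketch of a tanh-based GELU approximation (assumed, for illustration only);
# torch.nn.functional.gelu defaults to the exact erf formulation.
import math
import torch

def gelu_approx(x: torch.Tensor) -> torch.Tensor:
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))

x = torch.linspace(-3, 3, 7)
print(torch.nn.functional.gelu(x) - gelu_approx(x))  # differences appear in the low decimal places
```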
### History
1. April 2020 v0.0.1, TurboTransformers released, achieving state-of-the-art BERT inference speed on CPU/GPU.
2. June 2020 v0.2.1, TurboTransformers added BLIS as a BLAS option, giving better performance on AMD CPUs.
3. June 2020 v0.3.0, TurboTransformers added support for the Transformer Decoder on CPU/GPU.

## Contact us
Although we recommend posting your problem via github issues, you can also join our Turbo user group.
1 change: 1 addition & 0 deletions benchmark/run_gpu_benchmark.sh
@@ -18,6 +18,7 @@ FRAMEWORKS=("turbo-transformers" "torch")
# FRAMEWORKS=("onnxruntime")
SEQ_LEN=(10 20 40 60 80 100 200 300 400 500)
BATCH_SIZE=(1 20)

N=150
MODEL="bert-base-chinese"
for batch_size in ${BATCH_SIZE[*]}
88 changes: 66 additions & 22 deletions benchmark/turbo_transformers/layers/kernels/matmul_benchmark.cpp
@@ -27,7 +27,7 @@ using layers::kernels::common::FillRandom;
static void MatmulBenchmarkHelper(DLDeviceType device_type, bool trans_weight,
std::initializer_list<int64_t> weight_shape,
std::vector<int64_t> m_list) {
constexpr int n_step = 1000;
constexpr int n_step = 100;
const std::string device_name = device_type == kDLCPU ? "CPU" : "GPU";
const std::string trans_name = trans_weight ? "Tran" : "NoTrans";

@@ -61,27 +61,15 @@ static void MatmulBenchmarkHelper(DLDeviceType device_type, bool trans_weight,
ss << device_name << " " << trans_name << " MatMul " << m << ", " << k
<< ", " << n << " ";
auto g_flops = m * n * k * 2 / 1e9;

if (device_type == kDLGPU) {
#ifdef TT_WITH_CUDA
auto flops = benchmark::TestFuncSpeed(
[&]() {
layers::kernels::MatMul(input_tensor, false, weight_tensor,
trans_weight, 1.0, &output_tensor, 0.0);
},
n_step, ss.str(), g_flops, device_type);

std::cout << ss.str() << " flops: " << flops << std::endl;
#endif
} else {
benchmark::TestFuncSpeed(
[&]() {
layers::kernels::MatMul(input_tensor, false, weight_tensor,
trans_weight, 1.0, &output_tensor, 0.0);
},
n_step, ss.str(), g_flops, device_type);
}
}
auto flops = benchmark::TestFuncSpeed(
[&]() {
layers::kernels::MatMul(input_tensor, false, weight_tensor,
trans_weight, 1.0, &output_tensor, 0.0);
},
n_step, ss.str(), g_flops, device_type);

std::cout << ss.str() << " flops: " << flops << std::endl;
} // for
}

TEST_CASE("matmal-cpu-benchmark") {
@@ -93,6 +81,62 @@ TEST_CASE("matmal-cpu-benchmark") {
std::cout << std::endl;
}

static void MatmulBenchmarkGeneralHelper(DLDeviceType device_type,
bool trans_weight,
std::vector<int64_t> dim_list) {
constexpr int n_step = 1000;
const std::string device_name = device_type == kDLCPU ? "CPU" : "GPU";
const std::string trans_name = trans_weight ? "Trans" : "NoTrans";

for (auto m : dim_list) {
std::initializer_list<int64_t> input_shape{m, m};
std::initializer_list<int64_t> weight_shape{m, m};
std::initializer_list<int64_t> output_shape{m, m};

using turbo_transformers::core::NewDLPackTensorT;

core::Tensor input_tensor(
NewDLPackTensorT<float>(input_shape, device_type, 0));
FillRandom<float>(input_tensor);

core::Tensor weight_tensor(
NewDLPackTensorT<float>(weight_shape, device_type, 0));
FillRandom<float>(weight_tensor);

core::Tensor output_tensor(
NewDLPackTensorT<float>(output_shape, device_type, 0));
FillRandom<float>(output_tensor);

std::stringstream ss;
ss << device_name << " " << trans_name << " MatMul " << m << ", " << m
<< ", " << m << " ";
auto g_flops = m * m * m * 2 / 1e9;
auto flops = benchmark::TestFuncSpeed(
[&]() {
layers::kernels::MatMul(input_tensor, false, weight_tensor,
trans_weight, 1.0, &output_tensor, 0.0);
},
n_step, ss.str(), g_flops, device_type);

std::cout << ss.str() << " flops: " << flops << std::endl;
} // for
}

TEST_CASE("matmal-cpu-benchmark-general") {
#if defined(TT_BLAS_USE_MKL)
std::cout << "blas uses MKL" << std::endl;
#elif defined(TT_BLAS_USE_OPENBLAS)
std::cout << "blas uses OpenBLAS" << std::endl;
#elif defined(TT_BLAS_USE_BLIS)
std::cout << "blas uses BLIS" << std::endl;
#endif
std::cout << "=================================" << std::endl;
std::cout << "CPU General MatMul Benchmark" << std::endl;
std::vector<int64_t> dim_list{10, 50, 100, 500, 1000, 1500, 2000};
MatmulBenchmarkGeneralHelper(kDLCPU, false, dim_list);
std::cout << std::endl;
}

#ifdef TT_WITH_CUDA

TEST_CASE("matmal-gpu-gemm7-benchmark") {
2 changes: 1 addition & 1 deletion cmake/cuda.cmake
@@ -25,4 +25,4 @@ foreach(X ${ARCH_FLAGS})
endforeach()

message(STATUS "Generating CUDA code for ${CUDA_VERSION} SMs: ${CUDA_FLAGS}")
set(CMAKE_CUDA_FLAGS "${CUDA_FLAGS} -Xcompiler -Wall -std=c++11 --expt-relaxed-constexpr --use_fast_math --expt-extended-lambda")
set(CMAKE_CUDA_FLAGS "${CUDA_FLAGS} -Xcompiler -Wall --expt-relaxed-constexpr --use_fast_math --expt-extended-lambda")