Paddle-Lite OpenCL Backend: Overall Architecture #53

Open
ysh329 opened this issue Feb 22, 2021 · 2 comments

ysh329 commented Feb 22, 2021

The Paddle-Lite OpenCL backend consists of the following 4 main parts:

image

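  1. CLWrapper (static): locates the OpenCL dynamic library on the device, checks its symbols, and creates function pointers for the OpenCL API;
  2. CLRuntime (static): initializes the OpenCL platform and device, and creates the context, the command queue, and cl::Program;
  3. CLContext: creates cl::Kernel objects and sets LocalWorkSize, GlobalWorkSize, AutoTune, etc.;
  4. TargetWrapper: wraps some of CLWrapper's OpenCL API functions at the framework level, such as Malloc and Free for Image2D and Buffer.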

Other minor pieces:

  1. cl_image_convertor: converts NCHW data to the Image2D layout on the CPU, in preparation for uploading it to the GPU;
  2. cl_utility: wraps EnqueueNDRangeKernel, mutable_data, etc.;
  3. opencl_kernel_sources: contains the CL kernel code.
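
The CPU-side conversion done by cl_image_convertor can be sketched roughly as follows. This is a minimal illustration, not the Paddle-Lite implementation: the helper name `NCHWToImage2D` is hypothetical, and the pixel mapping (four consecutive channels packed into one RGBA pixel, image width `W * ceil(C / 4)`, height `N * H`) is an assumption based on the Image2D dimension formula given later in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the Paddle-Lite API): packs an NCHW float tensor
// into an RGBA image of width W * ceil(C / 4) and height N * H.
// Assumed layout: pixel (x, y) = ((c / 4) * W + w, n * H + h) stores channel c
// in component c % 4; components past the last channel stay zero.
std::vector<float> NCHWToImage2D(const std::vector<float>& nchw,
                                 std::size_t N, std::size_t C,
                                 std::size_t H, std::size_t W) {
  assert(nchw.size() == N * C * H * W);
  const std::size_t width = W * ((C + 3) / 4);
  const std::size_t height = N * H;
  std::vector<float> image(width * height * 4, 0.0f);  // 4 floats per pixel
  std::size_t src = 0;  // walks the NCHW buffer in storage order
  for (std::size_t n = 0; n < N; ++n) {
    for (std::size_t c = 0; c < C; ++c) {
      for (std::size_t h = 0; h < H; ++h) {
        for (std::size_t w = 0; w < W; ++w) {
          const std::size_t x = (c / 4) * W + w;  // channel block, then column
          const std::size_t y = n * H + h;        // batch, then row
          image[(y * width + x) * 4 + c % 4] = nchw[src++];
        }
      }
    }
  }
  return image;
}
```

Under these assumptions, a 1x5x1x2 tensor becomes a 4x1 image: two pixels hold channels 0 to 3, and two more hold channel 4 with the remaining components zero-padded.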

ysh329 commented Feb 22, 2021

OpenCL model conversion

image

OpenCL Predictor creation

image

OpenCL memory management

image

AutoTune

image

cl::NDRange CLContext::DefaultLocalWorkSize(
    const cl::NDRange &gws,
    register size_t max_ws,
    const int &divisor /*=2*/,
    const bool &reverse /*=false*/,
    const size_t &user_def_max_ws /*=0*/) {
  register size_t lx = reverse ? gws[2] : gws[0];
  register size_t ly = gws[1];
  register size_t lz = reverse ? gws[0] : gws[2];

  max_ws = (user_def_max_ws > 0 && user_def_max_ws <= max_ws) ? user_def_max_ws
                                                              : max_ws;
  max_ws = divisor > 1 ? max_ws / divisor : max_ws;

  if (max_ws > 0) {
    while (ly > max_ws) {
      // bit test instead of % 2: an odd value collapses to 1, an even value is halved
      ly = (ly & 0x01) ? 1 : ly >> 1;
    }
    while (ly * lz > max_ws) {
      lz = (lz & 0x01) ? 1 : lz >> 1;
    }
    while (ly * lz * lx > max_ws) {
      lx = (lx & 0x01) ? 1 : lx >> 1;
    }
  }

  return reverse ? cl::NDRange{lz, ly, lx} : cl::NDRange{lx, ly, lz};
}
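
To see what the shrinking loops above produce, here is a standalone sketch of the same logic that compiles without the OpenCL headers. `Range3` and `DefaultLocalWorkSizeSketch` are hypothetical stand-ins introduced only for illustration (`Range3` replaces `cl::NDRange`), and the `register` qualifiers are dropped because C++17 no longer accepts them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for cl::NDRange so the sketch needs no OpenCL headers (assumption).
using Range3 = std::array<std::size_t, 3>;

// Same shrinking strategy as the function above: clamp y first, then z,
// then x, until the product of the three fits into the work-group limit.
Range3 DefaultLocalWorkSizeSketch(const Range3& gws, std::size_t max_ws,
                                  int divisor = 2, bool reverse = false,
                                  std::size_t user_def_max_ws = 0) {
  std::size_t lx = reverse ? gws[2] : gws[0];
  std::size_t ly = gws[1];
  std::size_t lz = reverse ? gws[0] : gws[2];

  // A user-supplied limit only applies if it is tighter than the device limit.
  if (user_def_max_ws > 0 && user_def_max_ws <= max_ws) max_ws = user_def_max_ws;
  if (divisor > 1) max_ws /= static_cast<std::size_t>(divisor);

  if (max_ws > 0) {
    while (ly > max_ws) ly = (ly & 0x01) ? 1 : ly >> 1;
    while (ly * lz > max_ws) lz = (lz & 0x01) ? 1 : lz >> 1;
    while (ly * lz * lx > max_ws) lx = (lx & 0x01) ? 1 : lx >> 1;
  }
  // Whenever a limit is in force, the chosen work-group fits inside it.
  assert(max_ws == 0 || lx * ly * lz <= max_ws);

  return reverse ? Range3{lz, ly, lx} : Range3{lx, ly, lz};
}
```

With a global size of {64, 32, 8}, a device limit of 256 and the default divisor of 2, the effective limit is 128: y stays at 32, z is halved from 8 to 4, and x is halved all the way down to 1, giving {1, 32, 4}. Because even values are only ever halved and odd values collapse to 1, each local size remains a divisor of the corresponding global size.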

Precision-setting API architecture

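Background: this API was added for two reasons. First, OpenCL devices on x86 macOS and x86 Linux support FP32 computation. Second, it makes it possible to verify that a given set of OpenCL kernel code computes correct results, so that a mismatch can be attributed to precision (FP16 does suffer from accumulated systematic error) rather than to a bug in the kernel code.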

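The precision API provides 3 precision settings:

  1. CL_PRECISION_AUTO: the default; selects the precision automatically, prioritizing performance and therefore choosing the lower-precision FP16;
  2. CL_PRECISION_FP16: forces FP16 execution; if the device does not support FP16, it reports an error, aborts, and exits;
  3. CL_PRECISION_FP32: forces FP32 execution, with the same behavior as above.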

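This is similar to AutoTune, except that the implementation lives in CLRuntime, because the precision has to be passed into the OpenCL kernels as a build option. Inside the kernels, the precision is expressed with macros: CL_DTYPE_float means that CL_DTYPE is defined as float and computation is done in float; CL_DTYPE_half works the same way.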
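
How a precision setting might be turned into a kernel build option can be sketched as below. This is only an illustration under stated assumptions: the enum `CLPrecision`, the function `PrecisionBuildOption`, the `-D` option strings, and the FP32 fallback of the automatic mode on devices without FP16 are all hypothetical, and the sketch throws an exception where the text above describes an abort.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical mirror of CL_PRECISION_AUTO / CL_PRECISION_FP16 / CL_PRECISION_FP32.
enum class CLPrecision { kAuto, kFP16, kFP32 };

// Hypothetical helper: maps a precision setting to the macro definition that
// would be passed to the OpenCL compiler as a build option.
std::string PrecisionBuildOption(CLPrecision precision,
                                 bool device_supports_fp16) {
  switch (precision) {
    case CLPrecision::kAuto:
      // Prefer FP16 for performance; falling back to FP32 is an assumption.
      return device_supports_fp16 ? "-DCL_DTYPE_half" : "-DCL_DTYPE_float";
    case CLPrecision::kFP16:
      if (!device_supports_fp16) {
        // The text describes an abort here; an exception keeps the sketch testable.
        throw std::runtime_error("FP16 was forced but the device lacks FP16 support");
      }
      return "-DCL_DTYPE_half";
    case CLPrecision::kFP32:
      return "-DCL_DTYPE_float";
  }
  assert(false && "unhandled CLPrecision value");
  return "";
}
```

A kernel source could then test `CL_DTYPE_float` or `CL_DTYPE_half` with `#ifdef` and define `CL_DTYPE` accordingly, so that one set of kernel code serves both precisions.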

ysh329 changed the title from "Paddle-Lite OpenCL AutoTune" to "Paddle-Lite OpenCL Backend: Overall Architecture" on Feb 22, 2021

ysh329 commented Feb 23, 2021

Image2D data layout

image

static std::map<std::string, size_t> InitImageDimInfoWith(
    const DDim& tensor_dim) {
  size_t new_dims[] = {1, 1, 1, 1};
  for (size_t j = 0; j < tensor_dim.size(); ++j) {
    new_dims[4 - tensor_dim.size() + j] = tensor_dim[j];
  }
  size_t N, C, H, W;
  N = new_dims[0];
  C = new_dims[1];
  H = new_dims[2];
  W = new_dims[3];
  size_t width = W * ((C + 3) / 4);
  size_t height = H * N;
  return std::map<std::string, size_t>({{"width", width}, {"height", height}});
}

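If the tensor has fewer than 4 dimensions, for example only 2, the missing leading dimensions are filled with 1, so a 2-D shape {a, b} is treated as {1, 1, a, b}.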
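
A few concrete shapes make the padding rule and the width/height formula easier to check. The sketch below repeats the computation with a plain `std::vector` standing in for `DDim`; the function name `ImageWidthHeight` and the rejection of shapes with more than 4 dimensions are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for InitImageDimInfoWith: takes a plain vector of
// extents instead of DDim and returns {width, height} of the Image2D.
std::pair<std::size_t, std::size_t> ImageWidthHeight(
    const std::vector<std::size_t>& dims) {
  // Assumption: more than 4 dimensions is rejected, since the index
  // 4 - dims.size() would otherwise wrap around.
  assert(dims.size() <= 4);
  std::size_t padded[4] = {1, 1, 1, 1};
  for (std::size_t j = 0; j < dims.size(); ++j) {
    padded[4 - dims.size() + j] = dims[j];  // right-align, leading 1s remain
  }
  const std::size_t N = padded[0], C = padded[1], H = padded[2], W = padded[3];
  const std::size_t width = W * ((C + 3) / 4);  // 4 channels per RGBA pixel
  const std::size_t height = H * N;
  return {width, height};
}
```

For example, the 2-D shape {2, 3} is padded to {1, 1, 2, 3} and maps to a 3x2 image, while the 4-D shape {1, 6, 4, 5} needs two channel blocks and maps to a 10x4 image.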
