Optimize host kernel invoking performance #128

ueqri · 2022-12-08T01:21:59Z

Hi, in our experience with TAPA, tapa::invoke on the host side is quite time-consuming when we re-run a kernel multiple times, because each call involves two unnecessary steps: (1) re-programming the bitstream onto the FPGA, and (2) re-transferring data back and forth.

Specifically, for this workflow in host program:

1. Set kernel arguments
2. Write scalar and buffer to device
3. Execute kernel
4. Read partial results (not full buffer) from device
5. Re-set kernel arguments
6. Write only scalar to device
7. Re-execute kernel (w/ different arguments)
8. Repeat from (4) until exit condition is true
9. Read all outputs (full buffer) from device

We found no host-side API in the TAPA library that supports such a workflow, unless we (1) use the fpga-runtime library directly, or (2) use the XRT/OpenCL APIs manually without the above wrappers (which is tricky and may interfere with other parts such as TAPA simulation).

So I was wondering whether the above workflow can be implemented with TAPA today, or whether there is any plan to support fine-grained kernel invocation in the future. Thank you!

(PS: in addition to implementing this workflow on the host side, we have considered an on-board scheduler, but we believe it may be detrimental to frequency and area since our kernel is already large.)


Blaok · 2022-12-08T04:34:55Z

I would recommend using the FPGA runtime library unless by "read partial results" you mean reading only part of a single buffer, which can be achieved only with the XRT/OpenCL APIs. You can use fpga::Instance::SuspendBuf to skip a buffer during WriteToDevice and ReadFromDevice. To re-enable a buffer for data transfer, simply call SetArg again.

Example:

// This is the kernel function.
void Kernel(int scalar, tapa::mmap<float> buf1, tapa::mmap<float> buf2);

// Make sure to 4k-align the host memory to avoid additional copy.
template <typename T>                                              
using aligned_vector = std::vector<T, tapa::aligned_allocator<T>>; 

...
int main() {
  
  // Load the bitstream and acquire the FPGA resource.
  fpga::Instance instance("/path/to/bitstream");

  // Prepare the arguments.
  int scalar_arg = 42;  // Make sure the scalar arguments have the correct size.
  aligned_vector<float> buf1_vec, buf2_vec;
  // Note that fpga::ReadOnly corresponds to tapa::write_only_mmap and
  // fpga::WriteOnly corresponds to tapa::read_only_mmap. This is because
  // the FPGA runtime library is host-centric but TAPA is kernel-centric.
  auto buf1_arg = fpga::ReadWrite(buf1_vec.data(), buf1_vec.size());
  auto buf2_arg = fpga::ReadWrite(buf2_vec.data(), buf2_vec.size());

  // Set kernel arguments.
  instance.SetArgs(scalar_arg, buf1_arg, buf2_arg);
  /* This is equivalent to the following:
   *  instance.SetArg(0, scalar_arg);
   *  instance.SetArg(1, buf1_arg);
   *  instance.SetArg(2, buf2_arg);
   */

  // Write both buffers to device
  instance.WriteToDevice();

  for (;;) {
    // Execute kernel
    instance.Exec();

    // Wait until previous operations finish
    instance.Finish();
  
    // Read only buf1 from device (skip buf2)
    instance.SuspendBuf(2);
    instance.ReadFromDevice();

    if (...) break;
    // ...
  
    // Re-set kernel arguments (scalar arguments do not need explicit WriteToDevice)
    instance.SetArg(0, scalar_arg);
  
    // Re-execute kernel (w/ different arguments) in the next iteration
  }

  // Read all outputs (full buffer) from device
  instance.SetArg(2, buf2_arg);
  instance.ReadFromDevice();
  instance.Finish();

  // ...
}
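The example above depends on the host buffers being 4 KiB-aligned. As a rough illustration of what such an allocator does, here is a minimal sketch assuming C++17 aligned `operator new`. `PageAlignedAllocator`, `PageAlignedVector`, and `IsPageAligned` are hypothetical names made up for this sketch; this is not how `tapa::aligned_allocator` is implemented, and in real code you should keep using the library's allocator.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical stand-in for tapa::aligned_allocator (illustration only).
// Every allocation starts on an `Alignment`-byte boundary (4 KiB by default).
template <typename T, std::size_t Alignment = 4096>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U, Alignment>&) noexcept {}

  // An explicit rebind is required because of the non-type template parameter.
  template <typename U>
  struct rebind {
    using other = PageAlignedAllocator<U, Alignment>;
  };

  T* allocate(std::size_t n) {
    // C++17 aligned allocation; throws std::bad_alloc on failure.
    return static_cast<T*>(
        ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }

  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, std::align_val_t(Alignment));
  }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const PageAlignedAllocator<T, A>&,
                const PageAlignedAllocator<U, A>&) noexcept {
  return true;
}
template <typename T, typename U, std::size_t A>
bool operator!=(const PageAlignedAllocator<T, A>&,
                const PageAlignedAllocator<U, A>&) noexcept {
  return false;
}

template <typename T>
using PageAlignedVector = std::vector<T, PageAlignedAllocator<T>>;

// Returns true if `p` sits on an `alignment`-byte boundary.
inline bool IsPageAligned(const void* p, std::size_t alignment = 4096) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With this sketch, `PageAlignedVector<float> v(1024);` gives a vector whose `v.data()` passes `IsPageAligned`, which is the property the runtime needs to avoid an extra host-side copy.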

ueqri · 2022-12-09T00:47:22Z

Thanks for your kind reply and example! It really helps our development.

BTW, is there any way to use the TAPA-style reinterpret method, like tapa::read_only_mmap<T0>(buf).reinterpret<T1>(), with the fpga-runtime API? If so, could you point me to the best practice? I didn't find any hints in the library codebase. Thank you!

Blaok · 2022-12-11T00:45:19Z

It is not available in the FPGA runtime library today. I'm happy to port it there, but I cannot promise any timeline. If you need it urgently (in 1 week or so), fpga::ReadOnly(reinterpret_cast<T1*>(buf.data()), buf.size() * sizeof(T0) / sizeof(T1)) would work, assuming 1) buf.data() is properly aligned, and 2) buf.size() * sizeof(T0) can be evenly divided by sizeof(T1).
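The two assumptions in this workaround can be checked at run time before the pointer is handed to the runtime. The following is a minimal sketch, not part of the FPGA runtime library: `ReinterpretBuffer` and `Float4` are hypothetical names, and the alignment check only covers `alignof(T1)` (4 KiB alignment of the host memory is a separate concern).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical wide element type, used only to illustrate the cast.
struct Float4 {
  float v[4];
};

// Hypothetical helper: views `size` elements of T0 as an array of T1.
// Returns the cast pointer and the element count in units of T1, or throws
// if either assumption from the comment above does not hold.
template <typename T1, typename T0>
std::pair<T1*, std::size_t> ReinterpretBuffer(T0* data, std::size_t size) {
  const std::size_t bytes = size * sizeof(T0);
  // Assumption 1: the pointer is suitably aligned for T1.
  if (reinterpret_cast<std::uintptr_t>(data) % alignof(T1) != 0) {
    throw std::invalid_argument("buffer is not aligned for the target type");
  }
  // Assumption 2: the byte size is evenly divisible by sizeof(T1).
  if (bytes % sizeof(T1) != 0) {
    throw std::invalid_argument("byte size is not a multiple of sizeof(T1)");
  }
  return {reinterpret_cast<T1*>(data), bytes / sizeof(T1)};
}
```

For a `std::vector<float> buf` with 8 elements, `ReinterpretBuffer<Float4>(buf.data(), buf.size())` yields the same address and a count of 2. The resulting pointer and count are what would be passed to `fpga::ReadOnly` in the workaround above.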

Blaok · 2022-12-13T06:03:45Z

@ueqri FYI I added a Reinterpret method. Please upgrade to libfrt-dev 0.0.20221212.1 to use it.

linghaosong · 2023-08-31T16:23:16Z

> It is not available in the FPGA runtime library today. I'm happy to port it there, but I cannot promise any timeline. If you need it urgently (in 1 week or so), fpga::ReadOnly(reinterpret_cast<T1*>(buf.data()), buf.size() * sizeof(T0) / sizeof(T1)) would work, assuming 1) buf.data() is properly aligned, and 2) buf.size() * sizeof(T0) can be evenly divided by sizeof(T1).

The official Xilinx examples don't explain this clearly, which is a real trap. Reading this made me realize the cause of an earlier bug of mine...

linghaosong · 2023-08-31T19:27:18Z

BTW, I am posting here how to use XCL to cast pointers between the host and the kernel, for the reference of anyone who has a similar issue and happens to land here.

You need to do cl::Buffer(context, CL_MEM_XXXX | CL_XXXXX, arr.size() * sizeof(T_data_type_HOST), reinterpret_cast<T_data_type_KERNEL*>(arr.data()), &err);

Be careful with the size you pass to cl::Buffer: it is the size of arr in bytes, not the number of elements.

dotkrnl added the enhancement New feature or request label Sep 19, 2024

dotkrnl changed the title ~~Inquire about host kernel invoking~~ Optimize host kernel invoking performance Sep 19, 2024
