Optimize host kernel invoking performance #128

Open
ueqri opened this issue Dec 8, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@ueqri

ueqri commented Dec 8, 2022

Hi, in our experience with TAPA, calling tapa::invoke on the host side is quite time-consuming when we want to re-run a kernel multiple times, because each call involves two unnecessary steps: (1) re-programming the bitstream to the FPGA, and (2) re-transferring data back and forth.

Specifically, consider this workflow in the host program:

  1. Set kernel arguments
  2. Write scalar and buffer to device
  3. Execute kernel
  4. Read partial results (not full buffer) from device
  5. Re-set kernel arguments
  6. Write only scalar to device
  7. Re-execute kernel (w/ different arguments)
  8. Repeat from step 4 until the exit condition is true
  9. Read all outputs (full buffer) from device

We found that the TAPA library offers no host-side API for such a workflow, unless we (1) use the fpga-runtime library directly, like this, or (2) use the XRT/OpenCL APIs manually without the above wrappers (which is tricky and may interfere with other parts, such as TAPA simulation).

So I was wondering whether the above workflow can be implemented with TAPA today, or whether there is any plan to support fine-grained kernel invocation in the future. Thank you!

(PS: besides implementing this workflow on the host side, we have considered an on-board scheduler, but we believe it may hurt frequency and area since our kernel is already large.)

@Blaok
Collaborator

Blaok commented Dec 8, 2022

I would recommend using the FPGA runtime library unless by "read partial results" you mean reading only part of a single buffer, which can be achieved only with the XRT/OpenCL APIs. You can use fpga::Instance::SuspendBuf to skip a buffer during WriteToDevice and ReadFromDevice. To re-enable a buffer for data transfer, simply call SetArg again.

Example:

// This is the kernel function.
void Kernel(int scalar, tapa::mmap<float> buf1, tapa::mmap<float> buf2);

// Make sure to 4k-align the host memory to avoid additional copy.
template <typename T>                                              
using aligned_vector = std::vector<T, tapa::aligned_allocator<T>>; 

...
int main() {
  
  // Load the bitstream and acquire the FPGA resource.
  fpga::Instance instance("/path/to/bitstream");

  // Prepare the arguments.
  int scalar_arg = 42;  // Make sure the scalar arguments have the correct size.
  aligned_vector<float> buf1_vec, buf2_vec;
  // Note that fpga::ReadOnly corresponds to tapa::write_only_mmap and
  // fpga::WriteOnly corresponds to tapa::read_only_mmap. This is because
  // the FPGA runtime library is host-centric but TAPA is kernel-centric.
  auto buf1_arg = fpga::ReadWrite(buf1_vec.data(), buf1_vec.size());
  auto buf2_arg = fpga::ReadWrite(buf2_vec.data(), buf2_vec.size());

  // Set kernel arguments.
  instance.SetArgs(scalar_arg, buf1_arg, buf2_arg);
  /* This is equivalent to the following:
   *  instance.SetArg(0, scalar_arg);
   *  instance.SetArg(1, buf1_arg);
   *  instance.SetArg(2, buf2_arg);
   */

  // Write both buffers to device
  instance.WriteToDevice();

  for (;;) {
    // Execute kernel
    instance.Exec();

    // Wait until previous operations finish
    instance.Finish();
  
    // Read only buf1 from device (skip buf2)
    instance.SuspendBuf(2);
    instance.ReadFromDevice();

    if (...) break;
    // ...
  
    // Re-set kernel arguments (scalar arguments do not need explicit WriteToDevice)
    instance.SetArg(0, scalar_arg);
  
    // Re-execute kernel (w/ different arguments) in the next iteration
  }

  // Read all outputs (full buffer) from device
  instance.SetArg(2, buf2_arg);
  instance.ReadFromDevice();
  instance.Finish();

  // ...
}

@ueqri
Author

ueqri commented Dec 9, 2022

Thanks for your kind reply and example! It really helps our development.

BTW, is there any way to use the TAPA-style reinterpret method, like tapa::read_only_mmap<T0>(buf).reinterpret<T1>(), in the fpga-runtime API? If so, could you point me to the best practice? I didn't find any hints in the library codebase. Thank you!

@Blaok
Collaborator

Blaok commented Dec 11, 2022

It is not available in the FPGA runtime library today. I'm happy to port it there, but I cannot promise any timeline. If you need it urgently (within a week or so), fpga::ReadOnly(reinterpret_cast<T1*>(buf.data()), buf.size() * sizeof(T0) / sizeof(T1)) would work, assuming 1) buf.data() is properly aligned, and 2) buf.size() * sizeof(T0) can be evenly divided by sizeof(T1).
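The two assumptions above can be checked on the host before the cast. The following is a minimal sketch, not part of FRT or TAPA: `ReinterpretSpan` and `Vec4` are hypothetical names introduced for illustration, and "properly aligned" is interpreted here as alignment to `alignof(T1)` only (a runtime may additionally want 4k-aligned host memory, which this sketch does not check).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>

// Hypothetical stand-in for a wider kernel-side element type (16 bytes).
struct alignas(16) Vec4 {
  float v[4];
};

// Hypothetical helper: views `count` elements of T0 as elements of T1.
// Returns the cast pointer and the new element count, or std::nullopt if
// either assumption from the comment above does not hold.
template <typename T1, typename T0>
std::optional<std::pair<T1*, std::size_t>> ReinterpretSpan(T0* data,
                                                           std::size_t count) {
  assert(data != nullptr);
  const std::size_t bytes = count * sizeof(T0);

  // Assumption 1: the pointer is aligned for T1.
  if (reinterpret_cast<std::uintptr_t>(data) % alignof(T1) != 0) {
    return std::nullopt;
  }
  // Assumption 2: the total byte size is a whole number of T1 elements.
  if (bytes % sizeof(T1) != 0) {
    return std::nullopt;
  }
  // Same expression as in the comment: size * sizeof(T0) / sizeof(T1).
  return std::make_pair(reinterpret_cast<T1*>(data), bytes / sizeof(T1));
}
```

If the helper returns a value, its pointer and count could be passed where the comment passes `reinterpret_cast<T1*>(buf.data())` and `buf.size() * sizeof(T0) / sizeof(T1)`. The sketch never dereferences the cast pointer on the host; it only computes what would be handed to the runtime.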

@Blaok
Collaborator

Blaok commented Dec 13, 2022

@ueqri FYI I added a Reinterpret method. Please upgrade to libfrt-dev 0.0.20221212.1 to use it.

@linghaosong

linghaosong commented Aug 31, 2023

> It is not available in the FPGA runtime library today. I'm happy to port it there, but I cannot promise any timeline. If you need it urgently (in 1 week or so), fpga::ReadOnly(reinterpret_cast<T1*>(buf.data()), buf.size() * sizeof(T0) / sizeof(T1)) would work, assuming 1) buf.data() is properly aligned, and 2) buf.size() * sizeof(T0) can be evenly divided by sizeof(T1).

The official Xilinx examples don't explain this clearly, which is a real trap. Reading this made me realize I had a bug earlier...

@linghaosong

linghaosong commented Aug 31, 2023

BTW, for the reference of anyone who has a similar issue and happens to land here, this is how to use XCL to do the pointer cast between the host and the kernel.

You need to do cl::Buffer(context, CL_MEM_XXXX | CL_XXXXX, arr.size() * sizeof(T_data_type_HOST), reinterpret_cast<T_data_type_KERNEL*>(arr.data()), &err);

Be careful with the size you pass to cl::Buffer: it is the size of arr in bytes.
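To make the byte-size point concrete, here is a small sketch of the arithmetic only; it does not call OpenCL. `BufferBytes`, `KernelElements`, and `Wide512` are hypothetical names introduced for illustration, with `Wide512` standing in for a 512-bit kernel-side element type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 512-bit (64-byte) kernel-side element type.
struct Wide512 {
  unsigned char bytes[64];
};

// Hypothetical helper: the value to pass as the size argument of cl::Buffer.
// It is measured in bytes of the host array, not in host or kernel elements.
template <typename HostT>
std::size_t BufferBytes(const std::vector<HostT>& arr) {
  return arr.size() * sizeof(HostT);
}

// Hypothetical helper: how many KernelT elements the kernel sees in a buffer
// of `bytes` bytes. The byte size must be a whole number of KernelT elements.
template <typename KernelT>
std::size_t KernelElements(std::size_t bytes) {
  assert(bytes % sizeof(KernelT) == 0);
  return bytes / sizeof(KernelT);
}
```

For example, a `std::vector<float>` with 32 elements occupies 128 bytes (assuming a 4-byte `float`), so 128 is the size to pass, and a kernel reading 64-byte words sees 2 elements. Passing 32 or 2 as the size would create a buffer that is too small.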

@dotkrnl dotkrnl added the enhancement New feature or request label Sep 19, 2024
@dotkrnl dotkrnl changed the title Inquire about host kernel invoking Optimize host kernel invoking performance Sep 19, 2024