Optimize host kernel invoking performance #128
Comments
I would recommend using the FPGA runtime library unless by "read partial results" you mean reading only part of a single buffer, which can be achieved only with the XRT/OpenCL APIs. You can use Example:
// This is the kernel function.
void Kernel(int scalar, tapa::mmap<float> buf1, tapa::mmap<float> buf2);

// Make sure to 4k-align the host memory to avoid additional copy.
template <typename T>
using aligned_vector = std::vector<T, tapa::aligned_allocator<T>>;
...
int main() {
  // Load the bitstream and acquire the FPGA resource.
  fpga::Instance instance("/path/to/bitstream");

  // Prepare the arguments.
  int scalar_arg = 42;  // Make sure the scalar arguments have the correct size.
  aligned_vector<float> buf1_vec, buf2_vec;
  // Note that fpga::ReadOnly corresponds to tapa::write_only_mmap and
  // fpga::WriteOnly corresponds to tapa::read_only_mmap. This is because
  // the FPGA runtime library is host-centric but TAPA is kernel-centric.
  auto buf1_arg = fpga::ReadWrite(buf1_vec.data(), buf1_vec.size());
  auto buf2_arg = fpga::ReadWrite(buf2_vec.data(), buf2_vec.size());

  // Set kernel arguments.
  instance.SetArgs(scalar_arg, buf1_arg, buf2_arg);
  /* This is equivalent to the following:
   * instance.SetArg(0, scalar_arg);
   * instance.SetArg(1, buf1_arg);
   * instance.SetArg(2, buf2_arg);
   */

  // Write both buffers to device
  instance.WriteToDevice();

  for (;;) {
    // Execute kernel
    instance.Exec();
    // Wait until previous operations finish
    instance.Finish();
    // Read only buf1 from device (skip buf2)
    instance.SuspendBuf(2);
    instance.ReadFromDevice();
    if (...) break;
    // ...
    // Re-set kernel arguments (scalar arguments do not need explicit WriteToDevice)
    instance.SetArg(0, scalar_arg);
    // Re-execute kernel (w/ different arguments) in the next iteration
  }

  // Read all outputs (full buffer) from device
  instance.SetArg(2, buf2_arg);
  instance.ReadFromDevice();
  instance.Finish();
  // ...
}
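The remark about 4k-aligning host memory can be illustrated without any FPGA library. The sketch below is a minimal stand-in, not the actual source of tapa::aligned_allocator: the names PageAlignedAllocator and IsAligned are hypothetical, the 4096-byte alignment is taken from the comment in the example, and C++17 aligned operator new is assumed to be available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator for illustration only; it returns storage whose
// address is a multiple of Alignment (4096 by default).
template <typename T, std::size_t Alignment = 4096>
struct PageAlignedAllocator {
  using value_type = T;

  PageAlignedAllocator() = default;
  template <typename U>
  PageAlignedAllocator(const PageAlignedAllocator<U, Alignment>&) noexcept {}

  // An explicit rebind is required because of the non-type template parameter.
  template <typename U>
  struct rebind {
    using other = PageAlignedAllocator<U, Alignment>;
  };

  T* allocate(std::size_t n) {
    return static_cast<T*>(
        ::operator new(n * sizeof(T), std::align_val_t(Alignment)));
  }
  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, std::align_val_t(Alignment));
  }
};

template <typename T, typename U, std::size_t A>
bool operator==(const PageAlignedAllocator<T, A>&,
                const PageAlignedAllocator<U, A>&) {
  return true;
}
template <typename T, typename U, std::size_t A>
bool operator!=(const PageAlignedAllocator<T, A>&,
                const PageAlignedAllocator<U, A>&) {
  return false;
}

// Returns true if the address of p is a multiple of alignment.
inline bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Builds a page-aligned vector of n zero-initialized floats.
inline std::vector<float, PageAlignedAllocator<float>> MakeAlignedBuffer(
    std::size_t n) {
  return std::vector<float, PageAlignedAllocator<float>>(n, 0.0f);
}
```

A buffer built this way can be checked with IsAligned(buf.data(), 4096) before it is handed to a runtime, which is a cheap way to catch an accidental fallback to a plain std::vector.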
Thanks for your kind reply and example! It really helps our development. BTW, is there any chance to use the TAPA-style reinterpret method like
It is not available in the FPGA runtime library today. I'm happy to port it there, but I cannot promise any timeline. If you need it urgently (in 1 week or so),
@ueqri FYI I added a
The official Xilinx examples don't explain this clearly, which is a real trap. Reading this made me realize a bug I had earlier...
BTW, I am posting here how to use XCL to do a pointer cast between the host and the kernel, for the reference of anyone who has a similar issue and happens to land here. You need to do Be careful with the size you enter for the
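The comment above is cut off, so which size it refers to is not recoverable here. As a general illustration of why the size matters when a host buffer is reinterpreted as a wider element type, the sketch below recomputes the element count from the byte count and rejects buffers that do not divide evenly. BufferView, FloatVec16, and ReinterpretView are hypothetical names for this example; they are not part of TAPA, the FPGA runtime library, or XRT/OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical pointer-plus-count pair describing a host buffer.
template <typename T>
struct BufferView {
  T* data;
  std::size_t size;  // number of elements of type T, not bytes
};

// Hypothetical 64-byte element, standing in for a wide kernel-side vector type.
struct FloatVec16 {
  float v[16];
};

// Reinterprets a buffer of From elements as a buffer of To elements.
// The element count is recomputed from the total byte count; a buffer whose
// byte count is not a multiple of sizeof(To) is rejected instead of truncated.
// The returned pointer is only meant to be handed to a device API as an
// address; dereferencing it on the host may violate strict aliasing.
template <typename To, typename From>
BufferView<To> ReinterpretView(BufferView<From> src) {
  const std::size_t bytes = src.size * sizeof(From);
  if (bytes % sizeof(To) != 0) {
    throw std::invalid_argument("byte count is not a multiple of sizeof(To)");
  }
  if (reinterpret_cast<std::uintptr_t>(src.data) % alignof(To) != 0) {
    throw std::invalid_argument("buffer is not aligned for To");
  }
  return BufferView<To>{reinterpret_cast<To*>(src.data), bytes / sizeof(To)};
}
```

With this bookkeeping, 32 floats become 2 elements of FloatVec16, while 30 floats are rejected; passing the original element count of 32 for the reinterpreted buffer would describe 16 times more memory than was allocated.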
Hi, in our experience with TAPA, calling tapa::invoke on the host side is quite time-consuming when we want to re-run a kernel multiple times, because it involves two unnecessary steps: (1) re-programming the bitstream to the FPGA, and (2) re-transferring data back and forth. Specifically, for this workflow in the host program:
We found that the TAPA library offers no host-side API for such a workflow, unless we (1) use the fpga-runtime library directly like this, or (2) use the XRT/OpenCL APIs manually without the above wrappers (which is tricky and may interfere with other parts such as TAPA simulation).
So I was wondering whether there is currently any way to implement the above workflow with TAPA, or any plan to support fine-grained kernel invocation in the future? Thank you!
(PS: in addition to implementing this workflow from the host side, we have considered an on-board scheduler, but we believe it may be detrimental to frequency and area since our kernel is already large enough.)
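To make the cost difference concrete, here is a small counting model of the two host-side strategies. It is only a mock: MockDevice, RunPerInvocation, and RunPersistent are hypothetical names, no real FPGA API is called, and the assumption that each self-contained invocation re-programs the device and moves both buffers comes from the description in this issue rather than from measurement.

```cpp
#include <cassert>

// Hypothetical device that only counts operations.
struct MockDevice {
  int programs = 0;          // bitstream programming events
  int buffers_written = 0;   // host-to-device buffer transfers
  int buffers_read = 0;      // device-to-host buffer transfers
  int execs = 0;             // kernel executions

  void Program() { ++programs; }
  void Write(int buffers) { buffers_written += buffers; }
  void Read(int buffers) { buffers_read += buffers; }
  void Exec() { ++execs; }
};

// Models one self-contained invocation per iteration: program, write both
// buffers, execute, read both buffers.
inline void RunPerInvocation(MockDevice& dev, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    dev.Program();
    dev.Write(2);
    dev.Exec();
    dev.Read(2);
  }
}

// Models the persistent-instance workflow: program and write once, then in
// each iteration execute and read only the one buffer the host needs, and
// read the remaining buffer once at the end.
inline void RunPersistent(MockDevice& dev, int iterations) {
  dev.Program();
  dev.Write(2);
  for (int i = 0; i < iterations; ++i) {
    dev.Exec();
    dev.Read(1);
  }
  dev.Read(1);
}
```

Under this model the number of kernel executions is identical, while programming events drop from one per iteration to one in total and buffer writes drop from two per iteration to two in total.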