2024.1 release documentation

nismehta-amd committed Jul 4, 2024
1 parent 3309aa6 commit c1c3425

Showing 157 changed files with 97,332 additions and 0 deletions.
58 changes: 58 additions & 0 deletions 2024.1/html/_sources/asynchronous_xrt.rst.txt

Asynchronous XRT (XRT Native APIs)
===================================

This is a simple example that showcases the asynchronous programming mechanism using user-defined queues.

**KEY CONCEPTS:** `XRT Native API <https://docs.xilinx.com/r/en-US/ug1393-vitis-application-acceleration/Setting-Up-XRT-Managed-Kernels-and-Kernel-Arguments>`__, `Asynchronous Programming <https://xilinx.github.io/XRT/2023.1/html/xrt_native_apis.html?highlight=queue#asynchornous-programming-with-xrt-experimental>`__

**KEYWORDS:** `xrt::queue <https://xilinx.github.io/XRT/2023.1/html/xrt_native_apis.html?highlight=queue#executing-multiple-tasks-through-queue>`__, `enqueue <https://xilinx.github.io/XRT/2023.1/html/xrt_native_apis.html?highlight=queue#executing-multiple-tasks-through-queue>`__, `wait() <https://xilinx.github.io/XRT/2023.1/html/xrt_native_apis.html?highlight=queue#executing-multiple-tasks-through-queue>`__


In this example, we showcase the asynchronous programming mechanism using user-defined queues. ``xrt::queue`` is a lightweight, general-purpose queue implementation that is completely separate from the core XRT native API data structures.

The XRT queue implementation requires the header ``#include <experimental/xrt_queue.h>``. The implementation also uses C++17 features, so the host code must be compiled with ``g++ -std=c++17``.
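
For illustration, a typical host compile line (a sketch; it assumes ``XILINX_XRT`` has been set by the XRT setup script, and the output name is chosen for this example) could be:

::

   g++ -std=c++17 -I$XILINX_XRT/include -L$XILINX_XRT/lib -lxrt_coreutil -pthread -o asynchronous_xrt src/host.cpp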

Executing multiple tasks through queue
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: c++
:number-lines: 84


xrt::queue main_queue;
xrt::queue queue_bo1;
auto bo0_event = main_queue.enqueue([&bo0] {bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
auto bo1_event = queue_bo1.enqueue([&bo1] {bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE); });
main_queue.enqueue(bo1_event);
main_queue.enqueue([&run] {run.start(); run.wait(); });
auto bo_out_event = main_queue.enqueue([&bo_out] {bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); });
bo_out_event.wait();


In lines 86 and 87, the ``bo0`` and ``bo1`` host-to-device data transfers are enqueued through two separate queues to achieve parallel transfers. To synchronize these two queues, the event returned from ``queue_bo1`` is enqueued in the ``main_queue``, just like a task (line 88). As a result, any task submitted after that event will not execute until the event has finished. In the code example above, subsequent tasks in the ``main_queue`` (such as the kernel execution) therefore wait until ``bo1_event`` completes. By submitting an event returned from ``queue::enqueue`` to another queue, we can synchronize among queues.
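
The snippet above assumes that the device, kernel, buffers, and run object have already been created. A minimal setup sketch with the XRT native API is shown below; the xclbin name, kernel name, and buffer size are illustrative assumptions, not taken from the example source.

.. code:: c++

   // Illustrative excerpt (not the example's exact source): how the objects
   // referenced in the queue snippet above are typically created.
   #include <experimental/xrt_queue.h>
   #include <xrt/xrt_bo.h>
   #include <xrt/xrt_device.h>
   #include <xrt/xrt_kernel.h>

   constexpr size_t DATA_SIZE = 4096;                    // size assumed for illustration

   xrt::device device(0);                                // open the first device
   auto uuid = device.load_xclbin("vadd.xclbin");        // xclbin name assumed
   xrt::kernel krnl(device, uuid, "vadd");               // kernel name assumed

   size_t size_bytes = DATA_SIZE * sizeof(int);
   xrt::bo bo0(device, size_bytes, krnl.group_id(0));    // first input buffer
   xrt::bo bo1(device, size_bytes, krnl.group_id(1));    // second input buffer
   xrt::bo bo_out(device, size_bytes, krnl.group_id(2)); // output buffer

   xrt::run run(krnl);                                   // run object started by the queue
   run.set_arg(0, bo0);
   run.set_arg(1, bo1);
   run.set_arg(2, bo_out);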

**EXCLUDED PLATFORMS:**

- All NoDMA Platforms, e.g., u50 nodma

DESIGN FILES
------------

Application code is located in the src directory. Accelerator binary files will be compiled into the xclbin directory, which is required by the Makefile; its contents are populated during compilation. A listing of all the files in this example is shown below.

::

src/host.cpp
src/vadd.cpp
Access these files in the GitHub repo by `clicking here <https://github.com/Xilinx/Vitis_Accel_Examples/tree/master/host_xrt/asynchronous_xrt>`__.

COMMAND LINE ARGUMENTS
----------------------

Once the environment has been configured, the application can be executed as follows:

::

./asynchronous_xrt -x <vadd XCLBIN>

225 changes: 225 additions & 0 deletions 2024.1/html/_sources/axi_burst_performance.rst.txt

AXI Burst Performance
=====================

This is an AXI burst performance check design. It measures the time it takes to write a buffer into DDR or read a buffer from DDR. The example contains two sets of six kernels: each set has a different data width, and each kernel has different ``burst_length`` and ``num_outstanding`` parameters, so the impact of these parameters on effective throughput can be compared.

A counter is coded inside each kernel to accurately count the number of cycles between the start and the end of the buffer transfer.
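
The cycle count converts to effective throughput as the buffer size divided by the elapsed time (cycle count divided by the kernel clock frequency). As a purely illustrative calculation with made-up cycle numbers (not measured values):

::

   elapsed time = cycles / f_clk             = 2400 / 300 MHz  = 8 us
   throughput   = buffer size / elapsed time = 16384 bytes / 8 us ≈ 2.05 GB/sec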

In this version, the kernels are configured as follows:

::

Data Width - 256
test_kernel_maxi_256bit_1: burst length= 4, outstanding transactions=4
test_kernel_maxi_256bit_2: burst length=16, outstanding transactions=4
test_kernel_maxi_256bit_3: burst length=32, outstanding transactions=4
test_kernel_maxi_256bit_4: burst length= 4, outstanding transactions=32
test_kernel_maxi_256bit_5: burst length=16, outstanding transactions=32
test_kernel_maxi_256bit_6: burst length=32, outstanding transactions=32
Data Width - 512
test_kernel_maxi_512bit_1: burst length= 4, outstanding transactions=4
test_kernel_maxi_512bit_2: burst length=16, outstanding transactions=4
test_kernel_maxi_512bit_3: burst length=32, outstanding transactions=4
test_kernel_maxi_512bit_4: burst length= 4, outstanding transactions=32
test_kernel_maxi_512bit_5: burst length=16, outstanding transactions=32
test_kernel_maxi_512bit_6: burst length=32, outstanding transactions=32
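
In Vitis HLS kernels, the burst length and the number of outstanding transactions are typically requested through options of the ``m_axi`` INTERFACE pragma. The following is a hedged sketch; the kernel body and pragma placement are illustrative, and the example's actual kernels are in the ``src`` directory.

.. code:: c++

   #include <ap_int.h>

   // Illustrative sketch, not the example's actual source: requesting a given
   // read burst length and number of outstanding reads on an m_axi port.
   extern "C" void test_kernel(ap_uint<256>* mem, unsigned int num_words) {
   #pragma HLS INTERFACE m_axi port = mem bundle = gmem max_read_burst_length = 16 num_read_outstanding = 4
       ap_uint<256> acc = 0;
       for (unsigned int i = 0; i < num_words; ++i) {
   #pragma HLS PIPELINE II = 1
           acc += mem[i]; // sequential reads the tool can coalesce into bursts
       }
       mem[0] = acc;      // write back so the reads are not optimized away
   }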

Below are the resource numbers when running the design on the U200 platform:

Data Width - 256

========================= ==== ==== ====
Kernel LUT REG BRAM
========================= ==== ==== ====
test_kernel_maxi_256bit_1 4.2K 7.2K 11
test_kernel_maxi_256bit_2 4.3K 7.2K 11
test_kernel_maxi_256bit_3 4.4K 7.3K 11
test_kernel_maxi_256bit_4 4.3K 7.2K 11
test_kernel_maxi_256bit_5 4.3K 7.3K 11
test_kernel_maxi_256bit_6 4.5K 7.1K 15
========================= ==== ==== ====

Data Width - 512

========================= ==== ==== ====
Kernel LUT REG BRAM
========================= ==== ==== ====
test_kernel_maxi_512bit_1 4.8K 9.0K 14
test_kernel_maxi_512bit_2 4.9K 9.1K 14
test_kernel_maxi_512bit_3 5.2K 9.1K 14
test_kernel_maxi_512bit_4 4.9K 9.1K 14
test_kernel_maxi_512bit_5 4.9K 9.1K 14
test_kernel_maxi_512bit_6 5.2K 9.0K 23
========================= ==== ==== ====

Following is the log reported when running the design on the U200 platform with 16 KB transfers:

::

Test parameters
- xclbin file : ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_256bit.xclbin
- frequency : 300 MHz
- buffer size : 16.00 KB
Found Platform
Platform Name: Xilinx
INFO: Reading ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_256bit.xclbin
Loading: './build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_256bit.xclbin'
Trying to program device[1]: xilinx_u200_xdma_201830_2
Device[1]: program successful!
Kernel->AXI Burst WRITE performance
Data Width = 256 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 2.55877 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 6.31398 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 6.84251 GB/sec
Data Width = 256 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 4.26223 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 6.45647 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 6.84251 GB/sec
Kernel->AXI Burst READ performance
Data Width = 256 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 2.01658 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 6.54884 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 7.79836 GB/sec
Data Width = 256 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 7.7851 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 7.79836 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 7.79836 GB/sec
Test parameters
- xclbin file : ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_512bit.xclbin
- frequency : 300 MHz
- buffer size : 16.00 KB
Found Platform
Platform Name: Xilinx
INFO: Reading ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_512bit.xclbin
Loading: './build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_512bit.xclbin'
Trying to program device[1]: xilinx_u200_xdma_201830_2
Device[1]: program successful!
Kernel->AXI Burst WRITE performance
Data Width = 512 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 5.17832 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 8.23316 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 11.5306 GB/sec
Data Width = 512 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 8.10201 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 11.5016 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 11.2473 GB/sec
Kernel->AXI Burst READ performance
Data Width = 512 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 4.04385 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 11.6776 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 KB | throughput = 13.6646 GB/sec
Data Width = 512 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 13.6646 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 13.6646 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 KB | throughput = 13.6646 GB/sec
TEST PASSED

Following is the log reported when running the design on the U200 platform with 16 MB transfers:

::

Test parameters
- xclbin file : ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_256bit.xclbin
- frequency : 300 MHz
- buffer size : 16.00 MB
Found Platform
Platform Name: Xilinx
INFO: Reading ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_256bit.xclbin
Loading: './build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_256bit.xclbin'
Trying to program device[1]: xilinx_u200_xdma_201830_2
Device[1]: program successful!
Kernel->AXI Burst WRITE performance
Data Width = 256 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 2.66919 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 6.62449 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 7.59737 GB/sec
Data Width = 256 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 4.47013 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 7.1518 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 7.94597 GB/sec
Kernel->AXI Burst READ performance
Data Width = 256 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 2.02206 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 6.80909 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 8.59958 GB/sec
Data Width = 256 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 8.68773 GB/sec
Data Width = 256 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 8.93942 GB/sec
Data Width = 256 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 8.93942 GB/sec
Test parameters
- xclbin file : ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_512bit.xclbin
- frequency : 300 MHz
- buffer size : 16.00 MB
Found Platform
Platform Name: Xilinx
INFO: Reading ./build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_512bit.xclbin
Loading: './build_dir.hw.xilinx_u200_xdma_201830_2/test_kernel_maxi_512bit.xclbin'
Trying to program device[1]: xilinx_u200_xdma_201830_2
Device[1]: program successful!
Kernel->AXI Burst WRITE performance
Data Width = 512 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 5.1399 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 11.7942 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 14.6941 GB/sec
Data Width = 512 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 8.93979 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 14.3008 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 15.1586 GB/sec
Kernel->AXI Burst READ performance
Data Width = 512 burst_length = 4 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 3.92988 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 13.1114 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 4 buffer_size = 16.00 MB | throughput = 16.8218 GB/sec
Data Width = 512 burst_length = 4 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 16.8222 GB/sec
Data Width = 512 burst_length = 16 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 16.8295 GB/sec
Data Width = 512 burst_length = 32 num_outstanding = 32 buffer_size = 16.00 MB | throughput = 16.8219 GB/sec
TEST PASSED
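
As a rough sanity check (arithmetic added for context, not part of the log): a 256-bit interface moves 32 bytes per cycle and a 512-bit interface moves 64 bytes per cycle, so at the 300 MHz kernel clock the theoretical ceilings are:

::

   256-bit: 32 bytes/cycle x 300 MHz = 9.6 GB/sec   (best measured above: ~8.94 GB/sec)
   512-bit: 64 bytes/cycle x 300 MHz = 19.2 GB/sec  (best measured above: ~16.83 GB/sec)

The longest bursts with the most outstanding transactions approach these ceilings, while short bursts with few outstanding transactions are latency-bound and fall well below them.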

**EXCLUDED PLATFORMS:**

- All Embedded Zynq Platforms, e.g., zc702, zcu102
- All Versal Platforms, e.g., vck190
- AWS VU9P F1
- Samsung SmartSSD Computation Storage Drive
- Samsung U.2 SmartSSD
- All NoDMA Platforms, e.g., u50 nodma
- Versal V70

DESIGN FILES
------------

Application code is located in the src directory. Accelerator binary files will be compiled into the xclbin directory, which is required by the Makefile; its contents are populated during compilation. A listing of all the files in this example is shown below.

::

src/host.cpp
src/test_kernel_common.hpp
src/test_kernel_maxi_256bit_1.cpp
src/test_kernel_maxi_256bit_2.cpp
src/test_kernel_maxi_256bit_3.cpp
src/test_kernel_maxi_256bit_4.cpp
src/test_kernel_maxi_256bit_5.cpp
src/test_kernel_maxi_256bit_6.cpp
src/test_kernel_maxi_512bit_1.cpp
src/test_kernel_maxi_512bit_2.cpp
src/test_kernel_maxi_512bit_3.cpp
src/test_kernel_maxi_512bit_4.cpp
src/test_kernel_maxi_512bit_5.cpp
src/test_kernel_maxi_512bit_6.cpp
Access these files in the GitHub repo by `clicking here <https://github.com/Xilinx/Vitis_Accel_Examples/tree/master/performance/axi_burst_performance>`__.

COMMAND LINE ARGUMENTS
----------------------

Once the environment has been configured, the application can be executed as follows:

::

./axi_burst_performance -x1 <test_kernel_maxi_256bit XCLBIN> -x2 <test_kernel_maxi_512bit XCLBIN>

9 changes: 9 additions & 0 deletions 2024.1/html/_sources/common.rst.txt

Common Files
============

The Common Files section contains:

- Collection of common files used across all examples to assist in the quick development of application host code.
- Collection of utility functions used as part of the Makefiles in all of the examples. This set includes Makefile rules and scripts to launch Vitis-compiled applications onto boards hosted by Nimbix directly from the developer's terminal shell.


90 changes: 90 additions & 0 deletions 2024.1/html/_sources/compile_execute.rst.txt

Compilation and Execution
=========================

It is recommended to start with the Hello World example, which makes new users aware of the basic structure of a Vitis-based application.

Compiling for Application Emulation
-----------------------------------

As part of the capabilities available to an application developer, Vitis includes environments to test the correctness of an application at both a software functional level and a hardware-emulation level.

These modes, which are named sw_emu and hw_emu, allow the developer to profile and evaluate the performance of a design before compiling for board execution.
It is recommended that all applications be executed in at least sw_emu mode before being compiled for and executed on an FPGA board.

For DC platforms:

::

cd <PATH TO SAMPLE APPLICATION>
make all TARGET=<sw_emu|hw_emu> PLATFORM=<FPGA Platform>

For SoC platforms:

::

cd <PATH TO SAMPLE APPLICATION>
make all TARGET=<sw_emu|hw_emu> PLATFORM=<FPGA platform> HOST_ARCH=<aarch32/aarch64> EDGE_COMMON_SW=<rootfs and kernel image path>

where,

*sw_emu = software emulation*,
*hw_emu = hardware emulation*

By default, HOST_ARCH=x86. HOST_ARCH and EDGE_COMMON_SW are required for SoC shells. Please download and use the pre-built image from `here <https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/embedded-platforms.html>`__.
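
For example, to build the asynchronous XRT example from this documentation for software emulation on a U200 shell (the platform name below is taken from the logs elsewhere in these docs and is illustrative; substitute the platform installed on your system):

::

   cd host_xrt/asynchronous_xrt
   make all TARGET=sw_emu PLATFORM=xilinx_u200_xdma_201830_2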

**NOTE:** The software emulation flow is only a functional correctness check; it does not estimate the performance of the application in hardware.

The hardware emulation flow is a cycle-accurate simulation of the hardware generated for the application; as such, this simulation is expected to take a long time.

Executing Emulated Application
------------------------------
*(Recommended Execution Flow for Example Applications in Emulation)*

The makefile can directly execute the application with the following command:

For DC platforms:

::

cd <PATH TO SAMPLE APPLICATION>
make run TARGET=<sw_emu|hw_emu> PLATFORM=<FPGA Platform>

For SoC platforms:

::

cd <PATH TO SAMPLE APPLICATION>
make run TARGET=<sw_emu|hw_emu> PLATFORM=<FPGA platform> HOST_ARCH=<aarch32/aarch64> EDGE_COMMON_SW=<rootfs and kernel image path>

where,

*sw_emu = software emulation*,
*hw_emu = hardware emulation*

By default, HOST_ARCH=x86. HOST_ARCH and EDGE_COMMON_SW are required for SoC shells. Please download and use the pre-built image from `here <https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/embedded-platforms.html>`__.
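
As an illustration for an SoC platform (the platform name and image path below are placeholders, not tested values; use the shell and pre-built common image installed on your system):

::

   cd <PATH TO SAMPLE APPLICATION>
   make run TARGET=hw_emu PLATFORM=<zcu102 platform> HOST_ARCH=aarch64 EDGE_COMMON_SW=$HOME/xilinx-zynqmp-common-v2024.1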

If the application has not been previously compiled, the ``run`` makefile rule will compile and then execute the application in the emulation mode selected by the user.

Compiling for FPGA Accelerator Card
-----------------------------------

The command to compile for the FPGA accelerator card is:

For DC platforms:

::

cd <PATH TO SAMPLE APPLICATION>
make all PLATFORM=<FPGA Platform> TARGET=<hw>

For SoC platforms:

::

cd <PATH TO SAMPLE APPLICATION>
make all PLATFORM=<FPGA Platform> TARGET=<hw> HOST_ARCH=<aarch32/aarch64> EDGE_COMMON_SW=<rootfs and kernel image path>

By default, HOST_ARCH=x86. HOST_ARCH and EDGE_COMMON_SW are required for SoC shells. Please download and use the pre-built image from `here <https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/embedded-platforms.html>`__.

**NOTE:** Compilation for hardware generates custom logic to implement the functionality of the kernels in an application. It is typical for hardware compile times to range from 30 minutes to a couple of hours.