XDMA AXI4-Stream Demo with H2C Bus Widened to 512-Bit

PCIe XDMA to AXI4-Stream with a 512-Bit H2C Bus. Demonstration for the Innova-2 using Vivado 2022.2. The stream multiplies floating-point numbers.

Block Design

PCIe XDMA AXI-Stream Block Diagram

AXI-Lite Addresses

PCIe XDMA AXILite Block Addresses

The AXI-Lite BAR has a 0x40000000 PCIe to AXI Translation offset.

AXILite BAR Setup

Bitstream

To recreate the bitstream, download xdma_stream_512bit.tcl and constraints.xdc, source the Tcl script in the Vivado 2022.2 Tcl Console, then run Generate Bitstream.

Load the bitstream into your Innova-2. It should work with every variant of the Innova-2. Refer to innova2_flex_xcku15p_notes for system setup.

pwd
cd DOWNLOAD_DIRECTORY
dir
source xdma_stream_512bit.tcl

Vivado Source Tcl Project Script

Generate the bitstream:

Vivado Generate Bitstream

Resources used for the design:

FPGA Resources Used

Testing

Check Drivers and Hardware

Confirm the xdma driver has loaded and the hardware is recognized and operating as expected.

sudo lspci -vnn -d 10ee: ; sudo lspci -vvnn -d 10ee: | grep Lnk
sudo lspci -vv -d 15b3:1974 | grep "Mellanox\|LnkSta"

lspci for Innova2

dmesg | grep xdma will detail how the XDMA driver has loaded.

dmesg grep xdma

ls /dev/xdma* will show all character device files associated with the XDMA driver.

ls dev xdma_0 Device Files

Test Software

Compile and run the test program.

gcc -Wall stream_test.c -o stream_test -lm  ;  sudo ./stream_test

stream_test Run

Occasionally there is a communication problem: a portion of the resulting C2H floating-point array is shifted by a few indices. I have run the core pwrite+pread loop millions of times, and when problems occur they show up early in the run.

Errors Encountered

Data Throughput Tests

With dd, using /dev/zero as the data source and /dev/null as the sink, you can experiment with how throughput varies with the count= and bs= (Block Size) values. Channel 1 (xdma0_h2c_1 and xdma0_c2h_1) is wired as a loopback. This gives an estimate of the maximum possible throughput.

Stream Channel1 is Loopback

In one terminal:

sudo dd if=/dev/zero of=/dev/xdma0_h2c_1 count=32768 bs=16384

In a second terminal:

sudo dd if=/dev/xdma0_c2h_1 of=/dev/null count=32768 bs=16384

Data Transfer Throughput Bandwidth Test

The reported H2C throughput will be lower because its elapsed time includes the time it takes you to switch to the second window and start the second dd.

Design Details

The maximum width for the AXI Bus with a PCIe 3.0 x8 design is 256-Bit but a 512-Bit stream is required.

Stream Channel1 is Loopback

The goal is to re-clock and channel the data through the stream. Clocks and resets are carefully managed. tkeep and tlast signals are omitted from all blocks as they are not used.

Clocking

To widen the 256-Bit AXI4-Stream bus to 512-Bit while maintaining the same bandwidth, the 250MHz axi_aclk clock is halved to 125MHz.

Clocking Wizard Settings

RESETs

Each clock needs an associated aresetn synchronized to it and controllable by a GPIO signal to allow resetting the stream.

Processor System Reset Blocks

FIFOs

Input (Host-to-Card H2C) and output (Card-to-Host C2H) FIFOs were added to increase throughput. The output C2H FIFO has the minimum depth of 16.

C2H FIFO Settings

To match throughput the input H2C FIFO has a depth of 32 as its stream uses twice as many bits.

H2C FIFO Settings

Data Width Converter

The 256-Bit XDMA Block H2C stream is widened to 512-Bit using a Data Width Converter.

H2C Data Width Converter

Clock Converter

The input (H2C) data stream is re-timed to 125MHz (half of the XDMA block's axi_aclk) which is used by the stream blocks.

H2C Clock Converter

The output (C2H) data stream is re-timed back to the 250MHz axi_aclk before going into the XDMA block.

C2H Clock Converter

Broadcaster and Combiner

The 512-Bit=64-Byte H2C data stream is split/broadcast into sixteen 32-Bit=4-Byte streams for the floating-point units.

H2C Broadcaster Settings

The bits of each 32-Bit=4-Byte stream are appropriately selected from the 512-Bit stream.

Broadcaster Splitting Options

The floating-point unit results are combined into the 256-Bit output C2H stream.

C2H Combiner Settings

Stream Does Something

I put Floating-Point blocks in the stream as an example of something useful. Each pair of 4-byte=32-bit single precision floating-point values in the 64-Byte=512-Bit Host-to-Card (H2C) stream gets multiplied to produce a floating-point value in the 32-Byte=256-Bit Card-to-Host (C2H) stream.

The floating-point blocks are set up to multiply their inputs.

Floating-Point Block Settings

Full DSP usage is allowed to maximize throughput.

Floating-Point Block Optimization Settings

The interface is set up as Blocking so that the AXI4-Stream interfaces include tready signals like the rest of the Stream blocks.

Floating-Point Block Interface Settings