- Xilinx Vivado 2022.2
- cmake 3.0 or higher
Supported boards (out of the box)
- Xilinx VC709
- Xilinx VCU118
- Alpha Data ADM-PCIE-7V3
This repository uses git submodules, so do one of the following:
# When cloning:
git clone --recurse-submodules [email protected]/this/repo.git
# Later, if you forgot or when submodules have been updated:
git submodule update --init --recursive
- Create a build directory
mkdir build
cd build
- Configure build
cmake .. -DFNS_PLATFORM=xilinx_u55c_gen3x16_xdma_3_202210_1 -DFNS_DATA_WIDTH=64
All cmake options:
Name | Values | Description |
---|---|---|
FNS_PLATFORM | xilinx_u55c_gen3x16_xdma_3_202210_1 | Target platform to build |
FNS_DATA_WIDTH | <8,16,32,64> | Data width of the network stack in bytes |
FNS_ROCE_STACK_MAX_QPS | 500 | Maximum number of queue pairs the RoCE stack can support |
FNS_TCP_STACK_MSS | #value | Maximum segment size of the TCP/IP stack |
FNS_TCP_STACK_FAST_RETRANSMIT_EN | <0,1> | Enabling TCP fast retransmit |
FNS_TCP_STACK_NODELAY_EN | <0,1> | Toggles Nagle's Algorithm on/off |
FNS_TCP_STACK_MAX_SESSIONS | #value | Maximum number of sessions the TCP/IP stack can support |
FNS_TCP_STACK_RX_DDR_BYPASS_EN | <0,1> | Enabling DDR bypass on the RX path |
FNS_TCP_STACK_WINDOW_SCALING_EN | <0,1> | Enabling the TCP window scaling option |
- Build HLS IP cores and install them into IP repository
make ip
For an example project including the TCP/IP stack or the RoCEv2 stack with DMA to host memory, check out our Distributed Accelerator OS DavOS.
- Setup build directory, e.g. for the TCP module
cd hls/toe
mkdir build
cd build
cmake .. -DFNS_PLATFORM=xilinx_u55c_gen3x16_xdma_3_202210_1 -DFNS_DATA_WIDTH=64
- Run
make csim # C-Simulation (csim_design)
make synth # Synthesis (csynth_design)
make cosim # Co-Simulation (cosim_design)
make ip # Export IP (export_design)
All interfaces are using the AXI4-Stream protocol. For AXI4-Streams carrying network/data packets, we use the following definition in HLS:
template <int D>
struct net_axis {
ap_uint<D> data;
ap_uint<D/8> keep;
ap_uint<1> last;
};
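As an illustration of how keep and last relate to a packet, the following host-side sketch splits a byte buffer into stream words. It is not the HLS code of this repository: NetAxisWord and packetize are hypothetical names, plain C++ types stand in for ap_uint, and BYTES corresponds to D/8 in net_axis<D>.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software stand-in for net_axis<D> (hypothetical, not the HLS type).
// BYTES corresponds to D/8, so keep carries one bit per data byte.
template <std::size_t BYTES>
struct NetAxisWord {
    std::array<std::uint8_t, BYTES> data{};
    std::uint64_t keep = 0;  // bit i set => data[i] holds a valid byte
    bool last = false;       // set only on the final word of a packet
};

// Split a packet into stream words; only the final word may be partially filled.
template <std::size_t BYTES>
std::vector<NetAxisWord<BYTES>> packetize(const std::vector<std::uint8_t>& packet) {
    static_assert(BYTES >= 1 && BYTES <= 64, "keep is modelled as a 64-bit mask");
    std::vector<NetAxisWord<BYTES>> words;
    for (std::size_t offset = 0; offset < packet.size(); offset += BYTES) {
        NetAxisWord<BYTES> word;
        const std::size_t remaining = packet.size() - offset;
        const std::size_t valid = remaining < BYTES ? remaining : BYTES;
        for (std::size_t i = 0; i < valid; ++i) {
            word.data[i] = packet[offset + i];
            word.keep |= (std::uint64_t{1} << i);
        }
        word.last = (offset + valid == packet.size());
        words.push_back(word);
    }
    // Postcondition: a non-empty packet always ends with last set.
    assert(words.empty() || words.back().last);
    return words;
}
```

With BYTES = 8 (a 64-bit stream), a 10-byte packet yields two words: the first with all eight keep bits set, the second with the two lowest keep bits and last set.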
To open a connection, the destination IP address and TCP port have to be provided through the s_axis_open_conn_req interface. The TCP stack answers this request through the m_axis_open_conn_rsp interface, which provides the sessionID and a boolean indicating whether the connection was opened successfully.
Interface definition in HLS:
struct ipTuple {
ap_uint<32> ip_address;
ap_uint<16> ip_port;
};
struct openStatus {
ap_uint<16> sessionID;
bool success;
};
void toe(...
hls::stream<ipTuple>& openConnReq,
hls::stream<openStatus>& openConnRsp,
...);
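The request/response handshake can be sketched on the host side as follows. This is a minimal illustration under assumptions: std::queue stands in for hls::stream, plain integers for ap_uint, and requestOpen and pollOpenResponse are hypothetical helper names. The byte order the stack expects for ip_address is not specified here and is left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Plain C++ stand-ins for the HLS structs above (assumption for illustration).
struct IpTuple {
    std::uint32_t ip_address;
    std::uint16_t ip_port;
};
struct OpenStatus {
    std::uint16_t sessionID;
    bool success;
};

// Issue an open-connection request (the s_axis_open_conn_req side).
void requestOpen(std::queue<IpTuple>& openConnReq, std::uint32_t ip, std::uint16_t port) {
    openConnReq.push(IpTuple{ip, port});
}

// Non-blocking poll of the response (the m_axis_open_conn_rsp side).
// Returns false while no response is pending; otherwise fills `status`.
// The sessionID in `status` is only meaningful if status.success is true.
bool pollOpenResponse(std::queue<OpenStatus>& openConnRsp, OpenStatus& status) {
    if (openConnRsp.empty()) {
        return false;
    }
    status = openConnRsp.front();
    openConnRsp.pop();
    return true;
}
```

An application would keep the returned sessionID and use it for all later requests on that connection, including the close request.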
To close a connection, the sessionID has to be provided to the s_axis_close_conn_req interface. The TCP/IP stack does not provide a notification upon completion of this request; however, it is guaranteed that the connection is closed eventually.
Interface definition in HLS:
hls::stream<ap_uint<16> >& closeConnReq,
To open a port to listen on (e.g. as a server), the port number has to be provided to s_axis_listen_port_req. The port number has to be in the range of active ports: 0 - 32767. The TCP stack will respond through the m_axis_listen_port_rsp interface, indicating whether the port was set to the listen state successfully.
Interface definition in HLS:
hls::stream<ap_uint<16> >& listenPortReq,
hls::stream<bool>& listenPortRsp,
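Since only ports 0 - 32767 can be put into the listen state, an application may want to check the range before issuing a request. The sketch below assumes std::queue as a stand-in for hls::stream; isListenPortInRange and requestListen are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Upper bound of the listen-port range stated above.
const std::uint16_t kMaxListenPort = 32767;

bool isListenPortInRange(std::uint16_t port) {
    return port <= kMaxListenPort;
}

// Push a listen request only if the port is in range.
// Returns false (and sends nothing) for ports 32768 and above.
bool requestListen(std::queue<std::uint16_t>& listenPortReq, std::uint16_t port) {
    if (!isListenPortInRange(port)) {
        return false;
    }
    listenPortReq.push(port);
    return true;
}
```

The boolean received on listenPortRsp still has to be checked afterwards, because a request for an in-range port can be rejected by the stack.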
The application using the TCP stack can receive notifications through the m_axis_notification interface. The notifications either indicate that new data is available or that a connection was closed.
Interface definition in HLS:
struct appNotification {
ap_uint<16> sessionID;
ap_uint<16> length;
ap_uint<32> ipAddress;
ap_uint<16> dstPort;
bool closed;
};
hls::stream<appNotification>& notification,
If data is available on a TCP/IP session, i.e. a notification was received, this data can be requested through the s_axis_rx_data_req interface. The sessionID and the data are then received through the m_axis_rx_data_rsp_metadata and m_axis_rx_data_rsp interfaces, respectively.
Interface definition in HLS:
struct appReadRequest {
ap_uint<16> sessionID;
ap_uint<16> length;
};
hls::stream<appReadRequest>& rxDataReq,
hls::stream<ap_uint<16> >& rxDataRspMeta,
hls::stream<net_axis<WIDTH> >& rxDataRsp,
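The step from a notification to a read request can be sketched as below. This is an illustration only: std::queue stands in for hls::stream, handleNotification is a hypothetical helper, and it assumes that a notification with closed set (or with length 0) carries no data to request.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Plain C++ stand-ins for the HLS structs above (assumption for illustration).
struct AppNotification {
    std::uint16_t sessionID;
    std::uint16_t length;
    std::uint32_t ipAddress;
    std::uint16_t dstPort;
    bool closed;
};
struct AppReadRequest {
    std::uint16_t sessionID;
    std::uint16_t length;
};

// Turn a data notification into a read request for the announced bytes.
// Returns true if a request was issued, false if there is nothing to read.
bool handleNotification(const AppNotification& n, std::queue<AppReadRequest>& rxDataReq) {
    if (n.closed || n.length == 0) {
        return false;  // assumption: close notifications announce no data
    }
    rxDataReq.push(AppReadRequest{n.sessionID, n.length});
    return true;
}
```

After the request, the application would expect the sessionID on rxDataRspMeta followed by the payload words on rxDataRsp.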
Waveform of receiving a (data) notification, requesting data, and receiving the data:
When an application wants to transmit data on a TCP connection, it first has to check whether enough buffer space is available. This check/request is done through the s_axis_tx_data_req_metadata interface. If the response from the TCP stack on the m_axis_tx_data_rsp interface is positive, the application can send the data through the s_axis_tx_data_req interface. If the response is negative, the application can retry by sending another request on the s_axis_tx_data_req_metadata interface.
Interface definition in HLS:
struct appTxMeta {
ap_uint<16> sessionID;
ap_uint<16> length;
};
struct appTxRsp {
ap_uint<16> sessionID;
ap_uint<16> length;
ap_uint<30> remaining_space;
ap_uint<2> error;
};
hls::stream<appTxMeta>& txDataReqMeta,
hls::stream<appTxRsp>& txDataRsp,
hls::stream<net_axis<WIDTH> >& txDataReq,
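The decision an application takes after receiving a response can be sketched as follows. The encoding of the error field is not described above, so this sketch assumes that error == 0 denotes a positive response; TxAction and nextTxAction are hypothetical names, and plain integers stand in for ap_uint.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ stand-ins for the HLS structs above (assumption for illustration).
struct AppTxMeta {
    std::uint16_t sessionID;
    std::uint16_t length;
};
struct AppTxRsp {
    std::uint16_t sessionID;
    std::uint16_t length;
    std::uint32_t remaining_space;  // 30 bits in the HLS definition
    std::uint8_t error;             // 2 bits in the HLS definition
};

enum class TxAction { SendPayload, RetryRequest };

// Decide what to do after the stack answered a transmit request.
TxAction nextTxAction(const AppTxMeta& request, const AppTxRsp& response) {
    // Precondition of this sketch: the response belongs to the request.
    assert(request.sessionID == response.sessionID);
    // Assumption (not stated in the text above): error == 0 is a positive response.
    return response.error == 0 ? TxAction::SendPayload : TxAction::RetryRequest;
}
```

On SendPayload the application would stream exactly request.length bytes on txDataReq; on RetryRequest it would send the same metadata again, possibly after a delay.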
Waveform of requesting a data transmit and transmitting the data.
The new RDMA version (02/2024) is adapted from the one used in Coyote (https://github.com/fpgasystems/Coyote) and is fully compatible with the RoCEv2 standard, and thus able to communicate with standard NICs (such as Mellanox cards). It has been shown to run at 100 Gbit/s, allowing for low latency and high throughput comparable to the results achievable with such ASIC-based NICs.
The whole included design is defined in a Block Diagram as follows:
The packet processing pipeline is coded in Vitis-HLS and included in "roce_v2_ip", consisting of separate modules for the IPv4-, UDP- and InfiniBand-Headers. In the top-level-module "roce_stack.sv", this pipeline is then combined with HDL-coded ICRC-calculation and RDMA-flow control.
For actual usage of the RDMA-stack, it needs to be integrated into a full FPGA-networking stack and combined with some kind of shell that enables DMA-exchange with the host for both commands and memory access. An example for that is Coyote with a networking stack as depicted in the following block diagram:
The RDMA-stack presented in this repository is the blue roce_stack. Surrounding modules would need to be provided by users to integrate the RDMA-capability in their projects.
To be able to integrate the RDMA-stack into a shell-design, one must be aware of the essential interfaces. These are the following:
The two ports s_axis_rx and m_axis_tx are 512-bit AXI4-Stream interfaces used to transfer network traffic between the shell and the RDMA-stack. With the Ethernet-Header already processed in earlier parts of the networking environment, the RDMA-core expects a leading IP-Header, followed by a UDP- and InfiniBand-Header, the payload and a final ICRC-checksum.
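To get a feeling for the packet layout on these ports, the following sketch computes the size of a RoCEv2 packet as seen after the Ethernet header has been removed, and the number of 512-bit words it occupies. The constants are the standard header sizes (IPv4 without options, UDP, InfiniBand BTH, RETH, ICRC); the function names are hypothetical, and the sketch assumes the payload is already padded to a multiple of 4 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Header sizes in bytes. IPv4 is assumed to carry no options.
const std::size_t kIpv4HeaderBytes = 20;
const std::size_t kUdpHeaderBytes = 8;
const std::size_t kBthBytes = 12;   // InfiniBand Base Transport Header
const std::size_t kRethBytes = 16;  // RDMA Extended Transport Header
const std::size_t kIcrcBytes = 4;   // trailing invariant CRC
const std::size_t kStreamWordBytes = 64;  // one 512-bit AXI4-Stream word

// Total packet size: IP + UDP + BTH + optional extended headers + payload + ICRC.
std::size_t rocePacketBytes(std::size_t payloadBytes, std::size_t extendedHeaderBytes) {
    // This sketch ignores pad bytes, so the payload must be 4-byte aligned.
    assert(payloadBytes % 4 == 0);
    return kIpv4HeaderBytes + kUdpHeaderBytes + kBthBytes + extendedHeaderBytes +
           payloadBytes + kIcrcBytes;
}

// Number of stream words needed; the final word may be partially filled.
std::size_t streamWords(std::size_t packetBytes) {
    return (packetBytes + kStreamWordBytes - 1) / kStreamWordBytes;
}
```

For example, a packet carrying a RETH and 1024 bytes of payload is 1084 bytes long and occupies 17 words, the last of which is only partially valid.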
RDMA operates on so-called Queue Pairs at remote communication nodes. The initial connection between Queues has to be established out-of-band (e.g. via TCP/IP) by the hosts. The exchanged meta-information then needs to be communicated to the RDMA-stack via the two meta-interfaces s_axis_qp_interface and s_axis_qp_conn_interface. The interface definition in HLS looks like this:
typedef enum {RESET, INIT, READY_RECV, READY_SEND, SQ_ERROR, ERROR} qpState;
struct qpContext {
qpState newState;
ap_uint<24> qp_num;
ap_uint<24> remote_psn;
ap_uint<24> local_psn;
ap_uint<16> r_key;
ap_uint<48> virtual_address;
};
struct ifConnReq {
ap_uint<16> qpn;
ap_uint<24> remote_qpn;
ap_uint<128> remote_ip_address;
ap_uint<16> remote_udp_port;
};
hls::stream<qpContext>& s_axis_qp_interface,
hls::stream<ifConnReq>& s_axis_qp_conn_interface,
The actual RDMA-operations are handled between the shell and the RDMA-core through the interfaces s_rdma_sq for initiated RDMA-operations and m_rdma_ack to signal automatically generated ACKs from the stack to the shell.
Definition of s_rdma_sq:
- 20 Bit rsrvd
- 64 Bit message_size
- 64 Bit local vaddr
- 64 Bit remote vaddr
- 4 Bit offs
- 24 Bit ssn
- 4 Bit cmplt
- 4 Bit last
- 4 Bit mode
- 4 Bit host
- 12 Bit qpn
- 8 Bit opcode (e.g. RDMA_WRITE, RDMA_READ, RDMA_SEND etc.)
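When wiring s_rdma_sq into a shell, it can help to derive bit positions from the field list instead of hard-coding them. The sketch below sums the widths exactly as listed above and computes the LSB position of each field. The packing order is an assumption (first listed field in the most significant bits) and should be verified against the RTL; Field, totalWidth and lsbOffset are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct Field {
    const char* name;
    unsigned width;  // in bits
};

// Widths copied from the s_rdma_sq list above, in listed order.
static const Field kRdmaSqFields[] = {
    {"rsrvd", 20}, {"message_size", 64}, {"local vaddr", 64}, {"remote vaddr", 64},
    {"offs", 4},   {"ssn", 24},          {"cmplt", 4},        {"last", 4},
    {"mode", 4},   {"host", 4},          {"qpn", 12},         {"opcode", 8},
};
static const std::size_t kRdmaSqFieldCount =
    sizeof(kRdmaSqFields) / sizeof(kRdmaSqFields[0]);

// Sum of all listed field widths.
unsigned totalWidth(const Field* fields, std::size_t count) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += fields[i].width;
    }
    return sum;
}

// LSB position of field `index`, ASSUMING the first listed field occupies the
// most significant bits (hypothetical ordering, not confirmed by the text).
unsigned lsbOffset(const Field* fields, std::size_t count, std::size_t index) {
    assert(index < count);
    unsigned offset = 0;
    for (std::size_t i = index + 1; i < count; ++i) {
        offset += fields[i].width;
    }
    return offset;
}
```

Under that ordering assumption, opcode would sit in the lowest 8 bits and qpn directly above it. The same helpers can be applied to the m_rdma_ack and m_rdma_rd_req / m_rdma_wr_req field lists.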
Definition of m_rdma_ack:
- 24 Bit ssn
- 4 Bit vfid - Coyote-specific
- 8 Bit pid - Coyote-specific
- 4 Bit cmplt
- 4 Bit rd
The RDMA stack as published here and originally developed for use with the Coyote-shell is designed to use the QDMA IP-core. Therefore, the memory-control interfaces m_rdma_rd_req and m_rdma_wr_req are designed to hold all information required for communication with those cores. The two data interfaces for transportation of memory content, m_axis_rdma_wr and s_axis_rdma_rd, are 512-bit AXI4-Stream interfaces.
Definition of m_rdma_rd_req / m_rdma_wr_req:
- 4 Bit vfid
- 48 Bit vaddr
- 4 Bit sync
- 4 Bit stream
- 8 Bit pid
- 28 Bit len
- 4 Bit host
- 12 Bit dest
- 4 Bit ctl
The following flow chart shows an exemplary RDMA WRITE exchange between a remote node with an ASIC-based NIC and a local node with an FPGA-NIC implementing the RDMA-stack. It depicts the FPGA-internal communication between the RDMA-stack and the shell as well as the network data exchange between the two nodes:
- D. Sidler, G. Alonso, M. Blott, K. Karras et al., Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware, in FCCM'15
- D. Sidler, Z. Istvan, G. Alonso, Low-Latency TCP/IP Stack for Data Center Applications, in FPL'16
- D. Sidler, Z. Wang, M. Chiosa, A. Kulkarni, G. Alonso, StRoM: smart remote memory, in EuroSys'20
If you use the TCP/IP or RDMA stacks in your project please cite one of the following papers and/or link to the github project:
@inproceedings{DBLP:conf/fccm/SidlerABKVC15,
author = {David Sidler and
Gustavo Alonso and
Michaela Blott and
Kimon Karras and
Kees A. Vissers and
Raymond Carley},
title = {Scalable 10Gbps {TCP/IP} Stack Architecture for Reconfigurable Hardware},
booktitle = {23rd {IEEE} Annual International Symposium on Field-Programmable Custom
Computing Machines, {FCCM} 2015, Vancouver, BC, Canada, May 2-6, 2015},
pages = {36--43},
publisher = {{IEEE} Computer Society},
year = {2015},
doi = {10.1109/FCCM.2015.12}
}
@inproceedings{DBLP:conf/fpl/SidlerIA16,
author = {David Sidler and
Zsolt Istv{\'{a}}n and
Gustavo Alonso},
title = {Low-latency {TCP/IP} stack for data center applications},
booktitle = {26th International Conference on Field Programmable Logic and Applications,
{FPL} 2016, Lausanne, Switzerland, August 29 - September 2, 2016},
pages = {1--4},
publisher = {{IEEE}},
year = {2016},
doi = {10.1109/FPL.2016.7577319}
}
@inproceedings{DBLP:conf/eurosys/SidlerWCKA20,
author = {David Sidler and
Zeke Wang and
Monica Chiosa and
Amit Kulkarni and
Gustavo Alonso},
title = {StRoM: smart remote memory},
booktitle = {EuroSys '20: Fifteenth EuroSys Conference 2020, Heraklion, Greece,
April 27-30, 2020},
pages = {29:1--29:16},
publisher = {{ACM}},
year = {2020},
doi = {10.1145/3342195.3387519}
}
@PHDTHESIS{sidler2019innetworkdataprocessing,
author = {Sidler, David},
publisher = {ETH Zurich},
year = {2019-09},
copyright = {In Copyright - Non-Commercial Use Permitted},
title = {In-Network Data Processing using FPGAs},
}
@INPROCEEDINGS{sidler2020strom,
author = {Sidler, David and Wang, Zeke and Chiosa, Monica and Kulkarni, Amit and Alonso, Gustavo},
booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems},
title = {StRoM: Smart Remote Memory},
doi = {10.1145/3342195.3387519},
}
- David Sidler, Systems Group, ETH Zurich
- Monica Chiosa, Systems Group, ETH Zurich
- Fabio Maschi, Systems Group, ETH Zurich
- Zhenhao He, Systems Group, ETH Zurich
- Mario Ruiz, HPCN Group of UAM, Spain
- Kimon Karras, former Researcher at Xilinx Research, Dublin
- Lisa Liu, Xilinx Research, Dublin