EtherateMT Transmit Overview
This is a high-level description of the path network data takes from a user-land program to a network device. (I have extended the text from the PackageCloud Blog, so not all of this writing is my own.)
- Set up a socket with a call to `socket()`, which maps to a system call (in glibc). The socket family `AF_PACKET` is used, so ultimately `struct net_proto_family packet_family_ops` is looked up and `packet_create()` inside `af_packet.c` is called.
- Data is written to the socket using a call to `sendto()` or `sendmsg()` etc., which map to system calls, e.g. `SYSCALL_DEFINE6(sendto)`.
- Data passes through the socket subsystem on to the socket's protocol family system (in the EtherateMT case, `AF_PACKET`, which is defined in `af_packet.c`): `sendto()` > `sock_sendmsg()` > `sock_sendmsg_nosec()` > `sock->ops->sendmsg()` = `packet_sendmsg()` > `tpacket_snd()` (this lands processing of the `sendto()` call in `af_packet.c`).
- The output queue is chosen using XPS (if enabled); otherwise a hash function is used. In the case of the ixgbe driver, for example, when not using FCoE, `ixgbe_select_queue()` actually falls back to the same hash function, `__packet_pick_tx_queue()`. Processing is in kernel-land now and using a CPU assigned by XPS (if configured). Spin locks are used in `af_packet.c` (mostly for Rx rather than Tx), which means that the kernel code processing this `sendto()` call can stall the CPU thread even though EtherateMT called `sendto(..., MSG_DONTWAIT, ...)`. From near the start of `tpacket_snd()`, the `packet_sock` struct (which contains the Tx/Rx rings) is locked with `mutex_lock(&po->pg_vec_lock);` until after the do loop has completed emptying the Tx ring.
- The packet socket transmit function, `po->xmit()`, is called.
  - If the qdisc bypass is not enabled, the data is passed on to the queue discipline (qdisc) attached to the output device. EtherateMT calls `setsockopt(PACKET_QDISC_BYPASS)` > `packet_setsockopt(PACKET_QDISC_BYPASS)` if supported, so that queuing disciplines are skipped.
  - If the qdisc is in use, it will either transmit the data directly if it can, or queue it up to be sent during the `NET_TX` softirq.
  - Eventually the data is handed down to the driver from the qdisc.
- The device driver's transmit function, `ndo_start_xmit()`, is called.
- The driver creates the needed DMA mappings so the device can read the data from RAM. The driver checks whether the number of Tx descriptors needed to transmit the `sk_buff` will fit into the transmit queue: `igb_maybe_stop_tx()`.
- The driver signals to the device that the data is ready to be transmitted.
- The device fetches the data from RAM and transmits it.
- Once transmission is complete, the device raises an interrupt to signal transmit completion.
- The driver's registered IRQ handler for transmit completion runs. For many devices this handler simply triggers the NAPI poll loop to start running via the `NET_RX` softirq.
- The poll function runs via a softirq and calls down into the driver to unmap DMA regions and free packet data (`sk_buff`s can now be freed, which in turn will free up data in the socket queue).
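
The user-space side of the steps above can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not EtherateMT's actual code: the helper names (`fill_sockaddr_ll`, `open_bypass_socket`, `send_frame_nonblock`) are hypothetical, the socket type `SOCK_RAW` and protocol `ETH_P_ALL` are assumed, and opening an `AF_PACKET` socket normally requires `CAP_NET_RAW`.

```c
#include <arpa/inet.h>        /* htons() */
#include <assert.h>
#include <linux/if_ether.h>   /* ETH_P_ALL, ETH_ALEN */
#include <linux/if_packet.h>  /* struct sockaddr_ll, PACKET_QDISC_BYPASS */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe the egress interface for sendto(). */
void fill_sockaddr_ll(struct sockaddr_ll *sll, int ifindex)
{
    assert(sll != NULL);
    memset(sll, 0, sizeof(*sll));
    sll->sll_family   = AF_PACKET;
    sll->sll_protocol = htons(ETH_P_ALL);
    sll->sll_ifindex  = ifindex;
    sll->sll_halen    = ETH_ALEN;
}

/*
 * Hypothetical helper: open an AF_PACKET socket and ask the kernel to skip
 * the qdisc layer. Returns the fd, or -1 with errno set. *bypass_on is set
 * to 1 if the kernel accepted PACKET_QDISC_BYPASS, 0 if it refused (for
 * example, on a kernel that predates the option).
 */
int open_bypass_socket(int *bypass_on)
{
    int one = 1;
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    if (fd < 0)
        return -1;

    *bypass_on = (setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
                             &one, sizeof(one)) == 0);
    return fd;
}

/* Hypothetical helper: non-blocking send of one complete Ethernet frame. */
ssize_t send_frame_nonblock(int fd, const void *frame, size_t len,
                            const struct sockaddr_ll *sll)
{
    return sendto(fd, frame, len, MSG_DONTWAIT,
                  (const struct sockaddr *)sll, sizeof(*sll));
}
```

Note that a plain `sendto()` with a buffer, as in `send_frame_nonblock()`, is handled by `packet_snd()`; `packet_sendmsg()` dispatches to `tpacket_snd()` only when a Tx ring has been configured on the socket with `PACKET_TX_RING`, in which case `sendto()` is typically called with a `NULL` buffer and a length of 0 to flush the ring.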
When sending data at a high rate, the `NET_RX` softirq can be seen to be using a lot of CPU cycles. This is because the post-Tx packet cleanup is performed when a `NET_RX` softirq fires, not a `NET_TX` softirq (for example, freeing `skb`s and setting the packet block status for a successfully transmitted packet within the `PACKET_MMAP` ring to `TP_STATUS_AVAILABLE`). This happens in `tpacket_destruct_skb()`.
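
Seen from user space, the status handshake on the `PACKET_MMAP` Tx ring might look like the sketch below. It assumes the `TPACKET_V2` frame layout (`struct tpacket2_hdr`) and a ring that has already been `mmap()`ed with a fixed frame size; whether EtherateMT uses V2 is an assumption here, and the helper names (`tx_frame`, `tx_frames_available`, `tx_frame_queue`) are hypothetical. User space may only fill frames whose status is `TP_STATUS_AVAILABLE`, and it hands a frame to the kernel by setting `TP_STATUS_SEND_REQUEST`.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket2_hdr, TP_STATUS_*, TPACKET2_HDRLEN */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: header of frame i in a ring of fixed-size frames. */
struct tpacket2_hdr *tx_frame(void *ring, unsigned int frame_size,
                              unsigned int i)
{
    return (struct tpacket2_hdr *)((uint8_t *)ring + (size_t)i * frame_size);
}

/* Count the frames that user space currently owns and may fill. */
unsigned int tx_frames_available(void *ring, unsigned int frame_nr,
                                 unsigned int frame_size)
{
    unsigned int n = 0;

    for (unsigned int i = 0; i < frame_nr; i++) {
        struct tpacket2_hdr *hdr = tx_frame(ring, frame_size, i);

        if (__atomic_load_n(&hdr->tp_status, __ATOMIC_ACQUIRE)
                == TP_STATUS_AVAILABLE)
            n++;
    }
    return n;
}

/*
 * Copy one packet into frame i and hand the frame to the kernel.
 * Returns 0 on success, or -1 if the packet does not fit or the frame is
 * still owned by the kernel (its status is not TP_STATUS_AVAILABLE).
 */
int tx_frame_queue(void *ring, unsigned int frame_size, unsigned int i,
                   const void *pkt, unsigned int len)
{
    /* Offset of the packet data within a V2 Tx frame. */
    const unsigned int data_off = TPACKET2_HDRLEN - sizeof(struct sockaddr_ll);
    struct tpacket2_hdr *hdr = tx_frame(ring, frame_size, i);

    assert(pkt != NULL);
    if (frame_size <= data_off || len > frame_size - data_off)
        return -1;
    if (__atomic_load_n(&hdr->tp_status, __ATOMIC_ACQUIRE)
            != TP_STATUS_AVAILABLE)
        return -1;

    memcpy((uint8_t *)hdr + data_off, pkt, len);
    hdr->tp_len = len;
    /* Publish the frame only after its contents are written. */
    __atomic_store_n(&hdr->tp_status, TP_STATUS_SEND_REQUEST, __ATOMIC_RELEASE);
    return 0;
}
```

A frame queued this way stays unavailable to user space until the kernel's cleanup path sets its status back to `TP_STATUS_AVAILABLE`, which is why a sender that outruns the cleanup sees `tx_frames_available()` drop towards zero.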
`tpacket_snd()` is the only function in the send path which uses a mutex lock. From near the start of `tpacket_snd()`, the `packet_sock` struct (which contains the Tx/Rx rings) is locked with `mutex_lock(&po->pg_vec_lock);` until after the do loop has completed emptying the Tx ring.
Why can spin locks be seen when sending data at a high rate (meaning a high number of `NET_RX` softirqs are being processed)? Isn't the Tx path lock-free?