
EtherateMT Transmit Overview


This is a high-level description of the path network data takes from a user-land program to a network device (I have extended the text from the PackageCloud Blog, so not all of this writing is my own):

  1. Set up a socket with a call to socket(), which maps to a system call (in glibc). The socket family AF_PACKET is used, so ultimately struct net_proto_family packet_family_ops is looked up and packet_create() inside af_packet.c is called (see the first sketch after this list).

  2. Data is written to the socket using a call such as sendto() or sendmsg(), which maps to a sys call, e.g. SYSCALL_DEFINE6(sendto).

  3. Data passes through the socket subsystem on to the socket's protocol family implementation (in the EtherateMT case AF_PACKET, which is defined in af_packet.c): sendto() > sock_sendmsg() > sock_sendmsg_nosec() > sock->ops->sendmsg() = packet_sendmsg() > tpacket_snd() (this lands processing of the sendto() call in af_packet.c).

  4. The output queue is chosen using XPS (if enabled), otherwise a hash function is used. In the case of the ixgbe driver, for example, when not using FCoE, ixgbe_select_queue() actually falls back to the same hash function as __packet_pick_tx_queue(). Processing is in kernel-land now, running on a CPU assigned by XPS (if configured; see the XPS sketch after this list). Spin locks are used in af_packet.c (mostly for Rx rather than Tx), which means the kernel thread processing this sendto() call can stall the CPU thread even though EtherateMT called sendto(..., MSG_DONTWAIT, ...). tpacket_snd() also takes a mutex on the packet_sock struct for the duration of the Tx loop (see the note on tpacket_snd() further down).

  5. The packet socket transmit function, po->xmit(), is called.

    • If the QDISC bypass is not enabled, the data is passed on to the queue discipline (qdisc) attached to the output device. EtherateMT calls setsockopt(PACKET_QDISC_BYPASS) > packet_setsockopt(PACKET_QDISC_BYPASS) if supported, so that queuing disciplines are skipped (see the qdisc bypass sketch after this list).

    • If the qdisc path is taken, the qdisc will either transmit the data directly if it can, or queue it up to be sent during the NET_TX softirq.

    • Eventually the data is handed down to the driver from the qdisc.

  6. The device driver's transmit function, ndo_start_xmit(), is called.

  7. The driver creates the needed DMA mappings so the device can read the data from RAM. The driver checks whether the number of Tx descriptors needed to transmit the sk_buff will fit into the transmit queue: igb_maybe_stop_tx().

  8. The driver signals to the device that the data is ready to be transmitted.

  9. The device fetches the data from RAM and transmits it.

  10. Once transmission is complete the device raises an interrupt to signal transmit completion.

  11. The driver’s registered IRQ handler for transmit completion runs. For many devices this handler simply triggers the NAPI poll loop to start running via the NET_RX softirq.

  12. The poll function runs via a softirq and calls down into the driver to unmap DMA regions and free packet data (sk_buffs can now be freed, which in turn frees up data in the socket queue).
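
To make steps 1 to 3 concrete, here is a minimal user-land sketch (it is not EtherateMT's actual code): create an AF_PACKET socket, address it to an interface through struct sockaddr_ll and push one frame with sendto(). The interface name "eth0", the source MAC and the EtherType are placeholders, and CAP_NET_RAW (i.e. root) is required.

```c
/* Minimal AF_PACKET send sketch (steps 1-3): create a raw packet socket,
 * address it to an interface, and push a frame with sendto(). Error
 * handling is reduced to perror() for brevity; "eth0" and the frame
 * contents are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>

int main(void)
{
    /* Step 1: socket() with AF_PACKET lands in packet_create() in af_packet.c */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd == -1) { perror("socket"); return EXIT_FAILURE; }

    /* Direct the frame out of a specific interface via struct sockaddr_ll */
    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex("eth0");   /* placeholder interface */
    sll.sll_halen    = ETH_ALEN;

    /* A minimal broadcast test frame: dst MAC, src MAC, EtherType, zero padding */
    unsigned char frame[ETH_ZLEN] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff,      /* destination MAC (broadcast)  */
        0x00, 0x00, 0x5e, 0x00, 0x53, 0x01,      /* example source MAC           */
        0x88, 0xb5,                              /* local experimental EtherType */
    };

    /* Steps 2-3: sendto() -> sock_sendmsg() -> packet_sendmsg().
     * MSG_DONTWAIT makes the call non-blocking, as EtherateMT does. */
    if (sendto(fd, frame, sizeof(frame), MSG_DONTWAIT,
               (struct sockaddr *)&sll, sizeof(sll)) == -1)
        perror("sendto");

    close(fd);
    return EXIT_SUCCESS;
}
```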
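
Step 4 mentions XPS. Transmit Packet Steering is configured per Tx queue by writing a hexadecimal CPU mask to sysfs; the sketch below pins queue tx-0 of eth0 to CPU 0. The interface name, queue number and mask are just placeholders, it needs root, and this is system configuration rather than something EtherateMT does itself.

```c
/* Sketch: steer transmissions from queue tx-0 of "eth0" to CPU 0 by writing
 * a hex CPU bitmap to the queue's xps_cpus file in sysfs. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/class/net/eth0/queues/tx-0/xps_cpus";
    FILE *f = fopen(path, "w");

    if (!f) { perror("fopen"); return 1; }

    /* "1" = CPU 0 only, "3" = CPUs 0-1, "f" = CPUs 0-3, and so on. */
    if (fprintf(f, "1\n") < 0)
        perror("fprintf");

    fclose(f);
    return 0;
}
```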
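
Step 5 refers to the qdisc bypass. On kernels that support it, enabling it looks roughly like the sketch below; fd is assumed to be an AF_PACKET socket such as the one created in the first sketch.

```c
/* Sketch: enable the qdisc bypass referenced in step 5. PACKET_QDISC_BYPASS
 * exists on Linux 3.14+; older kernels return an error from setsockopt()
 * and the caller simply carries on using the qdisc path. */
#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

int enable_qdisc_bypass(int fd)
{
    int one = 1;

    if (setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
                   &one, sizeof(one)) == -1) {
        perror("setsockopt(PACKET_QDISC_BYPASS)");
        return -1;   /* not supported or not permitted; qdisc stays in the path */
    }
    return 0;        /* po->xmit() now hands frames straight to the driver */
}
```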

When sending data at a high rate the NET_RX softirq can be seen using a lot of CPU cycles. This is because the post-Tx packet cleanup is performed when a NET_RX softirq fires, not a NET_TX softirq (for example, freeing skbs and setting the status of a successfully Tx'ed packet within the PACKET_MMAP ring to TP_STATUS_AVAILABLE). This happens in tpacket_destruct_skb().
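
For illustration, this is roughly how that flag is observed from user-land when using a PACKET_MMAP Tx ring: a frame slot can only be refilled once tpacket_destruct_skb() has handed it back. This is only a sketch; it assumes a TPACKET_V2 Tx ring that has already been set up and mmap()ed, and try_queue_frame() with its parameters are made-up names.

```c
/* Sketch: queue one packet into a mapped TPACKET_V2 Tx ring frame, but only
 * if the kernel has already returned the frame to TP_STATUS_AVAILABLE. */
#include <stdint.h>
#include <string.h>
#include <linux/if_packet.h>

int try_queue_frame(uint8_t *ring, unsigned int frame_nr,
                    unsigned int frame_size,
                    const uint8_t *pkt, unsigned int pkt_len)
{
    struct tpacket2_hdr *hdr =
        (struct tpacket2_hdr *)(ring + (size_t)frame_nr * frame_size);

    /* The kernel flips the frame back to TP_STATUS_AVAILABLE in
     * tpacket_destruct_skb() once the skb built from it has been freed. */
    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;                      /* frame still owned by the kernel */

    /* Copy the packet in after the frame header and mark it ready to send;
     * a subsequent send(fd, NULL, 0, MSG_DONTWAIT) kicks tpacket_snd(). */
    memcpy((uint8_t *)hdr + TPACKET2_HDRLEN - sizeof(struct sockaddr_ll),
           pkt, pkt_len);
    hdr->tp_len = pkt_len;
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```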

tpacket_snd() is the only function in the send path which uses a mutex lock. From near the start of tpacket_snd() the packet_sock struct (which contains the Tx/Rx rings) is locked with mutex_lock(&po->pg_vec_lock); and the lock is held until the do loop has finished emptying the Tx ring.

Why can spin locks be seen when sending data at a high rate (meaning a high number of NET_RX softirqs are being processed)? Isn't the Tx path supposed to be lock free?