KaiGai Kohei edited this page Oct 26, 2022 · 14 revisions

What is Pcap2Arrow

Pcap2Arrow is a standalone command-line tool that captures network packets from a network interface (or reads previously captured PCAP files) and converts them into Apache Arrow files on the fly.

Apache Arrow is a column-oriented data format designed for analytics workloads. It is supported by various data analytics tools, such as Python (PyArrow) and PG-Strom (Arrow_Fdw).

System Overview

Characteristics

  • Pcap2Arrow internally uses PF_RING to capture high-speed network traffic with minimal packet loss.
  • Designed to take full advantage of multi-core CPUs through multi-threading.
  • Capable of capturing up to 100Gb traffic (depending on your hardware).
  • Supports structured schemas for TCP, UDP, and ICMP over both IPv4 and IPv6.
  • No need to import the captured data into a database system; just copy the files.

Installation

Prerequisites

  • libpcap
    • run dnf install libpcap-devel
  • pfring (library and kernel module)
  • A Linux distribution supported by PF_RING
  • GNU C Compiler

Steps to install

$ git clone https://github.com/heterodb/pg-strom.git
$ cd pg-strom
$ make -C pcap2arrow
$ sudo make -C pcap2arrow install

Usage

$ ./pcap2arrow -h
usage: pcap2arrow [OPTIONS] [<pcap files>...]

OPTIONS:
  -i|--input=DEVICE
       specifies a network device to capture packet.
     --num-queues=N_QUEUE : num of PF-RING queues.
  -o|--output=<output file; with format>
       filename format can contains:
         %i : interface name
         %Y : year in 4-digits
         %y : year in 2-digits
         %m : month in 2-digits
         %d : day in 2-digits
         %H : hour in 2-digits
         %M : minute in 2-digits
         %S : second in 2-digits
         %q : sequence number for each output files
       default is '/tmp/pcap_%i_%y%m%d_%H%M%S.arrow'
  -f|--force : overwrite file, even if exists
     --no-payload: disables capture of payload
     --parallel-write=N_FILES
       opens multiple output files simultaneously (default: 1)
     --chunk-size=SIZE : size of record batch (default: 128MB)
     --direct-io : enables O_DIRECT for write-i/o
  -l|--limit=LIMIT : (default: no limit)
  -p|--protocol=PROTO
       PROTO is a comma separated string contains
       the following tokens:
         tcp4, udp4, icmp4, ipv4, tcp6, udp6, icmp6, ipv6
       (default: 'tcp4,udp4,icmp4')
     --composite-options:
        write out IPv4,IPv6 and TCP options as an array of composite values
     --interface-id
        enables the field to embed interface-id attribute, if source is
        PCAP-NG files. Elsewhere, NULL shall be assigned here.
  -r|--rule=RULE : packet filtering rules
       (default: none; valid only capturing mode)
  -s|--stat=INTERVAL
       enables to print statistics per INTERVAL
  -t|--threads=N_THREADS
     --pcap-threads=N_THREADS
  -h|--help    : shows this message

  Copyright (C) 2020-2021 HeteroDB,Inc <[email protected]>
  Copyright (C) 2020-2021 KaiGai Kohei <[email protected]>
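
To illustrate the -o filename format listed in the usage text, here is a minimal Python sketch of how such a pattern could expand. It is not Pcap2Arrow's actual implementation: the function name expand_output_name is hypothetical, and rendering %q as a plain decimal number is an assumption.

```python
from datetime import datetime

def expand_output_name(pattern, interface, when, seq):
    """Expand the %-tokens from the usage text (illustrative sketch only)."""
    table = {
        "i": interface,                  # interface name
        "Y": f"{when.year:04d}",         # year in 4 digits
        "y": f"{when.year % 100:02d}",   # year in 2 digits
        "m": f"{when.month:02d}",
        "d": f"{when.day:02d}",
        "H": f"{when.hour:02d}",
        "M": f"{when.minute:02d}",
        "S": f"{when.second:02d}",
        "q": str(seq),                   # assumption: plain decimal sequence number
    }
    out = []
    pos = 0
    while pos < len(pattern):
        ch = pattern[pos]
        if ch == "%" and pos + 1 < len(pattern) and pattern[pos + 1] in table:
            out.append(table[pattern[pos + 1]])
            pos += 2
        else:
            out.append(ch)               # copy everything else verbatim
            pos += 1
    return "".join(out)

DEFAULT_PATTERN = "/tmp/pcap_%i_%y%m%d_%H%M%S.arrow"
example = expand_output_name(DEFAULT_PATTERN, "ens3",
                             datetime(2021, 2, 16, 16, 1, 16), 0)
print(example)  # /tmp/pcap_ens3_210216_160116.arrow
```

Including a time token or %q in the pattern presumably keeps successive output files from colliding when more than one file is written.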

Examples

  • Capture from ens3 and write the packets to /tmp/mytest.arrow (overwriting it if it exists), using the schema definition for TCP and UDP over IPv4.
    • $ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4
  • In addition to the above, capture only TCP SYN packets destined for port 22. Filtering is done by the kernel module, not by the application.
    • $ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --rule '(tcp[13] & 2 != 0) and tcp dst port 22'
  • In addition to the above, write out only the headers. The payload portion is dropped.
    • $ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --no-payload --rule '(tcp[13] & 2 != 0) and tcp dst port 22'
  • In addition to the first example, print statistics every 2 seconds.
    • $ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --stat 2
  • In addition to the above, use O_DIRECT mode for faster write I/O. (A large memory buffer, NVMe-SSD grade storage, and more than 10 CPU threads are recommended.)
    • $ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --stat 2 --direct-io
  • Convert previously captured PCAP files. Multiple input PCAP files are consolidated into the single output file specified by -o FILENAME.
    • ./pcap2arrow -f -o /tmp/mytest.arrow ~/pcaps/20210216_mynetwork.*.pcap
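
The filter rule '(tcp[13] & 2 != 0) and tcp dst port 22' used in the examples tests the SYN bit in byte 13 of the TCP header and the destination port. The following Python sketch mimics that check on a raw 20-byte TCP header. The helper names make_tcp_header and matches_rule are hypothetical and only serve the illustration; the real filtering happens inside the capture stack, not in Python.

```python
import struct

TCP_FLAG_SYN = 0x02  # the bit tested by `tcp[13] & 2`

def make_tcp_header(src_port, dst_port, flags):
    """Build a minimal 20-byte TCP header (no options) for testing."""
    # Fields: src port, dst port, seq, ack, data offset (5 words) << 4,
    # flags, window, checksum, urgent pointer.
    return struct.pack("!HHIIBBHHH", src_port, dst_port, 0, 0,
                       5 << 4, flags, 65535, 0, 0)

def matches_rule(tcp_header):
    """Mimic '(tcp[13] & 2 != 0) and tcp dst port 22'."""
    dst_port = int.from_bytes(tcp_header[2:4], "big")
    syn_set = (tcp_header[13] & TCP_FLAG_SYN) != 0
    return syn_set and dst_port == 22

syn_to_ssh = make_tcp_header(40000, 22, 0x02)   # SYN to port 22
ack_to_ssh = make_tcp_header(40000, 22, 0x10)   # ACK only
syn_to_http = make_tcp_header(40000, 80, 0x02)  # SYN to another port

print(matches_rule(syn_to_ssh), matches_rule(ack_to_ssh), matches_rule(syn_to_http))
```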

Schema Definition

| Field Name | Data Type | Memo |
|---|---|---|
| timestamp | Timestamp(unit=us) | timestamp when packet is captured |
| dst_mac | FixedSizeBinary(width=6) | destination MAC address |
| src_mac | FixedSizeBinary(width=6) | source MAC address |
| ether_type | Uint16 | EtherType; IPv4=0x0800, IPv6=0x86dd |

Fields if IPv4 is available

| Field Name | Data Type | Memo |
|---|---|---|
| tos | Uint8 | Type of Service field. |
| ip_length | Uint16 | Length of IP packet including the header portion. |
| identifier | Uint16 | Identifier of the IP packet assigned by the source. |
| fragment | Uint16 | Fragment offset; including the flag bits (3-bits). |
| ttl | Uint8 | Time to Live |
| ip_checksum | Uint16 | IP header checksum. |
| src_addr | FixedSizeBinary(width=4) | Source IP address. |
| dst_addr | FixedSizeBinary(width=4) | Destination IP address. |
| ip_options | OptionItems | Variable-length IP options, if any. Data type depends on --composite-options. Look at the complete list of IPv4 options here. |
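
Because the fragment field carries the 3 flag bits together with the fragment offset, it has to be split before use. A minimal Python sketch follows, assuming the column holds the numeric value of the 16-bit IPv4 header field (flags in the top 3 bits, offset in units of 8 bytes in the lower 13 bits). The function name split_fragment is hypothetical.

```python
def split_fragment(fragment):
    """Split the 16-bit IPv4 flags/fragment-offset field (illustrative sketch)."""
    flags = (fragment >> 13) & 0x7       # top 3 bits: reserved, DF, MF
    offset_units = fragment & 0x1FFF     # lower 13 bits, in 8-byte units
    return {
        "dont_fragment": bool(flags & 0x2),
        "more_fragments": bool(flags & 0x1),
        "offset_bytes": offset_units * 8,
    }

print(split_fragment(0x4000))        # DF set, no offset
print(split_fragment(0x2000 | 185))  # MF set, offset of 185 * 8 = 1480 bytes
```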

Fields if IPv6 is available

| Field Name | Data Type | Memo |
|---|---|---|
| traffic_class | Uint8 | IPv6 Traffic class |
| flow_label | Uint32 | Flow label |
| hop_limit | Uint8 | Hop limit |
| src_addr6 | FixedSizeBinary(width=16) | Source IPv6 Address |
| dst_addr6 | FixedSizeBinary(width=16) | Destination IPv6 Address |
| ip6_options | OptionItems | Variable-length IPv6 options, if any. Data type depends on --composite-options. Look at the complete list of IPv6 options here. |

Fields if either IPv4 or IPv6 are available

| Field Name | Data Type | Memo |
|---|---|---|
| protocol | Uint8 | IP protocol number of the payload. |

Fields if TCP or UDP are available

| Field Name | Data Type | Memo |
|---|---|---|
| src_port | Uint16 | Source port number |
| dst_port | Uint16 | Destination port number |

Fields if TCP is available

| Field Name | Data Type | Memo |
|---|---|---|
| seq_nr | Uint32 | Sequence number |
| ack_nr | Uint32 | Acknowledgment number (if ACK set) |
| tcp_flags | Uint16 | TCP flags: FIN=0x0001, SYN=0x0002, RST=0x0004, PSH=0x0008, ACK=0x0010, URG=0x0020, ECE=0x0040, CWR=0x0080, NS=0x0100 |
| window_sz | Uint16 | TCP window size |
| tcp_checksum | Uint16 | TCP checksum |
| urgent_ptr | Uint16 | Urgent pointer (if URG set) |
| tcp_options | OptionItems | TCP options |
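
The tcp_flags value is a bitmask, so a single number can represent several flags at once. A small Python sketch that decodes it with the bit values listed for tcp_flags; the helper name decode_tcp_flags is hypothetical.

```python
# Bit values as listed for the tcp_flags field.
TCP_FLAG_NAMES = [
    (0x0001, "FIN"), (0x0002, "SYN"), (0x0004, "RST"),
    (0x0008, "PSH"), (0x0010, "ACK"), (0x0020, "URG"),
    (0x0040, "ECE"), (0x0080, "CWR"), (0x0100, "NS"),
]

def decode_tcp_flags(value):
    """Return the names of all flags set in a tcp_flags value."""
    return [name for bit, name in TCP_FLAG_NAMES if value & bit]

print(decode_tcp_flags(0x0012))  # SYN and ACK
print(decode_tcp_flags(0x0018))  # PSH and ACK
```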

Fields if UDP is available

| Field Name | Data Type | Memo |
|---|---|---|
| udp_length | Uint16 | Length of UDP portion (header+payload) |
| udp_checksum | Uint16 | UDP checksum |

Fields if ICMP is available

| Field Name | Data Type | Memo |
|---|---|---|
| icmp_type | Uint8 | ICMP Type |
| icmp_code | Uint8 | ICMP Subtype |
| icmp_checksum | Uint16 | ICMP checksum |

Fields if Payload is available

| Field Name | Data Type | Memo |
|---|---|---|
| payload | Binary | Payload of the packet. If the upper-layer protocol is known (tcp, udp, or icmp), this is the payload of that protocol. Otherwise, it is the L2 payload. |

IPv4/IPv6/TCP options, if --composite-options is given

When --composite-options is given, Pcap2Arrow writes out the IPv4/IPv6/TCP options field, which can contain multiple sub-fields, as an array of composite values of (Uint8, Binary).

Each sub-field in the IPv4 and TCP options has its type in the first octet and its length in the second octet, followed by a variable-length payload. Each sub-field is structured as a composite value (Uint8 opt_code, Binary opt_data), and the xxx_options field packs them as an array of composite values if the packet contains any options.
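
The type/length/payload layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Pcap2Arrow's own parser: the function name parse_options is hypothetical, and skipping the single-octet End-of-List (0) and No-Operation (1) options is an assumption about how padding is treated.

```python
def parse_options(raw):
    """Split raw IPv4/TCP option bytes into (type, data) pairs (sketch)."""
    items = []
    pos = 0
    while pos < len(raw):
        opt_type = raw[pos]
        if opt_type == 0:          # End of Option List: stop parsing
            break
        if opt_type == 1:          # No-Operation: single octet, no length
            pos += 1
            continue
        if pos + 1 >= len(raw):    # truncated: no length octet
            break
        opt_len = raw[pos + 1]     # length covers type, length, and data
        if opt_len < 2 or pos + opt_len > len(raw):
            break                  # malformed length
        items.append((opt_type, raw[pos + 2:pos + opt_len]))
        pos += opt_len
    return items

# NOP, NOP, then a Timestamp option (type 8, length 10, 8 data bytes).
sample = bytes([1, 1, 8, 10]) + bytes(range(8))
print(parse_options(sample))
```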

The sub-fields in IPv6 options are organized like a singly linked list. The fixed-length portion of the IPv6 header contains the identifier of the next sub-field that follows it. An IPv6 header can contain the following option fields; their layouts are described at the following links.
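
The linked-list structure of IPv6 extension headers can be walked as in the Python sketch below. It only handles the Hop-by-Hop (0), Routing (43), Fragment (44), and Destination Options (60) headers, and the function name walk_extension_headers is hypothetical. It shows the chaining idea and does not describe how Pcap2Arrow itself processes these headers.

```python
EXTENSION_HEADERS = {0: "hop-by-hop", 43: "routing", 44: "fragment", 60: "destination"}

def walk_extension_headers(first_next_header, data):
    """Follow the next-header chain.

    `first_next_header` is the Next Header value from the fixed IPv6 header,
    `data` is everything after the fixed 40-byte header.
    Returns (chain, upper_protocol, payload_offset).
    """
    chain = []
    next_header = first_next_header
    pos = 0
    while next_header in EXTENSION_HEADERS and pos + 8 <= len(data):
        following = data[pos]                 # first octet: next header type
        if next_header == 44:
            length = 8                        # Fragment header has a fixed size
        else:
            length = (data[pos + 1] + 1) * 8  # length in 8-octet units, minus the first
        if pos + length > len(data):
            break                             # truncated header
        chain.append((next_header, data[pos:pos + length]))
        next_header = following
        pos += length
    return chain, next_header, pos

# One Hop-by-Hop header (next header = 6/TCP, one PadN option), then the payload.
hop_by_hop = bytes([6, 0, 1, 4, 0, 0, 0, 0])
chain, upper, offset = walk_extension_headers(0, hop_by_hop + b"tcp-bytes")
print(chain, upper, offset)
```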

Fields if --interface-id option was supplied

| Field Name | Data Type | Memo |
|---|---|---|
| interface_id | Uint32 | Network interface identifier embedded in the interface description block (IDB) of PCAP-NG files. Elsewhere, NULL shall be assigned. |

Example

List network interfaces

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 1c:34:da:76:bb:8e brd ff:ff:ff:ff:ff:ff
    inet 192.168.55.106/24 brd 192.168.55.255 scope global noprefixroute ens3
       valid_lft forever preferred_lft forever
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:25:90:5e:d0:66 brd ff:ff:ff:ff:ff:ff
    inet 192.168.77.106/24 brd 192.168.77.255 scope global noprefixroute eno1
       valid_lft forever preferred_lft forever
    inet6 2405:6580:3501:1d00:5c4c:b395:c5d4:8249/64 scope global dynamic noprefixroute
       valid_lft 2591929sec preferred_lft 604729sec
    inet6 fe80::4ef1:6a57:e0b9:a548/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

Try to capture ens3, with schema definition for TCP/IPv4

$ sudo pcap2arrow -i ens3 -p tcp4 -o /tmp/mytest.arrow --stat 2 --composite-options
2021-02-16   <# Recv> <# Drop> <Total Sz> <# IPv4>  <# TCP>
 16:01:16          0        0         0B        0        0
 16:01:18         26        0      4953B       26       26
 16:01:20         18        0      1404B       18       18
 16:01:22         16        0      1236B       16       16
 16:01:24          9        0       702B        9        9
 16:01:26         29        0      5211B       29       29
 16:01:28          8        0       660B        8        8
^CStats total:
Recv packets: 106
Drop packets: 0
Total bytes: 14166
IPv4 packets: 106
TCP packets: 106
$ ls -l /tmp/mytest.arrow
-rw-r--r--. 1 root root 25190 Feb 16 16:01 /tmp/mytest.arrow

Look at the captured data using PyArrow

$ python3
Python 3.6.8 (default, Aug 24 2020, 17:57:11)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader('/tmp/mytest.arrow')
>>> X.schema
timestamp: timestamp[us]
dst_mac: fixed_size_binary[6]
  -- field metadata --
  pg_type: 'pg_catalog.macaddr'
src_mac: fixed_size_binary[6]
  -- field metadata --
  pg_type: 'pg_catalog.macaddr'
ether_type: uint16
tos: uint8
ip_length: uint16
identifier: uint16
fragment: uint16
ttl: uint8
ip_checksum: uint16
src_addr: fixed_size_binary[4]
  -- field metadata --
  pg_type: 'pg_catalog.inet'
dst_addr: fixed_size_binary[4]
  -- field metadata --
  pg_type: 'pg_catalog.inet'
ip_options: list<__ip_options: struct<opt_code: uint8, opt_data: binary>>
  child 0, __ip_options: struct<opt_code: uint8, opt_data: binary>
      child 0, opt_code: uint8
      child 1, opt_data: binary
protocol: uint8
src_port: uint16
dst_port: uint16
seq_nr: uint32
ack_nr: uint32
tcp_flags: uint16
window_sz: uint16
tcp_checksum: uint16
urgent_ptr: uint16
tcp_options: list<__tcp_options: struct<opt_code: uint8, opt_data: binary>>
  child 0, __tcp_options: struct<opt_code: uint8, opt_data: binary>
      child 0, opt_code: uint8
      child 1, opt_data: binary
payload: binary
>>> X.num_record_batches
1

It looks like the captured data (/tmp/mytest.arrow) has a proper schema definition and 1 record batch.

>>> X.get_record_batch(0).to_pandas()
                     timestamp                dst_mac             src_mac  ...  urgent_ptr                                        tcp_options                                            payload
0   2021-02-16 07:01:17.410903  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0  [{'opt_code': 8, 'opt_data': b'i#\xb4\xc3\x17\...  b'\xefW\xb4\xba\x8a.\xc3\x19\xde0\xdePKY0\xc9\...
1   2021-02-16 07:01:23.620916  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0                                               None  b'\x81f\x00\x16f\xda\x00\x1bV\xd7\xfd\xf8\x80\...
2   2021-02-16 07:01:23.621060  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0                                               None  b'\x81f\x00\x16f\xda\x00\x1bV\xd7\xfeh\x80\x10...
3   2021-02-16 07:01:25.988754  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0                                               None  b'\x81h\x00\x16||\x86\x19\xdb0L\x8b\x80\x10\x0...
4   2021-02-16 07:01:21.475102  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0                                               None  b'\x81f\x00\x16f\xd9\xff\x8bV\xd7\xf3p\x80\x10...
..                         ...                    ...                 ...  ...         ...                                                ...                                                ...
101 2021-02-16 07:01:26.055332  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0  [{'opt_code': 8, 'opt_data': b'i#\xd6\x87\x17\...  b'+\x8a\x05\r\xb4\x98-B>f\x92\xf1G\xacW%)_\xbd...
102 2021-02-16 07:01:26.104838  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0                                               None  b'\x81h\x00\x16||\x8b\t\xdb0Q\xbb\x80\x10\x01\...
103 2021-02-16 07:01:26.141834  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0                                               None  b'\x81h\x00\x16||\x8c\xc5\xdb0SC\x80\x10\x01\x...
104 2021-02-16 07:01:26.669692  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0  [{'opt_code': 8, 'opt_data': b'i#\xd8\xed\x17\...  b"\x8e90\xc9\x83\x108z\xe0||\xac'\x8a\xdaY\x92...
105 2021-02-16 07:01:17.341717  b'\x1c4\xdav\xbb\x8e'  b'\x1c4\xdavF\x98'  ...           0  [{'opt_code': 8, 'opt_data': b'i#\xb4}\x17\x11...  b'\x00\x00\x05L\x05\x14\xa8U\\\x1bA\x0c\xbfD_\...

[106 rows x 24 columns]

Once it is converted to an ordinary data frame, you can search, analyze, and summarize the captured data using your familiar tools, like PostgreSQL or Python.
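
The MAC and IP address columns appear as raw bytes in the data frame above. A minimal Python sketch for turning them into readable strings, using only the standard library; the helper names format_mac and format_ip are hypothetical.

```python
import ipaddress

def format_mac(raw):
    """Render a 6-byte MAC address as colon-separated hex."""
    return ":".join(f"{b:02x}" for b in raw)

def format_ip(raw):
    """Render a 4-byte (IPv4) or 16-byte (IPv6) packed address."""
    return str(ipaddress.ip_address(raw))

# dst_mac value taken from the data frame output above.
print(format_mac(b"\x1c4\xdav\xbb\x8e"))       # 1c:34:da:76:bb:8e
print(format_ip(bytes([192, 168, 55, 106])))   # 192.168.55.106
print(format_ip(bytes(15) + b"\x01"))          # ::1
```

The first result matches the link/ether address of ens3 shown by ip a. With pandas, the same helpers could be applied to a whole column, for example with Series.map.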

About this note

  • Author: KaiGai Kohei
  • Last Update: 17th-Feb-2021
  • Change Logs:
    • Initial revision (17th-Feb-2021)