804: Pcap2Arrow
Pcap2Arrow is a standalone command-line tool that captures network packets from a network interface (or reads previously captured PCAP files) and converts them into Apache Arrow files on the fly.
Apache Arrow is a column-oriented data format designed for analytics workloads, and is supported by various data analytics tools such as Python (PyArrow) and PG-Strom (Arrow_Fdw).
- Pcap2Arrow internally uses PF_RING to capture high-speed network traffic with minimal packet loss.
- Designed to exploit multi-core CPUs through multi-threading.
- Capable of capturing up to 100Gb traffic (depending on your hardware).
- Supports a structured schema for TCP, UDP, and ICMP over both IPv4 and IPv6.
- No need to import the captured data into a database system; copying the files is enough.
- libpcap
  - run dnf install libpcap-devel
- run
- pfring (library and kernel module)
- A Linux distribution supported by PF_RING
- GNU C Compiler
$ git clone https://github.com/heterodb/pg-strom.git
$ cd pg-strom
$ make -C pcap2arrow
$ sudo make -C pcap2arrow install
$ ./pcap2arrow -h
usage: pcap2arrow [OPTIONS] [<pcap files>...]
OPTIONS:
-i|--input=DEVICE
specifies a network device to capture packet.
--num-queues=N_QUEUE : num of PF-RING queues.
-o|--output=<output file; with format>
filename format can contains:
%i : interface name
%Y : year in 4-digits
%y : year in 2-digits
%m : month in 2-digits
%d : day in 2-digits
%H : hour in 2-digits
%M : minute in 2-digits
%S : second in 2-digits
%q : sequence number for each output files
default is '/tmp/pcap_%i_%y%m%d_%H%M%S.arrow'
-f|--force : overwrite file, even if exists
--no-payload: disables capture of payload
--parallel-write=N_FILES
opens multiple output files simultaneously (default: 1)
--chunk-size=SIZE : size of record batch (default: 128MB)
--direct-io : enables O_DIRECT for write-i/o
-l|--limit=LIMIT : (default: no limit)
-p|--protocol=PROTO
PROTO is a comma separated string contains
the following tokens:
tcp4, udp4, icmp4, ipv4, tcp6, udp6, icmp6, ipv6
(default: 'tcp4,udp4,icmp4')
--composite-options:
write out IPv4,IPv6 and TCP options as an array of composite values
--interface-id
enables the field to embed interface-id attribute, if source is
PCAP-NG files. Elsewhere, NULL shall be assigned here.
-r|--rule=RULE : packet filtering rules
(default: none; valid only capturing mode)
-s|--stat=INTERVAL
enables to print statistics per INTERVAL
-t|--threads=N_THREADS
--pcap-threads=N_THREADS
-h|--help : shows this message
Copyright (C) 2020-2021 HeteroDB,Inc <[email protected]>
Copyright (C) 2020-2021 KaiGai Kohei <[email protected]>
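The filename format tokens listed under -o can be illustrated with a small sketch. This is not Pcap2Arrow's own code: `expand_output_name` is a hypothetical helper, and rendering `%q` as a plain decimal number without padding is an assumption.

```python
from datetime import datetime


def expand_output_name(fmt, interface, when, seq):
    """Expand the %-tokens from the help text (hypothetical helper)."""
    tokens = {
        "i": interface,                  # interface name
        "Y": f"{when.year:04d}",         # year in 4 digits
        "y": f"{when.year % 100:02d}",   # year in 2 digits
        "m": f"{when.month:02d}",
        "d": f"{when.day:02d}",
        "H": f"{when.hour:02d}",
        "M": f"{when.minute:02d}",
        "S": f"{when.second:02d}",
        "q": str(seq),                   # assumption: unpadded decimal
    }
    out = []
    pos = 0
    while pos < len(fmt):
        if fmt[pos] == "%" and pos + 1 < len(fmt) and fmt[pos + 1] in tokens:
            out.append(tokens[fmt[pos + 1]])
            pos += 2
        else:
            # Unknown tokens and ordinary characters are copied unchanged.
            out.append(fmt[pos])
            pos += 1
    return "".join(out)


print(expand_output_name("/tmp/pcap_%i_%y%m%d_%H%M%S.arrow",
                         "ens3", datetime(2021, 2, 16, 16, 1, 16), 0))
# -> /tmp/pcap_ens3_210216_160116.arrow
```

Including `%q` (or a timestamp token) in the format presumably matters when several output files are produced, for example with --parallel-write, so that the names do not collide.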
- Capture from ens3, then write out packets to /tmp/mytest.arrow (overwrite if exists) with schema definition of TCP and UDP on IPv4.
$ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4
- In addition to the above, capture only TCP SYN packets destined for port 22. Filtering is done by the kernel module, not by the application.
$ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --rule '(tcp[13] & 2 != 0) and tcp dst port 22'
- In addition to the above, write out only the headers; the payload portion is dropped.
$ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --no-payload --rule '(tcp[13] & 2 != 0) and tcp dst port 22'
- In addition to the first example, print statistics every 2 seconds.
$ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --stat 2
- In addition to the above, use O_DIRECT mode for faster write I/O. (A large memory buffer, NVMe-SSD-grade storage, and more than 10 CPU threads are recommended.)
$ sudo ./pcap2arrow -i ens3 -f -o /tmp/mytest.arrow -p tcp4,udp4 --stat 2 --direct-io
- Convert previously captured PCAP files. Multiple input PCAP files are consolidated into the single output file specified by -o FILENAME.
$ ./pcap2arrow -f -o /tmp/mytest.arrow ~/pcaps/20210216_mynetwork.*.pcap
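The rule '(tcp[13] & 2 != 0) and tcp dst port 22' used in the examples above reads octet 13 of the TCP header (the flags octet) and tests the SYN bit (0x02). The following sketch applies the same test to a raw TCP header in Python, purely to illustrate what the rule matches; `is_syn_to_port` is a hypothetical name, and the real filtering happens in the capture layer, not in code like this.

```python
import struct

TCP_SYN = 0x02  # bit tested by "tcp[13] & 2"


def is_syn_to_port(tcp_header: bytes, port: int) -> bool:
    """Return True if the header has SYN set and targets the given port."""
    if len(tcp_header) < 14:
        return False  # too short to contain the flags octet
    # Destination port is the 16-bit big-endian value at offset 2.
    dst_port = struct.unpack_from("!H", tcp_header, 2)[0]
    return (tcp_header[13] & TCP_SYN) != 0 and dst_port == port


# A minimal 20-byte TCP header: src port 50000, dst port 22, SYN set.
syn = struct.pack("!HHIIBBHHH", 50000, 22, 1, 0, 5 << 4, TCP_SYN, 65535, 0, 0)
# Same header with only ACK (0x10) set.
ack = struct.pack("!HHIIBBHHH", 50000, 22, 1, 1, 5 << 4, 0x10, 65535, 0, 0)

print(is_syn_to_port(syn, 22))  # -> True
print(is_syn_to_port(ack, 22))  # -> False
```

Note that the test only checks that SYN is set, so a SYN+ACK segment addressed to port 22 would also match.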
Field Name | Data Type | Memo |
---|---|---|
timestamp | Timestamp(unit=us) | timestamp when packet is captured |
dst_mac | FixedSizeBinary(width=6) | destination MAC address |
src_mac | FixedSizeBinary(width=6) | source MAC address |
ether_type | Uint16 | EtherType; IPv4=0x0800, IPv6=0x86dd |
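MAC addresses are stored as 6 raw bytes, so they usually need formatting before they are readable. A minimal sketch, assuming the values arrive in Python as `bytes` objects (as they do in the PyArrow session further down); `format_mac` and `ETHER_TYPE_NAMES` are illustrative names, not part of Pcap2Arrow.

```python
# EtherType values taken from the table above.
ETHER_TYPE_NAMES = {0x0800: "IPv4", 0x86DD: "IPv6"}


def format_mac(raw: bytes) -> str:
    """Render a FixedSizeBinary(width=6) value as aa:bb:cc:dd:ee:ff."""
    if len(raw) != 6:
        raise ValueError("a MAC address must be exactly 6 bytes")
    return ":".join(f"{octet:02x}" for octet in raw)


print(format_mac(b"\x1c4\xdav\xbb\x8e"))  # -> 1c:34:da:76:bb:8e
print(ETHER_TYPE_NAMES.get(0x0800, "other"))  # -> IPv4
```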
Fields if IPv4 is available
Field Name | Data Type | Memo |
---|---|---|
tos | Uint8 | Type of Service field. |
ip_length | Uint16 | Length of IP packet including the header portion. |
identifier | Uint16 | Identifier of the IP packet assigned by the source. |
fragment | Uint16 | Fragment offset; including the flag bits (3-bits). |
ttl | Uint8 | Time to Live |
ip_checksum | Uint16 | IP header checksum. |
src_addr | FixedSizeBinary(width=4) | Source IP address. |
dst_addr | FixedSizeBinary(width=4) | Destination IP address. |
ip_options | OptionItems | variable length IP options, if any. Data type depends on --composite-options. Look at the complete list of IPv4 options here. |
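Because the fragment field packs the 3 flag bits together with the offset, it has to be split before use. The sketch below assumes the value follows the standard IPv4 header layout after conversion to host byte order (flags in the top 3 bits, offset in the low 13 bits, counted in 8-byte units); that layout is an assumption about how Pcap2Arrow stores the field, and `split_fragment` is a hypothetical helper.

```python
def split_fragment(fragment: int) -> dict:
    """Split a 16-bit IPv4 flags/offset value into its parts."""
    flags = (fragment >> 13) & 0x7
    return {
        "dont_fragment": bool(flags & 0x2),       # DF bit
        "more_fragments": bool(flags & 0x1),      # MF bit
        "offset_bytes": (fragment & 0x1FFF) * 8,  # offset is in 8-byte units
    }


print(split_fragment(0x4000))
# -> {'dont_fragment': True, 'more_fragments': False, 'offset_bytes': 0}
print(split_fragment(0x2000 | 185))
# -> {'dont_fragment': False, 'more_fragments': True, 'offset_bytes': 1480}
```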
Fields if IPv6 is available
Field Name | Data Type | Memo |
---|---|---|
traffic_class | Uint8 | IPv6 Traffic class |
flow_label | Uint32 | Flow label |
hop_limit | Uint8 | Hop limit |
src_addr6 | FixedSizeBinary(width=16) | Source IPv6 Address |
dst_addr6 | FixedSizeBinary(width=16) | Destination IPv6 Address |
ip6_options | OptionItems | variable length IPv6 options, if any. Data type depends on --composite-options. Look at the complete list of IPv6 options here. |
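The address columns hold raw 4-byte (IPv4) or 16-byte (IPv6) values. A short sketch using Python's standard `ipaddress` module converts them to text; `format_ip` is an illustrative name.

```python
import ipaddress


def format_ip(raw: bytes) -> str:
    """Convert a packed 4-byte or 16-byte address to its text form."""
    if len(raw) not in (4, 16):
        raise ValueError("expected 4 (IPv4) or 16 (IPv6) bytes")
    # ip_address() picks IPv4 or IPv6 based on the packed length.
    return str(ipaddress.ip_address(raw))


print(format_ip(bytes([192, 168, 55, 106])))  # -> 192.168.55.106
print(format_ip(bytes(15) + b"\x01"))         # -> ::1
```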
Field Name | Data Type | Memo |
---|---|---|
protocol | Uint8 | IP protocol number of the payload. |
Field Name | Data Type | Memo |
---|---|---|
src_port | Uint16 | Source port number |
dst_port | Uint16 | Destination port number |
Fields if TCP is available
Field Name | Data Type | Memo |
---|---|---|
seq_nr | Uint32 | Sequence number |
ack_nr | Uint32 | Acknowledgment number (if ACK set) |
tcp_flags | Uint16 | TCP flags: FIN=0x0001, SYN=0x0002, RST=0x0004, PSH=0x0008, ACK=0x0010, URG=0x0020, ECE=0x0040, CWR=0x0080, NS=0x0100 |
window_sz | Uint16 | TCP window size |
tcp_checksum | Uint16 | TCP checksum |
urgent_ptr | Uint16 | Urgent pointer (if URG set) |
tcp_options | OptionItems | TCP options |
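The tcp_flags column is a bitmask, so a value is decoded by testing each bit listed in the table above. A minimal sketch (`tcp_flag_names` is a hypothetical helper):

```python
# Bit values as listed in the tcp_flags row of the table.
TCP_FLAGS = [
    ("FIN", 0x0001), ("SYN", 0x0002), ("RST", 0x0004),
    ("PSH", 0x0008), ("ACK", 0x0010), ("URG", 0x0020),
    ("ECE", 0x0040), ("CWR", 0x0080), ("NS", 0x0100),
]


def tcp_flag_names(value: int) -> list:
    """Return the names of all flags set in a tcp_flags value."""
    return [name for name, bit in TCP_FLAGS if value & bit]


print(tcp_flag_names(0x0002))  # -> ['SYN']
print(tcp_flag_names(0x0012))  # -> ['SYN', 'ACK']
print(tcp_flag_names(0x0018))  # -> ['PSH', 'ACK']
```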
Fields if UDP is available
Field Name | Data Type | Memo |
---|---|---|
udp_length | Uint16 | Length of UDP portion (header+payload) |
udp_checksum | Uint16 | UDP checksum |
Fields if ICMP is available
Field Name | Data Type | Memo |
---|---|---|
icmp_type | Uint8 | ICMP Type |
icmp_code | Uint8 | ICMP Subtype |
icmp_checksum | Uint16 | ICMP checksum |
Field Name | Data Type | Memo |
---|---|---|
payload | Binary | Payload of the packet. If the L3 protocol is known (TCP, UDP, or ICMP), this is the L3 payload; otherwise, it is the L2 payload. |
When --composite-options is given, Pcap2Arrow writes out the IPv4/IPv6/TCP options fields, each of which can contain multiple sub-fields, as an array of composite values of type (Uint8, Binary).
Each sub-field in the IPv4 and TCP options carries its type in the first octet and its length in the second octet, followed by a variable-length payload. Each sub-field is structured as a composite value (Uint8 opt_code, Binary opt_data), and the xxx_options field packs them into an array of composite values if the packet contains any options.
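The type/length/payload layout described above can be sketched as a small parser. This is an illustration of the general TCP option encoding, not Pcap2Arrow's implementation: `parse_tlv_options` is a hypothetical name, and dropping the single-octet padding options (End of Option List = 0, No-Operation = 1) from the result is an assumption.

```python
def parse_tlv_options(raw: bytes) -> list:
    """Split raw TCP options into (opt_code, opt_data) pairs."""
    items = []
    pos = 0
    while pos < len(raw):
        code = raw[pos]
        if code == 0:        # End of Option List: stop parsing
            break
        if code == 1:        # No-Operation: single octet of padding
            pos += 1
            continue
        if pos + 1 >= len(raw):
            break            # truncated: no length octet
        length = raw[pos + 1]  # counts the type and length octets too
        if length < 2 or pos + length > len(raw):
            break            # malformed or truncated option
        items.append((code, raw[pos + 2:pos + length]))
        pos += length
    return items


# MSS option (kind 2), two NOPs, then a Timestamps option (kind 8).
raw = bytes([2, 4, 0x05, 0xB4, 1, 1, 8, 10, 0, 0, 0, 1, 0, 0, 0, 2])
print(parse_tlv_options(raw))
# -> [(2, b'\x05\xb4'), (8, b'\x00\x00\x00\x01\x00\x00\x00\x02')]
```

Each returned pair corresponds to one element of the array of composite values stored in the options field.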
The sub-fields in the IPv6 options are organized as a singly linked list. The fixed-length portion of the IPv6 header contains the identifier of the next sub-field, which follows the fixed-length portion, and each sub-field in turn identifies the one after it. An IPv6 header can contain the following option fields:
- Hop-by-Hop (type=0)
- Routing (type=43)
- Fragment (type=44)
- IPsec Authentication Header(type=51)
- Destination Options (type=60)
- Mobility (type=135)
- Host Identity Protocol (type=139)
- Shim6 Protocol (type=140)
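The linked-list structure can be sketched as a loop that follows each header's "next header" octet. The sketch below is an assumption-laden illustration, not Pcap2Arrow's code: `walk_ipv6_extensions` is a hypothetical helper, and the length rules reflect the usual extension header encodings (a Fragment header is always 8 octets, the Authentication Header counts 4-octet units minus 2, and the others count 8-octet units beyond the first 8 octets).

```python
EXT_HEADERS = {
    0: "Hop-by-Hop", 43: "Routing", 44: "Fragment",
    51: "IPsec Authentication Header", 60: "Destination Options",
    135: "Mobility", 139: "Host Identity Protocol", 140: "Shim6 Protocol",
}


def walk_ipv6_extensions(first_next_header: int, data: bytes):
    """Follow the chain; return ([(type, raw), ...], next protocol number).

    If the data is truncated, the walk stops early and the returned
    protocol number is the extension type that could not be read.
    """
    found = []
    nxt = first_next_header
    pos = 0
    while nxt in EXT_HEADERS:
        if pos + 8 > len(data):
            break  # truncated: every extension header is at least 8 octets
        following = data[pos]             # first octet names the next header
        if nxt == 44:
            size = 8                      # Fragment header has a fixed size
        elif nxt == 51:
            size = (data[pos + 1] + 2) * 4
        else:
            size = (data[pos + 1] + 1) * 8
        if pos + size > len(data):
            break
        found.append((nxt, data[pos:pos + size]))
        nxt = following
        pos += size
    return found, nxt


hop_by_hop = bytes([44, 0, 1, 4, 0, 0, 0, 0])  # next = Fragment, PadN padding
fragment = bytes([6, 0, 0, 0, 0, 0, 0, 1])     # next = TCP (protocol 6)
headers, proto = walk_ipv6_extensions(0, hop_by_hop + fragment)
print([EXT_HEADERS[t] for t, _ in headers], proto)
# -> ['Hop-by-Hop', 'Fragment'] 6
```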
Field Name | Data Type | Memo |
---|---|---|
interface_id | Uint32 | Network interface identifier embedded in the interface description block (IDB) of PCAP-NG files. Otherwise, NULL is assigned. |
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
link/ether 1c:34:da:76:bb:8e brd ff:ff:ff:ff:ff:ff
inet 192.168.55.106/24 brd 192.168.55.255 scope global noprefixroute ens3
valid_lft forever preferred_lft forever
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:25:90:5e:d0:66 brd ff:ff:ff:ff:ff:ff
inet 192.168.77.106/24 brd 192.168.77.255 scope global noprefixroute eno1
valid_lft forever preferred_lft forever
inet6 2405:6580:3501:1d00:5c4c:b395:c5d4:8249/64 scope global dynamic noprefixroute
valid_lft 2591929sec preferred_lft 604729sec
inet6 fe80::4ef1:6a57:e0b9:a548/64 scope link noprefixroute
valid_lft forever preferred_lft forever
$ sudo pcap2arrow -i ens3 -p tcp4 -o /tmp/mytest.arrow --stat 2 --composite-options
2021-02-16 <# Recv> <# Drop> <Total Sz> <# IPv4> <# TCP>
16:01:16 0 0 0B 0 0
16:01:18 26 0 4953B 26 26
16:01:20 18 0 1404B 18 18
16:01:22 16 0 1236B 16 16
16:01:24 9 0 702B 9 9
16:01:26 29 0 5211B 29 29
16:01:28 8 0 660B 8 8
^CStats total:
Recv packets: 106
Drop packets: 0
Total bytes: 14166
IPv4 packets: 106
TCP packets: 106
$ ls -l /tmp/mytest.arrow
-rw-r--r--. 1 root root 25190 Feb 16 16:01 /tmp/mytest.arrow
$ python3
Python 3.6.8 (default, Aug 24 2020, 17:57:11)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader('/tmp/mytest.arrow')
>>> X.schema
timestamp: timestamp[us]
dst_mac: fixed_size_binary[6]
-- field metadata --
pg_type: 'pg_catalog.macaddr'
src_mac: fixed_size_binary[6]
-- field metadata --
pg_type: 'pg_catalog.macaddr'
ether_type: uint16
tos: uint8
ip_length: uint16
identifier: uint16
fragment: uint16
ttl: uint8
ip_checksum: uint16
src_addr: fixed_size_binary[4]
-- field metadata --
pg_type: 'pg_catalog.inet'
dst_addr: fixed_size_binary[4]
-- field metadata --
pg_type: 'pg_catalog.inet'
ip_options: list<__ip_options: struct<opt_code: uint8, opt_data: binary>>
child 0, __ip_options: struct<opt_code: uint8, opt_data: binary>
child 0, opt_code: uint8
child 1, opt_data: binary
protocol: uint8
src_port: uint16
dst_port: uint16
seq_nr: uint32
ack_nr: uint32
tcp_flags: uint16
window_sz: uint16
tcp_checksum: uint16
urgent_ptr: uint16
tcp_options: list<__tcp_options: struct<opt_code: uint8, opt_data: binary>>
child 0, __tcp_options: struct<opt_code: uint8, opt_data: binary>
child 0, opt_code: uint8
child 1, opt_data: binary
payload: binary
>>> X.num_record_batches
1
The captured data (/tmp/mytest.arrow) appears to have a proper schema definition and one record batch.
>>> X.get_record_batch(0).to_pandas()
timestamp dst_mac src_mac ... urgent_ptr tcp_options payload
0 2021-02-16 07:01:17.410903 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 [{'opt_code': 8, 'opt_data': b'i#\xb4\xc3\x17\... b'\xefW\xb4\xba\x8a.\xc3\x19\xde0\xdePKY0\xc9\...
1 2021-02-16 07:01:23.620916 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 None b'\x81f\x00\x16f\xda\x00\x1bV\xd7\xfd\xf8\x80\...
2 2021-02-16 07:01:23.621060 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 None b'\x81f\x00\x16f\xda\x00\x1bV\xd7\xfeh\x80\x10...
3 2021-02-16 07:01:25.988754 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 None b'\x81h\x00\x16||\x86\x19\xdb0L\x8b\x80\x10\x0...
4 2021-02-16 07:01:21.475102 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 None b'\x81f\x00\x16f\xd9\xff\x8bV\xd7\xf3p\x80\x10...
.. ... ... ... ... ... ... ...
101 2021-02-16 07:01:26.055332 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 [{'opt_code': 8, 'opt_data': b'i#\xd6\x87\x17\... b'+\x8a\x05\r\xb4\x98-B>f\x92\xf1G\xacW%)_\xbd...
102 2021-02-16 07:01:26.104838 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 None b'\x81h\x00\x16||\x8b\t\xdb0Q\xbb\x80\x10\x01\...
103 2021-02-16 07:01:26.141834 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 None b'\x81h\x00\x16||\x8c\xc5\xdb0SC\x80\x10\x01\x...
104 2021-02-16 07:01:26.669692 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 [{'opt_code': 8, 'opt_data': b'i#\xd8\xed\x17\... b"\x8e90\xc9\x83\x108z\xe0||\xac'\x8a\xdaY\x92...
105 2021-02-16 07:01:17.341717 b'\x1c4\xdav\xbb\x8e' b'\x1c4\xdavF\x98' ... 0 [{'opt_code': 8, 'opt_data': b'i#\xb4}\x17\x11... b'\x00\x00\x05L\x05\x14\xa8U\\\x1bA\x0c\xbfD_\...
[106 rows x 24 columns]
Once the data is converted into an ordinary data frame, you can search, analyze, and summarize the captured data with familiar tools such as PostgreSQL or Python.
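As one possible starting point, the sketch below summarizes captured packets in plain Python. It works on a list of dictionaries keyed by the field names from the schema above; the sample rows are made up for illustration, and `summarize_by_dst_port` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_by_dst_port(rows):
    """Return {dst_port: (packet_count, total_ip_length)}."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for row in rows:
        port = row.get("dst_port")
        if port is None:
            continue  # e.g. ICMP packets carry no port number
        counts[port] += 1
        sizes[port] += row.get("ip_length") or 0
    return {port: (counts[port], sizes[port]) for port in counts}


# Made-up sample rows using the column names of the schema.
rows = [
    {"dst_port": 22, "ip_length": 52},
    {"dst_port": 22, "ip_length": 1500},
    {"dst_port": 443, "ip_length": 60},
    {"dst_port": None, "ip_length": 84},
]
print(summarize_by_dst_port(rows))
# -> {22: (2, 1552), 443: (1, 60)}
```

With a recent PyArrow release, rows of this shape can likely be obtained from the reader in the session above with something like `X.read_all().to_pylist()`; older releases may need a different conversion path, such as going through pandas.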
- Author: KaiGai Kohei
- Last Update: 17th-Feb-2021
- Change Logs:
- Initial revision (17th-Feb-2021)