Import fast crc32 from Stephan Brumme #327

Open
wants to merge 1 commit into
base: branch_libev

lsylsy2

@lsylsy2 lsylsy2 commented Jun 16, 2024

Description

Profiling UDPspeeder with perf shows that the CRC32 function accounts for 10~20% of CPU usage.
Replacing it with a faster open-source implementation can significantly improve performance.

Performance Test

Setup

  1. UDPspeeder client: runs on test machine A, a 1-core, 1 GiB VM running Debian 12 on my Proxmox VE NAS (Ryzen 3 4350G CPU), which was mostly idle during the test.
  2. UDPspeeder server and iperf3 server: run on server B, a Debian 12 VM on my Windows Hyper-V PC.
  3. iperf3 client: runs directly on my PC, which also hosts server B.

Test machine A (UDPspeeder client) runs UDPspeeder binaries built with "make" from the crc32 and branch_libev branches. Server B (UDPspeeder server) runs the binary downloaded directly from GitHub, to ensure compatibility.

Script used

Simulating delay and loss

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 1ms loss 0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.133 flowid 1:3
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.1.132 flowid 1:3

UDPspeeder command lines

./speederv2_amd64 -s -l0.0.0.0:5301 -r127.0.0.1:5201
time ./speederv2_branch_libev -c -l0.0.0.0:5201 -r192.168.1.133:5301
time ./speederv2_crc -c -l0.0.0.0:5201 -r192.168.1.133:5301

iperf command lines

.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 --length 1400
.\iperf3.exe -c 192.168.1.132 -p 5201 -u -b 100M -t 60 -R --length 1400

Test results

"real" is the wall-clock time from starting the speederv2 client until Ctrl+C was pressed, so it is not meaningful for the comparison.

| ping=delay*2 | Comments | Send: branch_libev | Send: crc32 PR | Receive (-R): branch_libev | Receive (-R): crc32 PR |
|---|---|---|---|---|---|
| delay0/loss0 100M/length1400 | Ideal LAN or same city | real 1m4.002s, user 0m8.940s, sys 0m7.169s | real 1m12.985s, user 0m5.239s, sys 0m6.836s | real 1m4.562s, user 0m5.966s, sys 0m4.461s | real 1m10.889s, user 0m2.646s, sys 0m4.357s |
| delay10/loss1 100M/length1400 | ping=20 with 1% loss. A pretty good China-HK/KR/JP network | real 1m7.122s, user 0m9.165s, sys 0m4.124s | real 1m10.473s, user 0m5.827s, sys 0m3.216s | real 1m2.657s, user 0m5.850s, sys 0m4.625s | real 1m2.515s, user 0m2.633s, sys 0m4.435s |
| delay10/loss1 50M/length200 | Testing small packets | real 1m14.906s, user 0m7.325s, sys 0m3.939s | real 1m3.129s, user 0m5.751s, sys 0m3.155s | real 1m6.626s, user 0m3.138s, sys 0m9.732s | real 1m2.153s, user 0m2.458s, sys 0m9.062s |
| delay30/loss5 100M/length1400 | ping=60 with 5% loss. A less ideal network within Asia | real 1m5.962s, user 0m9.277s, sys 0m7.061s | real 1m3.250s, user 0m5.494s, sys 0m4.875s | real 1m7.642s, user 0m6.329s, sys 0m4.158s | real 1m3.010s, user 0m2.457s, sys 0m4.704s |
| delay30/loss5 50M/length200 | Testing small packets | real 1m33.371s, user 0m7.033s, sys 0m5.237s | real 1m7.538s, user 0m5.097s, sys 0m3.929s | real 1m3.274s, user 0m2.006s, sys 0m11.125s | real 1m3.346s, user 0m1.585s, sys 0m9.966s |
| delay80/loss10 50M/length1400 | ping=160 with 10% loss. Pretty bad network across the Pacific. Usually aiming at low-cost Web browsing | real 1m5.898s, user 0m4.867s, sys 0m2.093s | real 1m4.362s, user 0m3.054s, sys 0m1.760s | real 1m9.226s, user 0m3.642s, sys 0m1.938s | real 1m4.579s, user 0m1.189s, sys 0m2.671s |
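
To read the numbers above as a relative saving, the user CPU time of the two builds can be compared directly. The sketch below is only a worked calculation on the delay0/loss0 100M/length1400 row; the helper names `reduction_percent` and `print_delay0_loss0` are made up for this illustration and are not part of UDPspeeder.

```cpp
#include <cassert>
#include <cstdio>

// Relative reduction in CPU time, in percent, going from `before_s` to `after_s`.
double reduction_percent(double before_s, double after_s) {
    assert(before_s > 0.0);
    return (before_s - after_s) / before_s * 100.0;
}

// Prints the user-time reduction for the delay0/loss0 100M/length1400 row.
void print_delay0_loss0() {
    // user times in seconds, copied from the results table
    const double send_before = 8.940, send_after = 5.239;
    const double recv_before = 5.966, recv_after = 2.646;
    std::printf("send:    %.1f%% less user time\n",
                reduction_percent(send_before, send_after));
    std::printf("receive: %.1f%% less user time\n",
                reduction_percent(recv_before, recv_after));
}
```

For this row the user time drops by roughly 41% when sending and roughly 56% when receiving. The sys time is left out here because it is dominated by kernel networking work rather than by the checksum.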


BIG ENDIAN Validation

TODO

Flame Graph

TODO

@lsylsy2
Author

lsylsy2 commented Jun 16, 2024

@wangyu- I measured CPU usage over the Internet while running iperf3; however, that may not be trustworthy enough for submitting a PR. Do you have any suggestions on the dataset and on how to evaluate the performance?

@lsylsy2 lsylsy2 marked this pull request as ready for review June 16, 2024 15:57
@wangyu-
Owner

wangyu- commented Jun 17, 2024

@lsylsy2 Hi, thanks for the PR.

For performance measuring:

the best way is probably a flame graph, here is an example for udp2raw I did previously:

you can send packets at the same speed with iperf3, then generate the flame graph before and after the change.

@wangyu-
Owner

wangyu- commented Jun 17, 2024

IMO the current bottleneck is at the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.

(I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

@wangyu-
Owner

wangyu- commented Jun 17, 2024

> the best way is probably flame graph:

If you cannot get the flame graph working, you can consider making a simple benchmark comparing crc32h and crc32fast. If the performance difference is big, that is still convincing enough that this is a useful PR.
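
A simple benchmark of this kind could look like the sketch below. It does not call the project's crc32h or the imported library, whose signatures are not shown in this thread. Instead it uses two self-contained stand-ins, `crc32_bitwise` and `crc32_table`, and a hypothetical helper `time_crc`. To compare the real functions, they would be passed to the timing helper in place of the stand-ins.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in "slow" version: bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320).
std::uint32_t crc32_bitwise(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Stand-in "fast" version: one table lookup per input byte.
std::uint32_t crc32_table(const std::uint8_t* data, std::size_t len) {
    static std::uint32_t table[256];
    static bool ready = false;
    if (!ready) {  // build the lookup table on first use (not thread-safe)
        for (std::uint32_t n = 0; n < 256; ++n) {
            std::uint32_t c = n;
            for (int bit = 0; bit < 8; ++bit)
                c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
            table[n] = c;
        }
        ready = true;
    }
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i)
        crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFFu];
    return ~crc;
}

using CrcFn = std::uint32_t (*)(const std::uint8_t*, std::size_t);

// Seconds needed to checksum a packet of this size `rounds` times.
double time_crc(CrcFn fn, std::vector<std::uint8_t> packet, int rounds) {
    assert(!packet.empty());
    volatile std::uint32_t sink = 0;  // keeps the optimizer from dropping the calls
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        packet[0] = static_cast<std::uint8_t>(i);  // vary the input each round
        sink = sink ^ fn(packet.data(), packet.size());
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_crc(crc32_bitwise, std::vector<std::uint8_t>(1400, 0xAB), 1000000)` and the same with `crc32_table`, built with the project's optimization flags, gives two durations to compare. The 1400-byte buffer mirrors the iperf3 `--length 1400` setting used in the tests.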

@wangyu-
Owner

wangyu- commented Jun 17, 2024

From the source code, looks like the author has already considered the case of BIG ENDIAN systems.

Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?

@lsylsy2
Author

lsylsy2 commented Jun 17, 2024

> @lsylsy2 Hi, thanks for the PR.
>
> For performance measuring:
>
> the best way is probably flame graph, here is an example for udp2raw I did previously:
>
> you can send same speed of packets with iperf3, then generate the flame graph before and after the change.
>
> IMO the current bottleneck is at the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
>
> (I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

I found the performance issue using a flame graph: in higher-throughput scenarios (iperf with large UDP packets), crc32h was costing 20% of the time. However, my test ran over a WAN with an unstable underlying link, so I was asking whether there is a standard way to measure performance. I will try running between two machines on a LAN and introducing a stable packet drop rate.
BTW, optimizing the XOR encryption by using 64-bit operations can also improve performance by 3~10%, but that code is written by me and has not been tested on multiple platforms (it also needs changes to support 32-bit systems, etc.), so I will not submit it very soon.
branch_libev...lsylsy2:UDPspeeder:2406_optimization
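
The general idea of XORing with 64-bit operations can be illustrated as follows. This is a minimal sketch and not the code from the linked branch: the function name `xor_buffer64` and the repeating 8-byte key are assumptions made for the example, and UDPspeeder's actual XOR scheme may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// XORs `len` bytes of `data` in place with a repeating 8-byte key,
// handling 8 bytes per step and finishing the tail byte by byte.
void xor_buffer64(std::uint8_t* data, std::size_t len, const std::uint8_t key[8]) {
    assert(data != nullptr || len == 0);
    std::uint64_t key_word;
    std::memcpy(&key_word, key, 8);  // memcpy avoids unaligned and aliasing problems
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, 8);
        word ^= key_word;
        std::memcpy(data + i, &word, 8);
    }
    for (; i < len; ++i)             // remaining 0..7 bytes
        data[i] ^= key[i % 8];
}
```

Because the key and the data are both loaded with `memcpy`, each byte is XORed with the key byte at the same position regardless of host byte order, so the result matches a plain byte-by-byte loop. `std::uint64_t` is also available on 32-bit targets, where the compiler splits each operation into two 32-bit ones.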

> From the source code, looks like the author has already considered the case of BIG ENDIAN systems.
>
> Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?

I have not tested it myself, but the library supports BIG ENDIAN and has been tested (and had a bug fixed) there; this could be an example: stbrumme/crc32#8
Do you know how I can get a virtual machine running in BIG ENDIAN to test it? It seems even the latest ARM Mac is little endian.

@lsylsy2
Author

lsylsy2 commented Jun 17, 2024

[Flame graph image: branch_libev]
This is a flame graph I captured on branch_libev while using iperf to send and receive UDP packets. It was run over a WAN where the packet drop rate was unstable, but it still shows crc32h costing a lot.

@wangyu-
Owner

wangyu- commented Jun 17, 2024

> However, my test was run over WAN with unstable underlying link, so I was asking for if there is a performance measuring standard. Will try to run over two machines in LAN and introducing stable packet drops.

I think your idea works.

Personally, for convenience I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else. Send packets at a fixed speed with iperf3.

> Do you know can I have some virtual machines running in BIG ENDIAN and test it? It seems even the latest ARM Mac is using little endian.

Bochs can simulate BIG ENDIAN systems on PC. The most commonly seen BIG ENDIAN systems nowadays are (BigEndian) MIPS. A simple verification on (BigEndian) MIPS with Bochs is sufficient IMO.
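
A simple verification of this kind could be a small self-check compiled for the big-endian target. The sketch below is an assumption about what such a check might look like, not something taken from the PR: `is_little_endian`, `crc32_reference` and `crc_matches_reference` are names invented here, and the function under test would be passed in as `candidate`. It relies on the standard CRC-32 check value 0xCBF43926 for the ASCII string "123456789", and on a byte-oriented reference implementation whose result does not depend on host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True when the host stores the least significant byte first.
bool is_little_endian() {
    const std::uint32_t probe = 1;
    std::uint8_t first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Bit-at-a-time CRC-32; it only reads single bytes, so it behaves the same
// on little-endian and big-endian hosts.
std::uint32_t crc32_reference(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

using CrcCandidate = std::uint32_t (*)(const std::uint8_t*, std::size_t);

// Returns true when `candidate` yields the standard check value and agrees
// with the reference for every length from 0 to 64 bytes.
bool crc_matches_reference(CrcCandidate candidate) {
    assert(candidate != nullptr);
    static const char check[] = "123456789";
    const auto* bytes = reinterpret_cast<const std::uint8_t*>(check);
    if (candidate(bytes, 9) != 0xCBF43926u) return false;

    std::uint8_t buf[64];
    for (std::size_t i = 0; i < sizeof buf; ++i)
        buf[i] = static_cast<std::uint8_t>(i * 31 + 7);
    for (std::size_t len = 0; len <= sizeof buf; ++len)
        if (candidate(buf, len) != crc32_reference(buf, len)) return false;
    return true;
}
```

Sweeping over every short length matters because word-at-a-time implementations usually treat the bytes left over after the last full word separately, which is where byte-order mistakes tend to hide. Printing `is_little_endian()` alongside the result confirms that the check really ran on a big-endian host.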

@wangyu-
Owner

wangyu- commented Jun 17, 2024

> This is one flame graph I run over branch_libev and used iperf to send and receive UDP packets. However this is run over WAN and packet drop rate was unstable, it still shows crc32h is costing a lot.

Interesting. Is this on the sending end or the receiving end?

If it's the receiving end and packet loss is very low, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.

@lsylsy2
Author

lsylsy2 commented Jun 17, 2024

> IMO the current bottleneck is at the FEC library. This PR might improve the crc32 speed a lot, but might not be able to improve the overall speed a lot.
>
> (I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)

FEC is more resource-intensive on the sender side. In a scenario where the server is a cloud virtual server, the client is a consumer router, and download from server to client is usually much larger than upload, FEC may play a less important role.
I'll run more tests and send the results later.

@lsylsy2
Author

lsylsy2 commented Jun 17, 2024

> This is one flame graph I run over branch_libev and used iperf to send and receive UDP packets. However this is run over WAN and packet drop rate was unstable, it still shows crc32h is costing a lot.
>
> Interesting. Is this on the sending end or receiver end?
>
> If it's the receiver end and packet loss is very tiny, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck become crc32.

Server: Oracle ARM VPS in Osaka
Client (running perf and generating this graph): a single-core virtual machine on a mostly idle AMD Ryzen 3 PRO 4350G, whose single-thread performance should be similar to a Ryzen 5 3600 or i3-10100.
The test ran sending and receiving for 60 seconds each, both at 100 Mbit/s throughput; however, I forget whether I set the packet size to 200/400 or left the default ~1400.

@wangyu-
Owner

wangyu- commented Jun 17, 2024

> Personally for convenience I would do it in VM with virtualize LANs (I personally I use Proxmox). Simulate packet loss with iptables or something else. Send fixed speed of packet with iperf3.

Forgot to say: tc and netem are actually easier for simulating packet loss.

Here is an example snippet:

DEV=ens5

# turn driver optimizations off
sudo ethtool -K $DEV gro off
sudo ethtool -K $DEV tso off
sudo ethtool -K $DEV gso off

sudo tc qdisc del dev $DEV root
sudo tc qdisc add dev $DEV root netem loss 5.5%

(It's copied from a more complex file I wrote. It might work perfectly, or it might have some typos.)

@lsylsy2
Author

lsylsy2 commented Jun 22, 2024

Hi, I've added some performance tests. Overall, the change brings performance improvements in all scenarios tested (at least on amd64).
I will add the flame graph comparison and BIG ENDIAN validation later.

@tofurky

tofurky commented Oct 10, 2024

hi, thanks for this PR. i have not done proper benchmarks but saw my throughput (with 100% cpu server-side) at ~14mbps jump to almost 90mbps on amd64 (debian 12) running simple speedtest.net (via their linux CLI client) test.

note that there is a small change needed to fix compilation with cmake (oh, also note i switched to -O3 in my tree, but unrelated)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index d6b11ef..ca34fe0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -23,6 +23,7 @@ set(SOURCE_FILES
         tunnel_client.cpp
         tunnel_server.cpp
         my_ev.cpp
+       crc32/Crc32.cpp
 )
 set(CMAKE_CXX_FLAGS "-Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -Wno-missing-field-initializers -O3 -g -fsanitize=address,undefined")

edit: disabling -fsanitize=address cflag, which is enabled by default, further improves performance. apparently it adds about 2x runtime overhead.

@lsylsy2
Author

lsylsy2 commented Oct 12, 2024 via email

@xAJx1383

xAJx1383 commented Jan 2, 2025

you could use isa-l library for implementing fast FEC and crc32,
in my tests the limiting factor with this library was my ram speed.
