Import fast crc32 from Stephan Brumme #327
base: branch_libev
Conversation
@wangyu- I was measuring CPU usage over the Internet while running iperf3; however, that may not be trustworthy enough for submitting PRs. Do you have any suggestions on the dataset and how to evaluate the performance?
@lsylsy2 Hi, thanks for the PR. For performance measuring, the best way is probably a flame graph; here is an example I did previously for udp2raw. You can send packets at the same speed with iperf3, then generate the flame graph before and after the change.
IMO the current bottleneck is in the FEC library. This PR might improve the crc32 speed a lot, but might not improve the overall speed much. (I previously made some comments on improving the speed in https://github.com/wangyu-/UDPspeeder/issues/326)
If you cannot get the flame graph working, you can consider making a simple benchmark comparing crc32h and crc32fast. If the performance difference is big, that is still convincing enough that this is a useful PR.
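A minimal sketch of such a standalone benchmark is below. It does not use the project's crc32h or the library's crc32fast: `crc32_bitwise` and `crc32_table` are hypothetical stand-ins (a bit-by-bit and a table-driven standard CRC-32), and `throughput_mb_per_s` is an illustrative helper name. The real functions could be passed in their place if they have the same signature.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using std::size_t;
using std::uint32_t;
using std::uint8_t;

// Hypothetical stand-in for the slower routine: bit-by-bit CRC-32
// (reflected polynomial 0xEDB88320). Not UDPspeeder's actual crc32h.
uint32_t crc32_bitwise(const void* data, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Hypothetical stand-in for the faster routine: one table lookup per byte.
uint32_t crc32_table(const void* data, size_t len) {
    static uint32_t table[256];
    static bool ready = false;
    if (!ready) {  // build the 256-entry table on first use
        for (uint32_t n = 0; n < 256; ++n) {
            uint32_t c = n;
            for (int bit = 0; bit < 8; ++bit)
                c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
            table[n] = c;
        }
        ready = true;
    }
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = table[(crc ^ *p++) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}

using CrcFn = uint32_t (*)(const void*, size_t);

// Runs `fn` over the packet `rounds` times and returns throughput in MB/s.
// The packet is taken by value so one byte can change every round, which
// keeps the compiler from computing the checksum only once.
double throughput_mb_per_s(CrcFn fn, std::vector<uint8_t> packet, int rounds) {
    assert(rounds > 0 && !packet.empty());
    volatile uint32_t sink = 0;  // forces every result to be stored
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        packet[0] = static_cast<uint8_t>(i);
        sink = fn(packet.data(), packet.size());
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    (void)sink;
    return (static_cast<double>(packet.size()) * rounds) / (elapsed.count() * 1e6);
}
```

Calling `throughput_mb_per_s` once per implementation with a buffer close to the real UDP payload size (for example 1400 bytes, an assumed value) and a large round count gives two numbers that can be compared directly. Building with the same optimization flags as the project keeps the comparison fair.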
From the source code, it looks like the author has already considered the case of BIG ENDIAN systems. Have you or the author of the library actually tested crc32fast on BIG ENDIAN systems?
I found the performance issue using a flame graph: in larger-throughput scenarios (iperf with large UDP packets), crc32h was costing 20% of the time. However, my test was run over WAN with an unstable underlying link, so I was asking whether there is a standard for measuring performance. I will try to run it across two machines on a LAN and introduce stable packet drops.
I have not myself, but the library supports BIG ENDIAN and has been tested (and bug-fixed); this could be an example: stbrumme/crc32#8
I think your idea works. Personally, for convenience, I would do it in VMs with virtualized LANs (I use Proxmox). Simulate packet loss with iptables or something else, and send packets at a fixed speed with iperf3.
Bochs can simulate BIG ENDIAN systems on a PC. The most commonly seen BIG ENDIAN system nowadays is (BigEndian) MIPS. A simple verification on (BigEndian) MIPS with Bochs is sufficient IMO.
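One possible shape for such a verification is sketched below, assuming the test binary is built for the big-endian target and run there or under an emulator. All names are hypothetical: `crc32_reference` is a byte-at-a-time CRC-32 whose result does not depend on host byte order, and `crc32_agrees_with_reference` compares a candidate function against it over many lengths and start offsets, since word-at-a-time implementations are the ones that can break on a different endianness or alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using std::size_t;
using std::uint32_t;
using std::uint8_t;

using CrcFn = uint32_t (*)(const void*, size_t);

// Reports the byte order of the machine the test is actually running on,
// so a log can confirm the check really ran on a big-endian host.
bool host_is_big_endian() {
    const uint32_t probe = 0x01020304u;
    uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0x01;
}

// Byte-at-a-time CRC-32 (reflected polynomial 0xEDB88320). It only ever
// reads single bytes, so its result is the same on any host byte order.
uint32_t crc32_reference(const void* data, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Returns true if `candidate` matches the standard CRC-32 check value and
// the reference for every length 0..256 at start offsets 0..7.
bool crc32_agrees_with_reference(CrcFn candidate) {
    if (candidate("123456789", 9) != 0xCBF43926u) return false;
    std::vector<uint8_t> buf(256 + 8);
    for (size_t i = 0; i < buf.size(); ++i)
        buf[i] = static_cast<uint8_t>(i * 31u + 7u);  // arbitrary test pattern
    for (size_t offset = 0; offset < 8; ++offset)
        for (size_t len = 0; len <= 256; ++len)
            if (candidate(buf.data() + offset, len) !=
                crc32_reference(buf.data() + offset, len))
                return false;
    return true;
}
```

Passing the function under test as `candidate` on both a little-endian and a big-endian build, and printing `host_is_big_endian()` alongside the result, would show whether the fast path produces the same checksums on both.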
Interesting. Is this on the sending end or the receiving end? If it's the receiving end and packet loss is very small, then it's possible the FEC library doesn't need to do any calculation, and the bottleneck becomes crc32.
FEC is more resource-consuming on the sender side. In a scenario where the server is a cloud virtual server, the client is a consumer router, and download from server to client is usually much larger than upload, FEC may play a less important role.
Server: Oracle ARM VPS in Osaka
Forgot to say: here is an example code snippet.
(It's copied from a more complex file I wrote. It might work perfectly, or it might have some typos.)
Hi, I've updated the PR with some performance tests. Overall it brings performance improvements in all scenarios tested (at least on amd64).
hi, thanks for this PR. i have not done proper benchmarks but saw my throughput (with 100% cpu server-side) at ~14mbps jump to almost 90mbps on amd64 (debian 12) running simple speedtest.net (via their linux CLI client) test. note that there is a small change needed to fix compilation with cmake (oh, also note i switched to -O3 in my tree, but unrelated)
edit: disabling -fsanitize=address cflag, which is enabled by default, further improves performance. apparently it adds about 2x runtime overhead.
Thank you for the response and improvement. I was busy with private work and have not done the BIG ENDIAN test. Let me see if I can try it and include the other improvements.
tofurky ***@***.***> wrote on Friday, October 11, 2024 at 02:32:
… hi, thanks for this PR. i have not done proper benchmarks but saw my
throughput (with 100% cpu server-side) at ~14mbps jump to almost 90mbps on
amd64 (debian 12) running simple speedtest.net (via their linux CLI
client) test.
note that there is a small change needed to fix compilation with cmake :
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d6b11ef..ca34fe0 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -23,6 +23,7 @@ set(SOURCE_FILES
tunnel_client.cpp
tunnel_server.cpp
my_ev.cpp
+ crc32/Crc32.cpp
)
set(CMAKE_CXX_FLAGS "-Wall -Wextra -Wno-unused-variable -Wno-unused-parameter -Wno-missing-field-initializers -O3 -g -fsanitize=address,undefined")
You could use the isa-l library to implement fast FEC and crc32.
Description
Profiling UDPspeeder with perf shows that the CRC32 function accounts for 10~20% of CPU usage.
Replacing it with a faster open-source implementation can significantly improve performance.
Performance Test
Setup
Test machine A (UDPspeeder client) runs the UDPspeeder binary built by running "make" on the crc32 and branch_libev branches; server B (UDPspeeder server) runs the binary downloaded directly from GitHub, to ensure compatibility.
Script used
Simulating delay and loss
UDPspeeder command lines
iperf command lines
Test results
"real time" is the time between starting the speederv2 client and pressing Ctrl+C; it is not meaningful in the comparison.
BIG ENDIAN Validation
TODO
Flame Graph
TODO