First of all, the msys2/ucrt64 build is faster than the msys2/mingw64 build. (The msys2/clang64 environment is used for clang builds.)
"normal" version is without setting -march
x86-64-v3 version is with -march=x86-64-v3
(cannot run on CPUs without AVX2)
znver4 version is of course with -march=znver4
(cannot run on CPUs without AVX-512)
Both seem to be slightly faster than John's build at rarewares.org.
On Zen 4, roughly: 700x -> 715x -> 730x realtime (normal -> x86-64-v3 -> znver4).
And yes, the clang build is significantly slower (630x), so the gcc build is recommended. See the benchmark.
fast-math
Seems to improve the speed of AVX-512 enabled builds (znver4 and x86-64-v4), but makes no really noticeable difference for the AVX2 (x86-64-v3) build.
Roughly 750x.
P.S. All of these -march and fast-math changes make the encoder encode the file slightly differently.
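A quick way to confirm that two builds really produce different output is to hash the encoded files. This is only a minimal sketch: the helper names `file_digest` and `same_output` are made up for illustration, and it only tells you whether the files are byte-identical, not how audible the difference is.

```python
import hashlib
from pathlib import Path


def file_digest(path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def same_output(a, b) -> bool:
    """True if both encoded files are byte-for-byte identical."""
    return file_digest(a) == file_digest(b)
```

For example, `same_output("normal.mp3", "znver4.mp3")` (hypothetical file names) would be expected to return `False` if the two builds encode the same source differently.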
Update: added clang builds and various optimization options.
Some kind of benchmark
source length: 01:25:03.000 or 5103 seconds
source format: Wave PCM S16 48000Hz Stereo
measurement: py -m timeit -n 1 -r 5 -v -s "import subprocess" "subprocess.run('\"hmp3-gcc-fast-math\" 1.wav',shell=True)"
CPU: Ryzen 9 7900X
RAM: DDR5-6000, 32x2, ~80 GB/s, 63.3 ns (AIDA64)
SSD: NVMe PCIe 3.0 in an external enclosure, ~400 MB/s (where the test file and hmp3 binaries are stored; the OS is on an internal Samsung 980 Pro)
compiler versions: gcc: 14.1.0, clang: 18.1.6
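The measurement above could also be scripted directly instead of going through `timeit`. The following is a rough sketch under a few assumptions: `time_command` is a made-up helper name, it passes the command as an argument list rather than through the shell, and it discards the encoder's console output so printing does not affect the timing.

```python
import subprocess
import time


def time_command(cmd, repeats=5):
    """Run cmd `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command exits with a non-zero status
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings
```

A call such as `min(time_command(["hmp3-gcc-fast-math", "1.wav"]))` would then give the best of five runs, which is the same figure `timeit` reports by default.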
build | clang | gcc | gcc+pgo | clang+pgo |
---|---|---|---|---|
normal | 7.58s/673.22x | 7.43s/686.81x | 7.23s/705.81x | |
fast-math | 7.54s/676.79x | 7.16s/712.71x | 7.05s/723.83x | |
x86-64-v3 | 7.14s/714.71x | 7.13s/715.71x | 6.8s/750.44x | |
x86-64-v3-fast-math | 6.98s/731.09x | 7.02s/726.92x | 6.95s/734.24x | 6.55s/779.08x |
znver4 | 7.82s/652.56x | 6.58s/775.53x | | |
znver4-fast-math | 11.6s/439.91x | 6.12s/833.82x | 5.92s/861.99x | 14.3s/356.85x(WTF?) |
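Each cell is the encode time followed by the speed relative to realtime, i.e. source length divided by encode time. A small sketch of that arithmetic (the function names are made up for illustration):

```python
def duration_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def speed_factor(source_seconds: float, encode_seconds: float) -> float:
    """How many times faster than realtime the encode ran."""
    return source_seconds / encode_seconds


source = duration_seconds("01:25:03.000")
print(source)                                 # 5103.0
print(f"{speed_factor(source, 7.58):.2f}x")   # 673.22x
```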
dev-20240615 contains:
clang-fast-math-pgo,
clang-x86-64-v3-fast-math-pgo,
gcc-x86-64-v3-fast-math-pgo,
gcc-znver4-fast-math-pgo,
based on the benchmarks.