Request: support of generic CPUs (w/o SSSE3) #3
Currently it only detects AVX2/SSSE3. I'll have to add support for a table-lookup (super slow) fallback.
Another good option here is to use the Longhair/CRS trick: split up the buffers by bit and do everything with XOR. This can achieve nearly the same performance in some cases! The only problem is that there won't be interop between the versions, so maybe that should be done in a different (new) library.
Yeah, using the reference version makes it run about 25x slower.
In PAR2 experiments (see ECC_speed_comparison.zip), the non-SIMD implementation is only 4x slower (405.0 vs 1559.2 sec). Probably you used a non-optimal solution. You may look into the Plank/PAR2 sources. My understanding is that mulE should look like result := a*exp(b) = exp(log(a)+b). Here, exp and log can be represented with 65535-entry tables (omitting the 0 value); each entry is 2 bytes long, so both tables together occupy 256 KB, and we check for zeros/overflows:

```c
// return exp(log(a)+b), b != 0
mulE(a, b)
{
    if (a == 0) return 0;
    loga = log[a];
    logres = loga + b;
    // logres %= 65535
    logres = (logres & 0xFFFF) + (logres >> 16);
    return exp[logres];
}
```

Alternatively, we can replace the modular reduction with a double-length exp table (2·65535 entries), so that log(a)+b can index it directly. The alternative approach requires 2 memory reads and 2 ALU operations, i.e. it can be executed in about 1 CPU cycle. The first approach needs 2 reads and 5 ALU ops, i.e. about 2 cycles.
Just pushed some fallbacks for the 8-bit version that are only 5x slower, using this table:

```c
static ffe_t Multiply8LUT[256 * 256];
```

I copied some approaches that worked well for the GF256 library and adapted them here.
Relevant benchmarks from GF-Complete:
- The first one is representative of the current approach:
- This is the XOR-only version that would be incompatible:
- This is the current SSSE3+ approach:
This looks like about 2 CPU cycles/byte, 1 cpb, and 0.5 cpb, respectively. The first approach requires 1 memory read per operation. According to the Intel optimization manual, the Skylake L2$ can deliver one line in 64/29 ≈ 2 cycles, and the L3$ in 64/18 ≈ 4 cycles (Table 2-4, Cache Parameters of the Skylake Microarchitecture), so it's clearly limited by cache throughput. I think a 12-bit implementation is an obviously interesting goal, both to improve the O(65536) decoding behaviour and to get a faster encode, with and without SSSE3. Among CPUs installed today, 99% have SSSE3 support. So, as much as I think support for older CPUs is needed for compatibility, I also think a separate, incompatible XOR mode for those CPUs doesn't make sense.
There is a port to SIMDe in the simde branch of my fork, which should work everywhere. I haven't spent any time tuning it, so there is probably some low-hanging fruit someone could identify with a profiler… Performance will likely depend heavily on auto-vectorization, since for now SIMDe relies on it, though many functions also have native NEON implementations.
It would be great if Leopard could detect at runtime whether the CPU supports AVX2, SSSE3, or neither, and execute the appropriate code.