Request: support of generic CPUs (w/o SSSE3) #3
Currently it only detects AVX2/SSSE3. I'll have to add support for a table-lookup (super slow) fallback.
Another good option here is to use the Longhair/CRS trick: split up the buffers by bit and do everything with XOR. This can achieve nearly the same performance in some cases! The only problem is that there won't be interop between the versions, so maybe that should be done in a different (new) library.
Yeah, using the reference version makes it run about 25x slower.
In PAR2 experiments (see ECC_speed_comparison.zip), the non-SIMD implementation is only 4x slower (405.0 vs 1559.2 sec). Probably you used a non-optimal solution. You may look into the Plank/PAR2 sources. My understanding is that mulE should look like result := a*exp(b) = exp(log(a)+b). Here, exp and log can be represented with 65535-entry tables (omitting the 0 value); each entry is 2 bytes long, so both tables together occupy 256 KB, and we check for zeros/overflows:

```c
// return exp(log(a)+b), b != 0
mulE(a, b)
{
    if (a == 0) return 0;
    loga = log[a];
    logres = loga + b;
    // logres %= 65535
    logres = (logres & 0xFFFF) + (logres >> 16);
    return exp[logres];
}
```

Alternatively, we can replace the modular reduction with a double-length exp table (2·65535 entries), so that log(a)+b can index it directly. The alternative approach requires 2 memory reads and 2 ALU operations, i.e. it can be executed in about 1 CPU cycle. The first approach needs 2 reads and 5 ALU ops, i.e. about 2 cycles.
Just pushed some fallbacks for the 8-bit version that are only 5x slower, using this table:

```c
static ffe_t Multiply8LUT[256 * 256];
```

I copied some approaches that worked well for the GF256 library and adapted them here.
Relevant benchmarks from GF-Complete:
- The first one is representative of the current approach:
- This is the XOR-only version that would be incompatible:
- This is the current SSSE3+ approach:
This looks like about 2 CPU cycles/byte, 1 cpb, and 0.5 cpb, respectively. The first approach requires 1 memory read per operation. According to the Intel optimization manual, the Skylake L2$ can deliver one line in 64/29 ≈ 2 cycles, and the L3$ in 64/18 ≈ 4 cycles (Table 2-4, Cache Parameters of the Skylake Microarchitecture), so it's clearly limited by cache throughput. I think a 12-bit implementation is an obviously interesting goal, both to improve the O(65536) decoding behaviour and to get a faster encode, with and without SSSE3. Among CPUs installed today, 99% have SSSE3 support. So, as much as I think support for older CPUs is needed for compatibility, I also think a separate, incompatible XOR mode for those CPUs doesn't make sense.
There is a port to SIMDe in the simde branch of my fork, which should work everywhere. I haven't spent any time tuning it, so there is probably some low-hanging fruit someone could identify with a profiler… Performance will likely depend heavily on auto-vectorization, since for now SIMDe relies on it, though many functions also have native NEON implementations.
It would be great if Leopard could detect at runtime whether the CPU supports AVX2, SSSE3, or neither, and execute the appropriate code.