Request: support of generic CPUs (w/o SSSE3) #3

Open
Bulat-Ziganshin opened this issue May 29, 2017 · 8 comments

@Bulat-Ziganshin

It would be great if Leopard could detect at runtime whether the CPU supports AVX2, SSSE3, or neither, and execute the appropriate code.

@catid
Owner

catid commented May 29, 2017

Currently it only detects AVX2/SSSE3. I'll have to add support for a table lookup (super slow) fallback.
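
A minimal sketch of what that runtime dispatch could look like, using GCC/Clang's __builtin_cpu_supports; the kernel names mul_mem_avx2 / mul_mem_ssse3 / mul_mem_ref are placeholders, not Leopard's actual functions:

#include <stddef.h>
#include <stdio.h>

typedef void (*mul_mem_fn)(void *out, const void *in, size_t bytes);

/* Placeholder kernels -- in a real build these would be the AVX2, SSSE3 and
   table-lookup implementations of the same region operation. */
static void mul_mem_avx2(void *out, const void *in, size_t bytes)  { (void)out; (void)in; (void)bytes; }
static void mul_mem_ssse3(void *out, const void *in, size_t bytes) { (void)out; (void)in; (void)bytes; }
static void mul_mem_ref(void *out, const void *in, size_t bytes)   { (void)out; (void)in; (void)bytes; }

static mul_mem_fn select_mul_mem(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2"))  return mul_mem_avx2;
    if (__builtin_cpu_supports("ssse3")) return mul_mem_ssse3;
#endif
    return mul_mem_ref;  /* generic CPUs get the table-lookup fallback */
}

int main(void)
{
    mul_mem_fn f = select_mul_mem();
    printf("%s kernel selected\n",
           f == mul_mem_avx2 ? "AVX2" : f == mul_mem_ssse3 ? "SSSE3" : "reference");
    return 0;
}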

@catid
Owner

catid commented May 29, 2017

Another good option here is to use the Longhair/CRS trick: split up the buffers by bit and do everything with XOR. This can achieve nearly the same performance in some cases! The only problem is that there won't be interop between the versions, so maybe that should be done in a different (new) library.
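
For reference, a rough sketch of the bit-sliced XOR idea, assuming GF(2^8) with the common 0x11D reduction polynomial (the polynomial and function names are illustrative, not Longhair's or Leopard's actual code): multiplication by a constant becomes an 8x8 bit matrix, and applying it to eight bit-plane buffers needs only XORs.

#include <stdint.h>
#include <stddef.h>

/* Multiply by x in GF(2^8) with reduction polynomial 0x11D (an assumption). */
static uint8_t gf_mul_x(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0x00));
}

/* Column j of the 8x8 bit matrix for "multiply by c" holds c * x^j. */
static void build_bitmatrix(uint8_t c, uint8_t col[8])
{
    for (int j = 0; j < 8; ++j) {
        col[j] = c;
        c = gf_mul_x(c);
    }
}

/* in[] and out[] are 8 bit-plane buffers of `bytes` bytes each (bit-sliced data).
   Output plane i accumulates every input plane j whose matrix bit (i, j) is set. */
static void xor_mul_region(uint8_t c, uint8_t *const out[8],
                           const uint8_t *const in[8], size_t bytes)
{
    uint8_t col[8];
    build_bitmatrix(c, col);
    for (int j = 0; j < 8; ++j)
        for (int i = 0; i < 8; ++i)
            if ((col[j] >> i) & 1)
                for (size_t k = 0; k < bytes; ++k)
                    out[i][k] ^= in[j][k];
}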

@catid
Owner

catid commented Jun 1, 2017

Yeah, using the reference version makes it run about 25x slower.

@Bulat-Ziganshin
Author

Bulat-Ziganshin commented Jun 1, 2017

In PAR2 experiments (see ECC_speed_comparison.zip) the non-SIMD implementation is only 4x slower (405.0 vs 1559.2 sec). You are probably using a non-optimal solution; you may want to look into the Plank/PAR2 sources. My understanding is that mulE should look like this:

result := a*exp(b) = exp(log(a)+b)

Here, exp and log can be represented by 65535-entry tables (omitting the 0 value); each entry is 2 bytes long, so both tables together occupy 256 KB, and we check for zeros/overflows:

// return a * exp(b) = exp(log(a) + b) in GF(2^16), with b != 0
static uint16_t mulE(uint16_t a, uint16_t b)
{
    if (a == 0)  return 0;
    uint32_t logres = (uint32_t)logTable[a] + b;

    // logres %= 65535 (single fold; needs expTable[65535] == expTable[0],
    // since a sum of exactly 65535 folds onto itself)
    logres = (logres & 0xFFFF) + (logres >> 16);

    return expTable[logres];
}
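
A sketch of how those tables could be built, using PAR2's generator polynomial 0x1100B as a stand-in (the polynomial and the table names are assumptions, not Leopard's): exp maps an exponent to a field element and log is its inverse.

#include <stdint.h>

static uint16_t logTable[65536];   /* logTable[0] is unused */
static uint16_t expTable[65536];   /* expTable[65535] aliases expTable[0] */

static void build_tables(void)
{
    uint32_t x = 1;
    for (uint32_t i = 0; i < 65535; ++i) {
        expTable[i] = (uint16_t)x;
        logTable[x] = (uint16_t)i;
        x <<= 1;                        /* multiply by the generator element 2 */
        if (x & 0x10000) x ^= 0x1100B;  /* reduce by the field polynomial */
    }
    /* Duplicate entry so the single folding step in mulE never reads out of
       bounds when the sum lands exactly on 65535. */
    expTable[65535] = expTable[0];
}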

Alternatively, we can replace the logres %= 65535 computation with a larger exp[] table containing 131070 elements, so the log+exp tables together occupy 384 KB, which is a bit too much for Intel L2 caches. I don't know which approach will be faster. For CPUs that don't have 256 KB caches (i.e. either <=128 KB or >=512 KB), the alternative approach should be faster anyway.

The alternative approach requires 2 memory reads and 2 ALU operations, i.e. it can execute in about 1 CPU cycle. The first approach needs 2 reads and 5 ALU ops, i.e. about 2 cycles.
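
A sketch of that alternative, again with hypothetical table names: the big table holds exp(i mod 65535) for i = 0..131069, so no reduction step is needed.

static uint16_t expTableBig[131070];   /* expTableBig[i] = exp(i % 65535) */

/* return a * exp(b) without the modular fold; logTable[a] <= 65534 and
   b <= 65535, so the index never exceeds 131069. */
static uint16_t mulE_alt(uint16_t a, uint16_t b)
{
    return a ? expTableBig[(uint32_t)logTable[a] + b] : 0;
}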

@catid
Owner

catid commented Jun 2, 2017

Just pushed some fallbacks for the 8-bit version that are only 5x slower using this table:

static ffe_t Multiply8LUT[256 * 256];

I copied some approaches that worked well for the GF256 library and adapted them here.
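
A sketch of how such a full 256x256 product table might be applied per byte; the row-major layout (row = constant, column = input byte) is an assumption about the table, not taken from Leopard's source.

#include <stdint.h>
#include <stddef.h>

typedef uint8_t ffe_t;                  /* 8-bit field element, as in the table above */
static ffe_t Multiply8LUT[256 * 256];   /* Multiply8LUT[c * 256 + x] = c * x (assumed layout) */

static void mul8_table(uint8_t *out, const uint8_t *in, uint8_t c, size_t bytes)
{
    const ffe_t *row = &Multiply8LUT[(size_t)c * 256];  /* all products of the constant c */
    for (size_t i = 0; i < bytes; ++i)
        out[i] = row[in[i]];
}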

@catid
Owner

catid commented Jun 2, 2017

Relevant benchmarks from GF-Complete:

The first one is representative of the current approach:
Region Best (MB/s): 1635.21 W-Method: 8 -m TABLE -r DOUBLE -

This is the XOR-only version that would be incompatible:
Region Best (MB/s): 2991.66 W-Method: 8 -m TABLE -r CAUCHY -

This is the current SSSE3+ approach:
Region Best (MB/s): 7098.46 W-Method: 8 -m SPLIT 8 4 -

@Bulat-Ziganshin
Author

Bulat-Ziganshin commented Jun 2, 2017

This looks like about 2 CPU cycles/byte, 1 cpb, and 0.5 cpb, respectively.
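
For reference, the conversion from the GF-Complete throughput numbers to cycles/byte, assuming a core clock around 3.5 GHz (the clock speed is not stated in the thread):

#include <stdio.h>

int main(void)
{
    const double ghz = 3.5;                                /* assumed clock */
    const double mbps[3] = { 1635.21, 2991.66, 7098.46 };  /* TABLE/DOUBLE, TABLE/CAUCHY, SPLIT 8 4 */
    for (int i = 0; i < 3; ++i)
        printf("%8.2f MB/s -> %.2f cycles/byte\n", mbps[i], ghz * 1e9 / (mbps[i] * 1e6));
    return 0;   /* prints roughly 2.1, 1.2 and 0.5 cycles/byte */
}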

The first approach requires 1 memory read per operation. According to the Intel optimization manual, Skylake's L2$ can deliver one line in 64/29 ≈ 2 cycles, and the L3$ in 64/18 ≈ 4 cycles (Table 2-4, Cache Parameters of the Skylake Microarchitecture), so it's clearly limited by cache throughput.

I think a 12-bit implementation is an obviously interesting goal, both to improve the decoder's O(65536) behaviour and to get faster encoding, with and without SSSE3.

Among currently installed CPUs, 99% have SSSE3 support. So as much as I think that support for older CPUs is required to keep compatibility, I also think that a separate, incompatible XOR mode for these CPUs doesn't make sense.

@nemequ
Contributor

nemequ commented Jul 21, 2018

There is a port to SIMDe in the simde branch in my fork which should work everywhere. I haven't spent any time tuning it, so there is probably some low-hanging fruit someone can identify with a profiler… Performance will likely depend pretty heavily on auto-vectorization since for now SIMDe relies on it pretty heavily, though many functions do also have native NEON implementations.
