Make it possible to customize CRC polynomial and order #9

federicomenaquintero opened this issue May 28, 2019 · 12 comments

federicomenaquintero commented May 28, 2019

This is part feature request, part "I don't really know what I'm doing" :)

I'm playing around with some code to port parts of bzip2 to Rust. Its hand-written crc32 code looks like this: https://gitlab.com/federicomenaquintero/bzip2/blob/e4cdd603608e8dc87a526d287369775601c63df9/crc32.c

Here is a test with a few computed checksums: https://gitlab.com/federicomenaquintero/bzip2/blob/e4cdd603608e8dc87a526d287369775601c63df9/crc32test.c

I think this is a different CRC order than what crc32fast uses. After I asked for help, someone kindly suggested using the custom constructor from the crc crate: https://gitlab.com/federicomenaquintero/bzip2/blob/6ef10026a74af85437a4deca6957889d9e8bdbf8/bzlib_rust/src/lib.rs

That last link has Rust code that computes the same CRC32 values as the original C code. I would rather use crc32fast for this, of course, and maybe speed up bzip2 as a result :)

Would it be possible for crc32fast to provide enough machinery to replace this kind of C code, which may use any arbitrary variant of CRC32?

valpackett commented Jun 1, 2019

I found a very interesting document that explained how your "crc32a" / crc::CalcType::Normal calculation is different: it does not reverse the order of bits, while the "crc32b" calculation used in most places does.

I'm not familiar with @srijs's amd64 SSE implementation; it may be possible to use the non-reversed form directly there. On aarch64 (ARMv8-A), however, the reversed calculation is implemented directly in hardware, so the only option is the obvious one: reverse the input and output bits :)

#![feature(asm, stdsimd, core_intrinsics, aarch64_target_feature)]
use std::arch::aarch64 as arch;
use std::intrinsics::bitreverse;

#[inline(always)]
unsafe fn reverse8x8(x: u64) -> u64 {
    // do not use llvm.bitreverse on a u8x8, that does not emit rbit
    let result: u64;
    asm!("rbit $1.8b, $0.8b" : "=w"(result) : "w"(x));
    result
}

#[target_feature(enable = "crc")]
pub unsafe fn calculate(crc: u32, data: &[u8]) -> u32 {
    let mut c32 = !crc; // XXX: here too?

    let (pre_quad, quads, post_quad) = data.align_to::<u64>();

    c32 = pre_quad.iter().fold(c32, |acc, &b| arch::__crc32b(acc, bitreverse(b)));

    // unrolling increases performance by ~~a lot~~ not THAT much with the reverses in there
    let mut quad_iter = quads.chunks_exact(8);
    for chunk in &mut quad_iter {
        c32 = arch::__crc32d(c32, reverse8x8(chunk[0]));
        c32 = arch::__crc32d(c32, reverse8x8(chunk[1]));
        c32 = arch::__crc32d(c32, reverse8x8(chunk[2]));
        c32 = arch::__crc32d(c32, reverse8x8(chunk[3]));
        c32 = arch::__crc32d(c32, reverse8x8(chunk[4]));
        c32 = arch::__crc32d(c32, reverse8x8(chunk[5]));
        c32 = arch::__crc32d(c32, reverse8x8(chunk[6]));
        c32 = arch::__crc32d(c32, reverse8x8(chunk[7]));
    }
    c32 = quad_iter
        .remainder()
        .iter()
        .fold(c32, |acc, &q| arch::__crc32d(acc, reverse8x8(q)));

    c32 = post_quad.iter().fold(c32, |acc, &b| arch::__crc32b(acc, bitreverse(b)));

    bitreverse(!c32)
}

fn main() {
    println!("{:x}", unsafe { calculate(0, b"123456789") }); // fc891918
    println!("{:x}", unsafe { calculate(0, b"") }); // 0
    println!("{:x}", unsafe { calculate(0, b" ") }); // 29d4f6ab
    println!("{:x}", unsafe { calculate(0, b"hello world") }); // 44f71378
    let buf4 =
        concat!("Lorem ipsum dolor sit amet, consectetur adipiscing elit, ",
                "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ",
                "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ",
                "ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit ",
                "in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur ",
                "sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt ",
                "mollit anim id est laborum.");
    println!("{:x}", unsafe { calculate(0, buf4.as_bytes()) }); // d31de6c9
}

On a Cortex-A72:

test bench_kilobyte_specialized ... bench:         186 ns/iter (+/- 0) = 5505 MB/s
test bench_megabyte_specialized ... bench:     246,951 ns/iter (+/- 91,406) = 4246 MB/s

Unfortunately not as fast as the original version without the reverses:

test bench_kilobyte_specialized ... bench:          88 ns/iter (+/- 0) = 11636 MB/s
test bench_megabyte_specialized ... bench:     154,795 ns/iter (+/- 54,328) = 6773 MB/s

(but still, baseline (table) version is only <800 MB/s!!)

The slowdown is because we need to reverse the bits within each byte (not across the whole 64-bit word), which requires the vector form of the RBIT instruction; that operates on vector registers, while the CRC32 instructions operate on general-purpose ones :(

@valpackett

It's also possible to implement CRC32 using the polynomial multiplication instructions, which would allow using the non-reversed polynomial directly, I guess. No idea about the performance of that, though :)

@paolobarbolini

This problem also took me some time to figure out, and for some reason I didn't see this issue while researching it. Since the last answer was from almost 2 years ago, here's my updated solution for wrapping crc32fast to make it calculate crc32a instead of crc32b.

https://github.com/paolobarbolini/bzip2-rs/blob/59073eeb1b58f95a6d7809c4d2889ef297291a06/src/crc.rs#L1-L56
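For anyone landing here, the underlying trick can be sketched in a few lines of safe Rust (this is a simplification, not the linked code; allocating a reversed copy makes it slower than a real implementation):

use crc32fast::Hasher;

// Sketch: compute the non-reflected (BZIP2-style) CRC32 by wrapping the
// reflected one that crc32fast implements -- bit-reverse every input byte,
// then bit-reverse the 32-bit result.
fn crc32a(data: &[u8]) -> u32 {
    let reversed: Vec<u8> = data.iter().map(|b| b.reverse_bits()).collect();
    let mut hasher = Hasher::new();
    hasher.update(&reversed);
    hasher.finalize().reverse_bits()
}

// crc32a(b"123456789") == 0xFC891918, matching the aarch64 snippet above.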

ghost commented Apr 15, 2022

So the funny thing here is that the Intel paper describing this optimization actually answers this. You can read it here:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

The main section of the paper talks about 802.3 (Ethernet, FDDI, etc.) checksums, which are the same as BZIP2. If you use the details from that section, you should get a BZIP2 compatible version. It's only on page 18, in the section titled "Bit-Reflection", that they get to talking about ZIP (specifically mentioning gzip) checksums, and how to permute the algorithm to work for it. The coefficients here are identical to the ones in the crate implementation (answering some confusion in the source code).

To understand why this works, it might help to understand how a CRC works in general. It's often described as "polynomial" division, but this isn't the math you learned in algebra class. If you think of it as a bit-wise long division of the message data by the "polynomial", you're actually xoring the "polynomial" on each step instead of subtracting it as you expect. (Technically this is subtraction in GF(2), but just forget you ever read that.)

Because of the peculiar math of CRCs, we can actually shift in input bits in all sorts of ways and orders. Tacking another input bit on the end and doing another step of our long division is more or less what the canonical 1-bit CRC loop does:

		for (j = 0; j < 8; j++) {
			carry = ((crc & 0x01) ? 1 : 0) ^ (c & 0x01);
			crc >>= 1;
			c >>= 1;
			if (carry)
				crc ^= ETHER_CRC_POLY_LE;
		}

But ho! It turns out if we xor in 8 bits at a time, that works just fine too. So for example we could do this:

		crc ^= c;
		for (j = 0; j < 8; j++) {
			carry = ((crc & 0x01) ? 1 : 0);
			crc >>= 1;
			if (carry)
				crc ^= ETHER_CRC_POLY_LE;
		}

The next 8 steps of our long division depend only on the 8 MSB bits of the remainder/accumulator at this point (here actually in the LSBs; see below), so rather than convoluting 8 more times, we can build a lookup table and do one convolution. And thus we get by far the most common software CRC implementation in the universe. It often looks something like this (example taken from zlib):

            crc = (crc >> 8) ^ crc_table[(crc ^ *buf++) & 0xff];
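To make the table-driven form concrete, here's a minimal sketch (not zlib's actual code, names made up): each table entry is just the result of running 8 division steps on one possible byte value, so the per-byte update collapses to a lookup and an xor.

// Reflected (LSB-first) CRC-32, ZIP/ISO-HDLC convention. Sketch only.
const POLY_LE: u32 = 0xEDB88320; // 0x04C11DB7 bit-reversed

fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        let mut crc = i as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ POLY_LE } else { crc >> 1 };
        }
        *entry = crc;
    }
    table
}

fn crc32_zip(data: &[u8]) -> u32 {
    let table = make_table(); // rebuilt on every call just to keep the sketch short
    let mut crc = !0u32;
    for &b in data {
        // the zlib one-liner quoted above
        crc = (crc >> 8) ^ table[((crc ^ b as u32) & 0xff) as usize];
    }
    !crc
}

// crc32_zip(b"123456789") == 0xCBF43926 (the standard CRC-32 check value)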

But ho! For some reason people can't make up their minds whether they want data bytes to go in LSB first or MSB first. And thus we have so-called "reflection". Like "polynomials", it's a silly name and doesn't really correspond to how we calculate things. But the main thrust is that, because our table-driven algorithm can work equally well shifting the accumulator in either direction, we generally choose the direction that makes the input data line up nicely, so we can xor it into the accumulator directly without bit-reversing.

This is why, despite using the same "polynomial", a typical BZIP2 byte-wide implementation will shift the accumulator to the left and (effectively) insert data at the top (example taken from bzip2):

{                                              \
   crcVar = (crcVar << 8) ^                    \
            crc32Table[(crcVar >> 24) ^        \
                       ((UChar)cha)];          \
}

But a similar ZIP CRC computation will shift to the right and insert data at the bottom (where have I seen this before):

            crc = (crc >> 8) ^ crc_table[(crc ^ *buf++) & 0xff];

The table computation of course has to be adjusted depending on the direction as well. But the thrust is that, as long as "reflectin == reflectout" (spoiler: it is for any CRC you care about), you don't ever have to reverse bits.
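For comparison, here's a sketch of the non-reflected (BZIP2/802.3) direction: same polynomial, but the accumulator shifts left and the data byte enters at the top, so the table is indexed by the high byte instead of the low one.

// Non-reflected (MSB-first) CRC-32, BZIP2/802.3 convention. Sketch only.
const POLY_BE: u32 = 0x04C11DB7;

fn make_table_be() -> [u32; 256] {
    let mut table = [0u32; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        let mut crc = (i as u32) << 24;
        for _ in 0..8 {
            crc = if crc & 0x8000_0000 != 0 { (crc << 1) ^ POLY_BE } else { crc << 1 };
        }
        *entry = crc;
    }
    table
}

fn crc32_bzip2(data: &[u8]) -> u32 {
    let table = make_table_be();
    let mut crc = !0u32;
    for &b in data {
        // same shape as the bzip2 macro quoted above
        crc = (crc << 8) ^ table[((crc >> 24) ^ b as u32) as usize];
    }
    !crc
}

// crc32_bzip2(b"123456789") == 0xFC891918 -- no bit reversal anywhere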

If you're shifting in a value the size of your accumulator (say, 32 bits at a time if you're doing the common 4-table version or just unrolling the one-table version), you can even leave your accumulator and table in whatever byte order is fastest for reading data from memory and xor the input data with no byte swapping. In fact zlib does this, only byte-swapping while generating the table and between input chunks:

        crc_big_table[i] = byte_swap(p);
            crc0 = byte_swap(crc);
            crc = byte_swap(comb);

BTW, the choice of direction isn't entirely random. For many things, it corresponds directly with a physical transmission order. For 802.3, the transmission order is always MSB first, so our hardware CRC checker naturally wants to compute the CRC that way. For HDLC, which was most commonly used in ISDN, the transmission order was LSB first, so the CRC was arranged that way. Pure software implementations often blindly copy a known CRC algorithm, and thus BZIP2 used the 802.3 method and PKZIP used the HDLC method.

This makes even more sense when you realize that PKZIP was created in a primarily modem world (and specifically to replace ARC for BBS use), which was often LSB first, and BZIP2 was created in the Internet era, where everyone is connected in some fashion through Ethernet (802.3) and/or WiFi (802.11, which uses the same CRC).

[I gotta sleep. I'll write part 2 of this later.]

ghost commented Apr 15, 2022

Still working on the next part, but I realized an example might be useful here, to clarify why CRC algorithms are written the way they are.

Here's an example of the long division written out verbosely, which I pulled from an old CRC tutorial that doesn't actually explain it well:

            1100001010 = Quotient (nobody cares about the quotient)
       _______________
10011 ) 11010110110000 = Augmented message (1101011011 + 0000)
=Poly   10011,,.,,....
        -----,,.,,....
         10011,.,,....
         10011,.,,....
         -----,.,,....
          00001.,,....
          00000.,,....
          -----.,,....
           00010,,....
           00000,,....
           -----,,....
            00101,....
            00000,....
            -----,....
             01011....
             00000....
             -----....
              10110...
              10011...
              -----...
               01010..
               00000..
               -----..
                10100.
                10011.
                -----.
                 01110
                 00000
                 -----
                  1110 = Remainder = THE CHECKSUM!!!!

There are a couple of things I want to draw attention to here. One is that, due to the weird mathematics going on, whether we generate a 0 or 1 in the quotient at any given moment depends ONLY on the topmost bit of the remainder at the time. So why is it that our condition in this code seems to use the next input bit as well?

			carry = ((crc & 0x01) ? 1 : 0) ^ (c & 0x01);
			crc >>= 1;
			c >>= 1;
			if (carry)
				crc ^= ETHER_CRC_POLY_LE;

Notice how, at each step of the division, another bit is getting sucked into the LSB of our remainder. This is exactly like when you did long division in grade school, except it's binary and you're using xor instead of subtraction. (So really not much like that at all, but bear with me.) Here's the rub: that bit just sits there passively, maybe getting xored with some other bits it has no control over, until it becomes the MSB. Further, we don't want to overread past the end of the data, or have to pad it in memory.

So, since xor has no carries or anything to worry about, we lazily xor the input bit only when we need it, which is to say at the moment it becomes the MSB, yielding the above algorithm. The byte-wide and word-wide algorithms work because we can really xor this bit in any time, as long as it's not past the end of the accumulator, and so we do the more efficient thing of adding (whoops, I mean xoring) a bunch of bits at once.
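A quick way to convince yourself of the "xor it in any time" claim is a toy comparison like this (a sketch, not taken from any real implementation): the lazy one-bit-at-a-time form and the eager xor-the-whole-byte-up-front form stay in lockstep.

const POLY_LE: u32 = 0xEDB88320;

// Lazy: xor each input bit in only when it reaches the LSB (like the C loop above).
fn step_lazy(mut crc: u32, mut byte: u8) -> u32 {
    for _ in 0..8 {
        let carry = (crc ^ byte as u32) & 1;
        crc >>= 1;
        byte >>= 1;
        if carry != 0 { crc ^= POLY_LE; }
    }
    crc
}

// Eager: xor the whole byte in up front, then do 8 plain division steps.
fn step_eager(mut crc: u32, byte: u8) -> u32 {
    crc ^= byte as u32;
    for _ in 0..8 {
        let carry = crc & 1;
        crc >>= 1;
        if carry != 0 { crc ^= POLY_LE; }
    }
    crc
}

fn main() {
    let (mut a, mut b) = (!0u32, !0u32);
    for &byte in b"any data at all" {
        a = step_lazy(a, byte);
        b = step_eager(b, byte);
        assert_eq!(a, b); // identical remainders after every byte
    }
}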

This has a couple of indirect implications. One is that the final state has the accumulator effectively thinking it saw a bunch of 0s past the end of the message. Some CRC variants (like ZIP) want this to be all 1s instead--so we just xor it in at the end. This is why crc-catalog has an "xorout" field.

The other is that we're assuming the state of the remainder/accumulator was 0s at the beginning. Remember we might be continuing a previous computation. That previous computation also didn't know what the next input bits would be, which is yet another reason for adding--XORING--them late. All of this yields an extremely simple algorithm that also works well for piecewise computation, which ticks all the right boxes.

Lastly, some CRC algorithms for whatever reason think the remainder should start with all 1s (or sometimes more exotic values, although most of these have rational explanations). This is why crc-catalog has an "init" field. (A more concrete reason for this requirement is that, if we start with all 0s, a leading sequence of 0s in the input will not cause the remainder to change at all. We could wildly change the number of leading bits and not be able to detect it. This would be an unfortunate blind spot for serial transmission protocols.)
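The leading-zeros blind spot is easy to demonstrate with the same kind of toy loop (a sketch, non-reflected form, no final xor): with an all-zero initial remainder, prepending zero bytes changes nothing, while an all-ones init catches them.

const POLY_BE: u32 = 0x04C11DB7;

fn crc32_raw(init: u32, data: &[u8]) -> u32 {
    let mut crc = init;
    for &b in data {
        crc ^= (b as u32) << 24; // xor the next 8 input bits in at the top
        for _ in 0..8 {
            crc = if crc & 0x8000_0000 != 0 { (crc << 1) ^ POLY_BE } else { crc << 1 };
        }
    }
    crc // no final xor, to keep the effect visible
}

fn main() {
    assert_eq!(crc32_raw(0, b"data"), crc32_raw(0, b"\0\0\0data"));   // undetected!
    assert_ne!(crc32_raw(!0, b"data"), crc32_raw(!0, b"\0\0\0data")); // detected
}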

ghost commented Apr 15, 2022

Now that we (hopefully) understand the basic building blocks of a CRC, we can talk about the "folding" algorithm and try to clarify the mathematics a bit. From the Intel paper, you'll see they define a CRC this way:

CRC (M(x)) = x^deg(P(x)) • M(x) mod P(x)

Now if you recall from our long division example, we imagined some 0 bits stuffed in at the bottom:

10011 ) 11010110110000 = Augmented message (1101011011 + 0000)

Well it turns out the scary algebra is exactly that padding... for our 4-bit CRC, we're left-shifting 4 bits and stuffing 4 bits in at the bottom. For a 32-bit CRC, "x^deg(P(x)) • M(x)" means what as a programmer you might more naturally think of as "M << 32". Let's not think about this padding too much right now, and just call our extended message M.

Now modular multiplication has a funky identity rule:

(A • T + B) mod P = ((A • (T mod P)) + B) mod P

And this identity also applies to our GF(2) math, just remember that + is really xor.

How does this help us? Let's give an example. Say we have a 64-bit message. We can break this down into two components A and B that are 32 bits and "fold" them together like this:

CRC(M) = M mod P == (A • 1<<32 xor B) mod P == (A • (1<<32 mod P) xor B) mod P

Now if we precompute (1<<32 mod P) our equation comes out to:

const K = 1<<32 mod P
CRC(M) = ... = ((A • K) xor B) mod P

And we can keep chaining these "folds" ad infinitum. In fact this is one approach for implementing a fast CRC calculation in hardware. However, in software, there has traditionally been no carry-less multiply instruction, so the table-based approach has almost always been preferred for speed.

This precomputed table is effectively all the possible values of (A•K mod P). For 4-, 8-, and 16-way unfolded versions, we calculate additional tables with different K values to adjust for the distance. Luckily in this case we can include the "mod P" in the table, so our intermediates are the same size as our accumulator and the only convolution we need is xor. This is why everyone loves that implementation.

I will include by reference one tableless software implementation to give you an idea if you care:
CRC32 with 8-bit tableless multiply

But if we just happen to find a carry-less multiply instruction in our back pocket, we can do this same fold in just two instructions... a multiply and an xor... and no table! Pretty cool, eh? It only took us about 30 years to get the instruction (and we still don't have it in the ALU).
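Here's a small software-only sketch of that fold (no pclmulqdq, helper names made up), checking that one multiply-and-xor fold really does give the same remainder as reducing the whole 64-bit message:

// Carry-less multiply of two 32-bit polynomials -> 64-bit product.
fn clmul(a: u32, b: u32) -> u64 {
    let mut acc = 0u64;
    for i in 0..32 {
        if (a >> i) & 1 == 1 {
            acc ^= (b as u64) << i;
        }
    }
    acc
}

// Reduce a 64-bit polynomial modulo the 33-bit polynomial p.
fn poly_mod(mut x: u64, p: u64) -> u32 {
    for i in (32..64).rev() {
        if (x >> i) & 1 == 1 {
            x ^= p << (i - 32);
        }
    }
    x as u32
}

fn main() {
    let p: u64 = 0x1_04C1_1DB7;           // CRC-32 polynomial, non-reflected
    let k = poly_mod(1u64 << 32, p);      // K = 1<<32 mod P
    let (a, b) = (0xDEAD_BEEFu32, 0x1234_5678u32);

    // Reduce the 64-bit message A||B directly...
    let direct = poly_mod(((a as u64) << 32) ^ b as u64, p);
    // ...or fold: one carry-less multiply and one xor, then reduce.
    let folded = poly_mod(clmul(a, k) ^ b as u64, p);
    assert_eq!(direct, folded);
}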

Well it turns out we can fold values that are even further apart, just remember that our "x^n" factor increases accordingly. And we can also fold chunks that are wider than our polynomial, it doesn't matter. So for example, say we want to fold two 64-bit chunks that are 512 bits apart:

const K = 1<<512 mod P
F mod P == (A<<512 xor B) mod P == (A•K xor B) mod P

Now why would we want to do this? If we're only folding adjacent chunks, we have linear dependencies and fetch dependencies which might limit our performance on a pipelined CPU. By interleaving multiple operations, we avoid these pipelining limits. And since we're multiplying by a remainder value, our intermediate value is no larger than our input chunk size plus our polynomial order (in this case 96 bits), no matter the stride distance.

All we have to do is clean up at the end by folding these few remaining adjacent chunks together and adding any extra padding at the end. This gives us additional K values for the closer folds, which you see in the paper. (Or we could just stuff them through a simpler CRC shifter at this point, which is something zlib does, but then we'd likely be introducing a dreaded table.)

ghost commented Apr 15, 2022

For extra credit, I guess we can talk about how this allows us to parallelize CRC computations.

Imagine we have a large data set split up into blocks, and we have simultaneously folded each block into just a single value at the end. We could in principle fold these block CRCs together as well, using:

const K = 1<<blocksize*8 mod P

And thus we would have truly the fastest CRC generator in the world. Problem is there is no I/O subsystem that could feed it. It might be fun for bragging purposes, though. Imagine an i9-12900k cranking >100GB/s. Could we achieve 200x as fast as the crc crate? Would anyone be embarrassed?
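That block-level combine is easy to check in software too. Here's a self-contained sketch (helper names made up): compute the raw CRC of two blocks independently, fold the first block's CRC across the length of the second, and it matches the CRC of the concatenation. The K here is computed by repeated squaring, which is exactly the trick described next.

const P: u64 = 0x1_04C1_1DB7; // CRC-32 polynomial, non-reflected

// Carry-less multiply of two 32-bit polynomials -> 64-bit product.
fn clmul(a: u32, b: u32) -> u64 {
    let mut acc = 0u64;
    for i in 0..32 {
        if (a >> i) & 1 == 1 { acc ^= (b as u64) << i; }
    }
    acc
}

// Reduce a 64-bit polynomial modulo P.
fn poly_mod(mut x: u64) -> u32 {
    for i in (32..64).rev() {
        if (x >> i) & 1 == 1 { x ^= P << (i - 32); }
    }
    x as u32
}

// x^exp mod P, by square-and-multiply on 32-bit residues.
fn x_pow_mod(mut exp: u64) -> u32 {
    let (mut result, mut base) = (1u32, 2u32); // the polynomials "1" and "x"
    while exp > 0 {
        if exp & 1 == 1 { result = poly_mod(clmul(result, base)); }
        base = poly_mod(clmul(base, base));
        exp >>= 1;
    }
    result
}

// Plain bit-at-a-time CRC with init = 0 and no final xor (the "raw" remainder).
fn crc_raw(data: &[u8]) -> u32 {
    let mut crc = 0u32;
    for &b in data {
        crc ^= (b as u32) << 24;
        for _ in 0..8 {
            crc = if crc & 0x8000_0000 != 0 { (crc << 1) ^ (P as u32) } else { crc << 1 };
        }
    }
    crc
}

fn main() {
    let (a, b) = (b"hello ".as_slice(), b"world".as_slice());
    let whole = crc_raw(b"hello world");

    // Combine the two independent block CRCs with one fold.
    let k = x_pow_mod(8 * b.len() as u64);             // K = 1 << (8*|B|) mod P
    let combined = poly_mod(clmul(crc_raw(a), k)) ^ crc_raw(b);
    assert_eq!(whole, combined);
}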

BTW, how can we easily calculate K? By squaring it! Think about extending this sequence:

1<< 64 mod P = ((1<< 32) * (1<< 32)) mod P  # k6 in the paper
             = ((1<< 32 mod P) * (1<< 32 mod P)) mod P
1<<128 mod P = ((1<< 64) * (1<< 64)) mod P  # k4 in the paper
             = ((1<< 64 mod P) * (1<< 64 mod P)) mod P
1<<256 mod P = ((1<<128) * (1<<128)) mod P
             = ((1<<128 mod P) * (1<<128 mod P)) mod P

You can try this in MATLAB (for free on the web):

p = flip([1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1]);
k6 = flip([0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0 1 1 0 1]);
a = gfconv(k6, k6, 2); % multiply
[q, r] = gfdeconv(a, p, 2); % divide
flip(r)

This gives us:

1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1

Which just happens to be the value of k4.

And now you also see how you can painlessly calculate the coefficients for any CRC you want to implement using this method:

p = flip([1 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1]);
o32 = flip([1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]);
[q, k32] = gfdeconv(o32, p, 2);
[q, k6] = gfdeconv(gfconv(k32, k32, 2), p, 2);
flip(k6)
[q, k5] = gfdeconv(gfconv(k32, k6, 2), p, 2);
flip(k5)
[q, k4] = gfdeconv(gfconv(k6, k6, 2), p, 2);
flip(k4)
[q, k3] = gfdeconv(gfconv(k4, k6, 2), p, 2);
flip(k3)
[q, k256] = gfdeconv(gfconv(k4, k4, 2), p, 2);
[q, k2] = gfdeconv(gfconv(k256, k256, 2), p, 2);
flip(k2)
[q, k1] = gfdeconv(gfconv(k2, k6, 2), p, 2);
flip(k1)
[u, r] = gfdeconv(gfconv(o32, o32, 2), p, 2);
flip(u)

This should work for any "non-reflected" CRC up to at least 32 bits (maybe 64, but I haven't thought it through carefully). For a "reflected" CRC you have to apply the changes mentioned in the paper. For sizes other than 32 bits you also need to adjust the end stage reductions.

BTW, all these gfconv()s map to pclmul instructions (there are no large operands or results, and they could also be coded as 1-bit loops). The gfdeconv()s can all be calculated with a 1-bit CRC shift register loop. What I'm saying here is you could easily write a tool in Rust to calculate these upfront for every CRC in crc-catalog, or even fall back to calculating them on the fly while creating a Crc object. It would still be faster than initializing a byte-wide CRC table!

ghost commented Apr 16, 2022

Ah, heck. Here's a proof of concept that calculates the coefficients for all the 32-bit CRCs in crc-catalog.

[Edit: Code updated for simplicity.]

// proof of concept by mycroft

fn multiple(a: u64, b: u64) -> u64 {
  (0..32).fold(0, |num, bit| ((((a >> bit) & 1) * b) << bit) ^ num)
}

fn remainder(a: u64, p: u64) -> u64 {
  let mut r = a;
  for _ in 0..32 {
    let c = r >> 63;
    r = (r << 1) ^ (c * (p << 32));
  }
  r >> 32
}

fn quotient(a: u64, p: u64) -> u64 {
  let mut q = 0;
  let mut r = a;
  for _ in 0..32 {
    let c = r >> 63;
    q = (q << 1) ^ c;
    r = (r << 1) ^ (c * (p << 32));
  }
  q
}

fn calc_non_reflected(n: &str, p: u64) {
  let x32 = remainder(0x100000000, p);			//
  let x64 = remainder(multiple(x32, x32), p);
  let x96 = remainder(multiple(x64, x32), p);
  let x128 = remainder(multiple(x64, x64), p);
  let x192 = remainder(multiple(x128, x64), p);
  let x256 = remainder(multiple(x128, x128), p);	//
  let x512 = remainder(multiple(x256, x256), p);
  let x576 = remainder(multiple(x512, x64), p);

  println!("coefficients for {}:", n);
  println!("p = {:X} (non-reflected)", p);
  println!("k1 = {:X}", x576);
  println!("k2 = {:X}", x512);
  println!("k3 = {:X}", x192);
  println!("k4 = {:X}", x128);
  println!("k5 = {:X}", x96);
  println!("k6 = {:X}", x64);
  // 1<<64 is one bit too large, so we hand code one iteration of division and add the bit
  println!("u = {:X}", 0x100000000 | quotient(p << 32, p));
  println!("");
}

fn rev(p: u64) -> u64 {
  let mut p = p;
  let mut r = 0;
  for _ in 0..33 {
    r = (r << 1) ^ (p & 1);
    p >>= 1;
  }
  r
}

fn calc_reflected(n: &str, p: u64) {
  let x32 = remainder(0x100000000, p);
  let x64 = remainder(multiple(x32, x32), p);
  let x96 = remainder(multiple(x64, x32), p);
  let x160 = remainder(multiple(x96, x64), p);
  let x224 = remainder(multiple(x160, x64), p);		//
  let x256 = remainder(multiple(x224, x32), p);		//
  let x480 = remainder(multiple(x256, x224), p);
  let x544 = remainder(multiple(x480, x64), p);

  println!("coefficients for {}:", n);
  let p_p = rev(p);
  println!("p' = {:X} (reflected)", p_p);
  println!("k1' = {:X}", remainder(rev(x544), p_p));
  println!("k2' = {:X}", remainder(rev(x480), p_p));
  println!("k3' = {:X}", remainder(rev(x160), p_p));
  println!("k4' = {:X}", remainder(rev(x96), p_p));
  println!("k5' = {:X}", remainder(rev(x64), p_p));
  println!("k6' = {:X}", remainder(rev(x32), p_p));
  // 1<<64 is one bit too large, so we hand code one iteration of division and add the bit
  println!("u' = {:X}", rev(0x100000000 | quotient(p << 32, p)));
  println!("");
}

fn main() {
  calc_non_reflected("AIXM", 0x1814141AB);
  calc_reflected("AUTOSAR", 0x1F4ACFB13);
  calc_reflected("BASE91_D", 0x1A833982B);
  calc_non_reflected("BZIP2/802.3/802.5/802.11/FDDI/CKSUM/MPEG_2", 0x104C11DB7);
  calc_reflected("CD_ROM_EDC", 0x18001801B);
  calc_reflected("ISCSI", 0x11EDC6F41);
  calc_reflected("ZIP/ISO_HDLC/JAMCRC", 0x104C11DB7);
  calc_non_reflected("XFER", 0x1000000AF);
}

ghost commented Apr 16, 2022

BTW the iSCSI coefficients I generated match IBM's implementation:
s390 crypto acceleration

ghost commented Apr 16, 2022

One last thing. If the leading bit (1<<32) of a reversed K value is 1, we can fold it into the lower 32 bits by dividing against the reversed P', limiting the K values to 32 bits. This is alluded to in the paper but not fleshed out. Mathematically it's part of an implicit divide step we did by left shifting the K value one bit when we bit-reversed it. (Since I was doing a 33-bit reversal on P and U, I just reused the same function, so the shift isn't obvious.)

All of this is an adjustment for the fact that in the bit-reversed case we wanted our 32x64 multiply to end up in the high part of the 128-bit result, but it ended up in the low 95 bits, effectively multiplying it by 2^33. So we're dividing all the K values by 2^33 (relative to the non-reflected case) to compensate without additional operations.

The net result is that these K values are exactly equivalent in function:

-const K1: i64 = 0x154442bd4;
-const K2: i64 = 0x1c6e41596;
-const K3: i64 = 0x1751997d0;
-const K4: i64 = 0x0ccaa009e;
-const K5: i64 = 0x163cd6124;
-const K6: i64 = 0x1db710640;
+const K1: i64 = 0x8f352d95;
+const K2: i64 = 0x1d9513d7;
+const K3: i64 = 0xae689191;
+const K4: i64 = 0xccaa009e;
+const K5: i64 = 0xb8bc6765;
+const K6: i64 = 0x00000001;

We can actually just call our remainder() function to do this:

-  println!("p' = {:X} (reflected)", rev(p));
-  println!("k1' = {:X}", rev(x544));
-  println!("k2' = {:X}", rev(x480));
-  println!("k3' = {:X}", rev(x160));
-  println!("k4' = {:X}", rev(x96));
-  println!("k5' = {:X}", rev(x64));
-  println!("k6' = {:X}", rev(x32));
+  let p_p = rev(p);
+  println!("p' = {:X} (reflected)", p_p);
+  println!("k1' = {:X}", remainder(rev(x544), p_p));
+  println!("k2' = {:X}", remainder(rev(x480), p_p));
+  println!("k3' = {:X}", remainder(rev(x160), p_p));
+  println!("k4' = {:X}", remainder(rev(x96), p_p));
+  println!("k5' = {:X}", remainder(rev(x64), p_p));
+  println!("k6' = {:X}", remainder(rev(x32), p_p));

BTW, after doing this, you'll notice that K6 is 1 for every reflected CRC. This is because it's actually 2^31 bit-reversed.

@lizhanhui

Hi @srijs, I see this crate targets 'Fast, SIMD-accelerated CRC32 (IEEE) checksum computation'. Would this feature and a potential pull request be considered? For example, supporting CRC-32C.

@valaphee
Copy link

valaphee commented Dec 9, 2023

Implemented a pclmulqdq-based CRC generator in mrhooray/crc-rs#109, which is abstracted similarly to crc64fast.

It's also slightly faster than crc32fast, as it uses aligned SIMD and can therefore better encode some load ops.
