Proposal: Good Data Conversion #1888

sampsyo (Maintainer) · 2024-02-02T19:54:56Z

This is a brief writeup of a proposal that @bcarlet and I have recently pitched to @Angelica-Schell.

The Calyx ecosystem has always relied on a bespoke data conversion system for running programs. This conversion is a part of fud, and it translates between flat "binary blobs" and human-readable JSON files that look like this.

The JSON representation is better for humans and other software-y tools, and the flat binary blobs (or their hex-encoded equivalent) are necessary for driving hardware simulation (or real hardware execution). This conversion stuff is a load-bearing component for a couple of reasons:

Sheer practicality: This makes it possible to run programs without a hex editor, and without manually crafting the bits for all your numbers. You can write your inputs in an actual text file, and you can read the outputs with cat. Life would be much worse without this.
Format abstraction: you can write your inputs once, as friendly decimal strings, and convert them to N different numerical formats with no further effort. (This usually involves some approximation, necessarily.)

As part of our numerics efforts, the current Python tooling is showing some limitations:

It is kinda slow.
Its correctness is not 100% clear, especially w/r/t rounding and approximating stuff that doesn't get represented precisely in an IEEE double-precision float.
It probably doesn't support arbitrary precision (i.e., BigInts or BigFloats).
The JSON format itself is a little strange; it's not clear why the format is a part of the file (rather than metadata separate from the file), and the specific JSON encoding of various numerical formats is pretty arbitrary.
It is intertwined with fud; it would be awesome if it were a separate (and separately tested) tool.

So the proposal here is to build, in Rust, a new standalone converter for numerical data blobs. The new converter would—in the limit—support arbitrary-precision fixed-point and floating-point formats, and it would convert between that binary data and various human-readable text file formats.

That is, there are two orthogonal axes going on here:

The numerical format (e.g., 10-bit integer, IEEE 32-bit float, bfloat16, fixed-point with 4 integer and 2 fractional bits).
The file format (e.g., binary, hex, JSON, newline-separated text).

We would like a tool that converts from any file/numerical format to any other file/numerical format, and which does so with correct rounding (i.e., with the best accuracy achievable given the destination format).

Some Details

Here are some file formats I think we should support:

The existing, fud-style JSON format, as linked above. This is good for compatibility. Interesting characteristics of this file format include that multiple memories can coexist in a single file, and the file itself specifies the numerical format to use when converting.
Flat binary blobs. This is the most straightforward possible format: the actual bits for every number are written out end to end. Only one memory per file. (So to convert a collection of different memories into this format, we need to generate an entire directory of binary files.) This is the format that Verilog's readmemb/writememb use, and it is also the "raw" format we would use to feed data into actual hardware.
The above, but hex-encoded as text. Also only one memory per file. This is the format that Verilog's readmemh/writememh use. This is the only other file format (beyond its native JSON) that the current fud-embedded converter supports.
Newline-delimited text files, i.e., one ASCII decimal number per line. (And one memory per file.) This is a new format that is meant to be a more direct reflection of the binary/hex formats, but converted to decimal text for human reading & writing.
Perhaps some other non-JSON text format we invent to allow multiple (named) memories per file? This would be a successor to the fud JSON file format. The reason to explore such a thing is that JSON, while the spec doesn't actually define an interpretation of number literals, is de facto constrained to doubles because so many implementations (including the web platform's JSON.parse, which is arguably the OG) have that limitation. And this would also take the numerical format out of the file, probably.
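To make the relationship between the newline-delimited text format and the hex format concrete, here is a minimal sketch of converting one into the other for unsigned integers. The function name `decimal_lines_to_hex`, the one-value-per-line hex layout, and the 64-bit width limit are all assumptions made for illustration; none of them come from fud or from an existing tool.

```rust
/// Convert newline-delimited unsigned decimal text into hex text, one value
/// per line, zero-padded to cover `width` bits.
/// Hypothetical helper: widths are limited to 1..=64 bits in this sketch.
fn decimal_lines_to_hex(input: &str, width: u32) -> Result<String, String> {
    if width == 0 || width > 64 {
        return Err(format!("unsupported width: {width}"));
    }
    // Number of hex digits needed to hold `width` bits.
    let digits = ((width + 3) / 4) as usize;
    // Largest value that fits in `width` bits.
    let max = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };

    let mut out = String::new();
    for (i, line) in input.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // tolerate blank lines
        }
        let value = line
            .parse::<u64>()
            .map_err(|e| format!("line {}: {e}", i + 1))?;
        if value > max {
            return Err(format!("line {}: {value} does not fit in {width} bits", i + 1));
        }
        // Zero-pad each value to `digits` hex characters.
        out.push_str(&format!("{value:0digits$x}\n"));
    }
    Ok(out)
}
```

For example, `decimal_lines_to_hex("10\n255\n0\n", 8)` yields the three lines `0a`, `ff`, and `00`, while a value of `256` at width 8 is rejected rather than silently truncated.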

Here are some numerical formats I think we should support:

Integers. Got to love them. Unsigned and signed (two's complement). Maybe these are a special case of fixed point (below) rather than being their own thing.
IEEE floating point, i.e., float, double, double double, 16-bit half precision.
Generalized IEEE floating point. While IEEE 754 actually only defines a short list of formats, it's not hard to generalize the definition to arbitrary mantissa and exponent widths. bfloat16, for example, is just one setting of these parameters.
Fixed point. One important thing we have learned recently is that it's probably best to describe fixed-point formats in an unconventional way: the two parameters are total bit width and exponent. Like, if width=8 and exponent=-4, then the 8-bit bit-pattern $n$ encodes the value $n \times 2^{-4}$. This is a bit different from, like, TI's Q notation, where the two parameters are integer bit width and fractional bit width. The reasons for this preference are (a) it's more similar to floating point, and (b) it lets you represent really big and really small numerical ranges without a bunch of extraneous zeroes. Anyway, we should support both unsigned and signed fixed-point numbers.
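The width/exponent description of fixed point can be sketched as follows. This is only an illustration under simplifying assumptions: the `FixedFormat` type and its methods are hypothetical names, the width is capped at 52 bits so that every pattern is exactly representable in an `f64`, the input is an `f64` (which sidesteps the decimal-parsing question), and ties round away from zero because that is what `f64::round` does.

```rust
/// A fixed-point format described by total bit width and exponent:
/// the bit pattern `n` encodes the value `n * 2^exponent`.
#[derive(Debug, Clone, Copy)]
struct FixedFormat {
    width: u32,    // total bits, assumed to be in 1..=52 for this sketch
    exponent: i32, // the scale factor is 2^exponent
    signed: bool,  // two's complement when true
}

impl FixedFormat {
    /// Decode a raw bit pattern into an f64 (exact for width <= 52).
    fn decode(&self, bits: u64) -> f64 {
        let n = if self.signed && (bits >> (self.width - 1)) & 1 == 1 {
            // Sign bit set: subtract 2^width from the unsigned pattern.
            bits as i64 - (1i64 << self.width)
        } else {
            bits as i64
        };
        n as f64 * 2f64.powi(self.exponent)
    }

    /// Encode an f64 as the nearest representable pattern, or None if the
    /// rounded value is out of range (NaN is also rejected).
    fn encode(&self, value: f64) -> Option<u64> {
        let scaled = (value * 2f64.powi(-self.exponent)).round();
        let (min, max) = if self.signed {
            (-(1i64 << (self.width - 1)), (1i64 << (self.width - 1)) - 1)
        } else {
            (0, (1i64 << self.width) - 1)
        };
        if !(scaled >= min as f64 && scaled <= max as f64) {
            return None;
        }
        // Keep only the low `width` bits of the two's-complement pattern.
        let mask = (1u64 << self.width) - 1;
        Some((scaled as i64 as u64) & mask)
    }
}
```

With `width: 8, exponent: -4, signed: false` (the same set of values as an unsigned format with 4 integer and 4 fractional bits), the pattern `0b0001_1000` is 24 and decodes to $24 \times 2^{-4} = 1.5$, and encoding `0.1` gives the pattern 2, i.e., the nearest representable value 0.125.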

At the most basic level, we want a tool you can invoke like this:

awesome_conversion_tool --from-format s17 --from-file hex --to-format double --to-file bin < something.hex > something.bin

That is, the four different formats involved in a conversion should be specified separately, with the possible exception of the text formats (plain text and JSON), where the representation comes with its own precision. That is, the decimal string "123.456" doesn't correspond to any specific numerical format; we just want to represent a number as close as possible to that decimal value.
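One way the two independent axes and their command-line spellings might be modeled is sketched below. Only `s17`, `double`, `hex`, and `bin` appear in the example invocation above; the `u<width>` form, `float`, `text`, and `json` spellings, and the type names `NumFormat` and `FileFormat`, are guesses made for illustration.

```rust
use std::str::FromStr;

/// Numerical formats for this sketch (a small subset of the proposal's list).
#[derive(Debug, PartialEq)]
enum NumFormat {
    Int { signed: bool, width: u32 },
    Float32,
    Float64,
}

/// File formats, chosen independently of the numerical format.
#[derive(Debug, PartialEq)]
enum FileFormat {
    Bin,
    Hex,
    Text,
    Json,
}

impl FromStr for NumFormat {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "float" => return Ok(NumFormat::Float32),
            "double" => return Ok(NumFormat::Float64),
            _ => {}
        }
        // Otherwise expect `u<width>` or `s<width>`, e.g. "s17".
        let signed = match s.chars().next() {
            Some('s') => true,
            Some('u') => false,
            _ => return Err(format!("unknown numerical format: {s}")),
        };
        // The first character is ASCII here, so slicing at byte 1 is safe.
        let width = s[1..]
            .parse::<u32>()
            .map_err(|_| format!("bad bit width in format: {s}"))?;
        if width == 0 {
            return Err(format!("zero-width format: {s}"));
        }
        Ok(NumFormat::Int { signed, width })
    }
}

impl FromStr for FileFormat {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "bin" => Ok(FileFormat::Bin),
            "hex" => Ok(FileFormat::Hex),
            "text" => Ok(FileFormat::Text),
            "json" => Ok(FileFormat::Json),
            _ => Err(format!("unknown file format: {s}")),
        }
    }
}
```

Keeping the two enums separate mirrors the orthogonality described earlier: `"s17".parse::<NumFormat>()` and `"hex".parse::<FileFormat>()` are resolved independently, so any numerical format can in principle pair with any file format.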

(FWIW, this business about accurately representing text/decimal numbers is the part I am most fuzzy on. I guess we need to pick a binary format that is guaranteed to be a fully faithful representation of any such decimal number, in some range? What is the set of decimal numbers that can be exactly represented (i.e., round-tripped without error) in at most 64 bits under some strategy? We would have to figure out what to do here without going fully into "BigRational" representations if it's not necessary…)

How to Approach It

I suggest that we start with these limitations:

Start with two formats only: plain (newline-delimited) text and binary. Maybe hex also just for readability.
No arbitrary precision. We only support fixed-point formats where the total number of bits per value is < 64, or whatever other restriction makes sense. (Eventually, we will want arbitrary precision, so you can have like 80-bit values. But by sidestepping this, we can avoid wrangling with a BigInt library or anything like that.)
Maybe not even floating point for now. We can come back to that later.
No need to hyper-optimize for speed yet. Just having it implemented in Rust will be plenty for now, and we can explore lots of other optimizations in the future.

Once all that works, we can adopt fud's JSON format for compatibility. That will let us start using it in real Calyx executions! Then we can start addressing more of the above desiderata, including new file formats and optimizations for speed.

Related Work

I have actually been super surprised to find that there are apparently not any tools like this already out there. It seems like something the world of hardware tooling would want? I just can't find much at all on GitHub, and certainly nothing with a focus on correctness. But I'd be very interested if other people know about something I don't!

I actually gave a very half-hearted try to implementing something along these lines long ago, in a repo called samizdat. This was a failure. I don't think I even learned very much from the experience, except that this is a deceptively interesting problem. Perhaps the only salvageable code from that effort is my little implementation of a numerical format enum, which also includes a string representation, so "f32" means single-precision float and "u4.2" means "unsigned fixed point with 4 integer and 2 fractional bits." Of course, I used the wrong parameters for fixed-point formats! So maybe even this is not very useful.

sampsyo (Maintainer, Author) · 2024-02-02T20:45:27Z

> I guess we need to pick a binary format that is guaranteed to be a fully faithful representation of any such decimal number, in some range?

Perhaps the following obvious strategy would not be a bad place to start: take a decimal string $X.Y$, split it on the ".", and parse the integer part $X$ and the fraction part $Y$ separately as usizes. Now you have two values that, of course, together can precisely represent the number within some limited range. Now you do math on those two numbers to get the representation you want.

This would be an alternative, for example, to starting with f64::from_str and going from there, or even something like BigRational::from_str. It may cause other problems, but at least it starts from a baseline that is perfectly precise (within easily stated bounds).
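A minimal sketch of this split-and-parse strategy for unsigned values follows. It is an illustration rather than a worked-out design: the function names are hypothetical, `u64` stands in for `usize`, and ties round up. One detail the sketch has to add is the number of digits in $Y$, because "1.05" and "1.5" both parse to a fraction part of 5 but denote $1 + 5/100$ and $1 + 5/10$.

```rust
/// Split an unsigned decimal string such as "123.456" into its integer part,
/// its fraction part, and the number of fraction digits.
fn parse_decimal(s: &str) -> Option<(u64, u64, u32)> {
    // A string without a "." is treated as having the fraction "0".
    let (x, y) = s.split_once('.').unwrap_or((s, "0"));
    // Require plain ASCII digits (str::parse alone would accept a leading '+').
    let all_digits = |p: &str| !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(x) || !all_digits(y) {
        return None;
    }
    Some((x.parse::<u64>().ok()?, y.parse::<u64>().ok()?, y.len() as u32))
}

/// Round X + Y / 10^frac_digits to the nearest multiple of 2^-frac_bits and
/// return the integer n such that the encoded value is n * 2^-frac_bits.
/// All arithmetic is on integers; returns None if an intermediate overflows.
fn to_fixed(int_part: u64, frac_part: u64, frac_digits: u32, frac_bits: u32) -> Option<u128> {
    let denom = 10u128.checked_pow(frac_digits)?;
    // The input value is exactly numer / denom.
    let numer = (int_part as u128)
        .checked_mul(denom)?
        .checked_add(frac_part as u128)?;
    // Scale by 2^frac_bits before dividing.
    let scaled = numer.checked_mul(1u128.checked_shl(frac_bits)?)?;
    // round(a / b) with ties up equals floor((2a + b) / (2b)).
    let twice = scaled.checked_mul(2)?.checked_add(denom)?;
    Some(twice / (2 * denom))
}
```

For "1.05" with 4 fractional bits, the exact value is $105/100$, and $105 \times 16 / 100 = 16.8$ rounds to 17, i.e., $17 \times 2^{-4} = 1.0625$. No `f64` is involved at any point, so the result does not depend on how a double would have approximated 1.05.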

bcarlet (Collaborator) · 2024-02-02T23:47:02Z

Sounds great! I would probably add hexadecimal floats (e.g., 0x1.921c8p+1) to the list of textual representations. These are nice since they're independent of the numerical format and (somewhat) human-readable, while also letting you represent bit-exact values without ballooning to hundreds of decimal digits. This isn't practically achievable with fud's current design choice of interpreting decimal numbers as exact values, as opposed to, say, rounding to the nearest double.
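To illustrate why hexadecimal floats are bit-exact, here is a small sketch that prints an `f64` in that notation by reading its sign, exponent, and fraction fields directly. The function name `to_hex_float` is hypothetical, `f64` is used only as a convenient stand-in for "some binary format", and the exact spelling choices (dropping trailing zero digits, writing subnormals with a leading 0) are assumptions rather than a specification.

```rust
/// Format a finite f64 as a C-style hexadecimal float, e.g. "0x1.8p+1" for 3.0.
/// Returns None for NaN and infinities.
fn to_hex_float(x: f64) -> Option<String> {
    if !x.is_finite() {
        return None;
    }
    let bits = x.to_bits();
    let sign = if bits >> 63 == 1 { "-" } else { "" };
    let biased_exp = ((bits >> 52) & 0x7ff) as i64;
    let frac = bits & ((1u64 << 52) - 1);

    let (lead, exp) = match (biased_exp, frac) {
        (0, 0) => (0, 0),              // zero
        (0, _) => (0, -1022),          // subnormal: 0.frac * 2^-1022
        _ => (1, biased_exp - 1023),   // normal: 1.frac * 2^(e - 1023)
    };

    // The 52 fraction bits are exactly 13 hex digits; drop trailing zeros.
    let mut digits = format!("{frac:013x}");
    while digits.ends_with('0') {
        digits.pop();
    }
    let point = if digits.is_empty() { "" } else { "." };
    Some(format!("{sign}0x{lead}{point}{digits}p{exp:+}"))
}
```

Every hex digit maps to exactly four fraction bits, so no rounding happens in either direction. Feeding in the double whose bit pattern is `0x400921c800000000` reproduces the example string `0x1.921c8p+1` from above.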

1 reply

sampsyo Feb 4, 2024
Maintainer Author

Great point!
