Add bytea casts #21
base: master
Thank you, it is an interesting use case. A few thoughts... I understand that, as the I wonder if, instead of a cast, it would be more appropriate to expose some import/export functions? The cast mpz to bytea is something I have used heavily for development to deal with the structure as a whole (i.e. a no-function cast) and I wouldn't really want to lose that feature. Something that probably ties in with your need to convert between binary and mpz is the implementation of binary send/receive functions (see #5). It would be mandatory that those functions dealt with the sign though. All in all it seems a feature surely useful for your use case, but because it is not completely generic (it doesn't cover the entire mpz domain) maybe it would be better implemented as functions. I don't think there is any performance difference between a cast and a function, right?
It makes sense to me that functions that don't deal with the entire domain would not make for valid casts or send/receive functions. We've only been dealing with encodings of values that are always-unsigned, so this hasn't been a problem for us so far, but it's definitely important to make choices that are widely applicable. My thoughts:
This would serve our own use-case pretty well; our production DB value columns could be (quite correctly) re-typed to be Even without the introduction of all these additional types and casts, we could still put them in our own DB, as long as the core arity-3 functions were there to define them in terms of. But I feel that at least the introduction of
85c3600 to 91b6e74
I've started working on making the changes I suggested earlier. (Our data scientist found a place where we need to interpret our I've added an explicit PG function, for now called I'm unsure whether this is the most efficient implementation for doing two's complement absolute-value "during" a libgmp import of a stream of bytes. I couldn't come up with a good way to get libgmp to do the whole absolute-value step itself (it seems like it'd require an allocation of a temporary mpz to hold an appropriate xor value), so I had to do part of it before the import, requiring an extra Also, again, let me know whether you think the design I outlined above (with the DOMAIN types et al) is one worth pursuing / one you'd want to have shipped as part of the extension, before I go and commit to actually implementing all of that.
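To make the two's-complement step discussed above concrete: one way to read a big-endian byte string as a signed integer, while only ever importing an unsigned magnitude, is to flip the bytes first when the sign bit is set, import the result, then add one and negate. The sketch below is in Python rather than C/libgmp, and the name `signed_from_be_bytes` is hypothetical. It shows the arithmetic only and is not the code in this patch.

```python
def signed_from_be_bytes(data: bytes) -> int:
    """Interpret big-endian two's-complement bytes as a signed integer.

    Hypothetical helper for illustration; not part of pgmp.
    """
    if not data:
        return 0
    if data[0] & 0x80 == 0:
        # Sign bit clear: the bytes are already the unsigned magnitude.
        return int.from_bytes(data, "big")
    # Sign bit set: flip every bit *before* the unsigned import ...
    inverted = bytes(b ^ 0xFF for b in data)
    # ... then finish the negation after it: x = -(~x + 1).
    return -(int.from_bytes(inverted, "big") + 1)
```

The pre-import bit flip needs a second buffer the size of the input, which is the kind of extra cost the comment above is concerned with. Whether libgmp can avoid it is left open here.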
Submitted for consideration of whether this is something useful to upstream. This shouldn't be merged as-is (no docs or benchmarks yet.)
This patch enables efficient interconversion between `mpz` and `bytea`, where the `bytea` is interpreted as a "packed big-endian" or "base-256" bitstring representation of an integer.

Our company works to analyze data sourced from Ethereum, where most numeric data is represented as a "uint256" (256-bit unsigned integer) type, usually transmitted serialized as a hex value. We store billions of these uint256 values in our DB. We index them, aggregate over them, and also bulk-`encode(them, 'hex')` for presentation. Sometimes they're actually numbers. Sometimes they're not. Thinking of the raw data as something more like "arbitrary contents of a 32-byte-wide machine vector register" might make more sense.

We have found that storing the data as `numeric`, while efficient for math, is highly inefficient for converting to hex (it's very difficult to write a native base conversion routine between `numeric`'s base-10000 and hex's base-16.) Storing the data as an `mpz` would almost work, but breaks wire compatibility for clients that want to consume the data in its native binary representation (e.g. Elixir's `Postgrex` library.)

Ultimately, we have chosen to store these values in Postgres as raw `bytea`s. This gives us the highest storage efficiency; allows us to use the native, highly-efficient Postgres function `encode(col, 'hex')`; and also is a lossless transformation from the original hex-encoded representation, for cases where the value turns out to be non-numeric (e.g. a packed struct) where we'd want to retain and re-create leading zeroes on encode.

A `bytea` would normally offer no efficient path to math operations, but with this patch, we can cheaply cast `bytea` (base-256) values to `mpz` (base-4294967296), perform the aggregate, and then encode the result as hex (or back to `bytea`).

This has worked exceedingly well for us so far. We have been using this patch in production for around two years now, with no hiccups.

The only issue with it is that it's not upstreamed, so we have to manually build and install our own fork of `pgmp` for every Postgres instance we run!

If you like this code/the idea behind it, let me know what should be done to polish it up and get it ready to contribute. Thanks!

P.S. In our production databases, we have also defined implicit assignment casts between the `numeric` and `bytea` types that take the value through `mpz` as an intermediate representation. The cast from `mpz` to `numeric` is not particularly efficient, but due to GMP's highly-efficient data structures, it seemed to still be cheaper for bulk conversions than the memory access pattern created by the naive direct base-256 to base-10000 conversion routine I wrote as an alternative. (Though, obviously, `pmpq_to_numeric` could probably still be optimized further; it allocates an intermediate string!)

However efficient it is, it's definitely a better option than doing this base-conversion in PL/pgSQL. And that fact, plus having it "built in" to a library that gets packaged by Debian et al, is "good enough" for us, and probably most people. As such, it might make sense to consider having this library also define `bytea` ↔ `numeric` casts, iff the DB doesn't already have them.
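As a rough illustration of the "packed big-endian" (base-256) interpretation described in this PR, the following Python sketch converts fixed-width byte strings to unsigned integers, aggregates them, and encodes the result back to bytes and hex with leading zeroes preserved. It is only a model of the semantics: the helper names `bytea_to_int` and `int_to_bytea` and the default 32-byte width are assumptions made for this example, not functions or casts provided by pgmp.

```python
def bytea_to_int(raw: bytes) -> int:
    """Read a byte string as an unsigned big-endian (base-256) integer."""
    return int.from_bytes(raw, "big")


def int_to_bytea(value: int, width: int = 32) -> bytes:
    """Encode a non-negative integer as `width` big-endian bytes.

    Leading zero bytes are kept, so a uint256 round-trips to 32 bytes.
    Raises OverflowError if the value does not fit in `width` bytes.
    """
    if value < 0:
        raise ValueError("only unsigned values are covered by this sketch")
    return value.to_bytes(width, "big")


# Two uint256 values as they might arrive, hex-encoded with leading zeroes.
values = [
    bytes.fromhex("00" * 31 + "01"),
    bytes.fromhex("00" * 30 + "ffff"),
]

# "Cast" to integers, aggregate, then encode the result for presentation.
total = sum(bytea_to_int(v) for v in values)
total_hex = int_to_bytea(total).hex()
```

Because the width is fixed on the way back out, a value that is really a packed struct rather than a number keeps its leading zero bytes across the round trip, which is the lossless property the description relies on.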