feature(hash): refactor sha256 hash api #953
Conversation
Force-pushed from 8838aba to 5985c27.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
... and 3 files with indirect coverage changes
Force-pushed from 5985c27 to 20694bb.
Looks good to me; just one small comment on a missed `signature_count` being hashed in conformance, and a small style preference.
Force-pushed from 20694bb to be5589e.
pub fn init(data: []const u8) Hash {
    var out: [32]u8 = undefined;
    Sha256.hash(data, &out, .{});
    return .{ .data = out };
}

/// re-hashes the current hash with the mixed-in byte slice(s).
pub fn extendAndHash(self: Hash, data: anytype) Hash {
    return generateSha256(.{ self.data, data });
}

/// Does the same thing as `init`, but updates the hash with each
/// input slice from the `data` list.
pub fn initMany(data: []const []const u8) Hash {
    var new = Sha256.init(.{});
    for (data) |d| new.update(d);
    return .{ .data = new.finalResult() };
}

fn update(hasher: *Sha256, data: anytype) void {
    const T = @TypeOf(data);

    if (T == Hash or T == *const Hash or T == *Hash) {
        hasher.update(&data.data);
    } else if (@typeInfo(T) == .@"struct") {
        inline for (data) |val| update(hasher, val);
    } else if (std.meta.Elem(T) == u8) switch (@typeInfo(T)) {
        .array => hasher.update(&data),
        else => hasher.update(data),
    } else {
        for (data) |val| update(hasher, val);
    }
}

/// re-hashes the current hash with the mixed-in byte slice(s).
pub fn extend(self: Hash, data: []const u8) Hash {
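A non-reflective `extend` could be sketched as follows (a minimal sketch consistent with `initMany`, assuming `Sha256` is `std.crypto.hash.sha2.Sha256`; not necessarily the PR's exact body):

pub fn extend(self: Hash, data: []const u8) Hash {
    var hasher = Sha256.init(.{});
    // Mix the existing 32-byte digest with the new bytes.
    hasher.update(&self.data);
    hasher.update(data);
    return .{ .data = hasher.finalResult() };
}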
I don't see how this API is an improvement. This makes it less ergonomic. What's the benefit? I would prefer to keep it as is.
The reflection is not necessary, introduces a ton of code bloat, and makes it impossible to optimize hashing in any meaningful way. I believe this PR only proves that the reflection is not necessary, given how small a diff was required to conform to exactly `[]const []const u8` when you just think a bit about the code around call sites. We need to stop abstracting every little thing in the codebase.
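To illustrate the call-site change (a hypothetical fragment; `prev_hash`, `payload`, and `other_hash` are made-up names, not from the repo):

// Before: reflection let callers mix types freely in a tuple.
const h_old = prev_hash.extendAndHash(.{ payload, other_hash });
// After: everything conforms to exactly []const []const u8.
const h_new = Hash.initMany(&.{ &prev_hash.data, payload, &other_hash.data });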
> The reflection is not necessary,

I agree.

> introduces a ton of code bloat

You're saying because there are separate versions of this function for every usage?

Let's say we decided to eliminate these functions and forced every call site to directly use the hasher. I don't see code bloat as a major problem with that. Do you think this function introduces more bloat than inlining the usage of the hasher?

> makes it impossible to optimize hashing in any meaningful way.

I don't understand why, but regardless I think this only justifies the change when it coincides with optimizations.

> think a bit about the code around call sites

This is the annoyance I'd like to avoid. I'd rather think about it only the one time that I write this function instead of every time I use it.
> You're saying because there are separate versions of this function for every usage?

Almost all usages create an anonymous tuple over which the function becomes generalized. So pretty much, yes.
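Concretely, a toy stand-in for the reflective API (not the repo's function) shows the instantiation behavior: Zig monomorphizes `anytype` per argument type, so every distinct tuple shape compiles to its own copy of the function in the binary.

const std = @import("std");

// Toy stand-in: each distinct tuple type passed to `data` produces a
// separate specialization of this function in the emitted binary.
fn extendAndHash(data: anytype) void {
    inline for (data) |val| {
        std.debug.print("{s}\n", .{val});
    }
}

pub fn main() void {
    extendAndHash(.{"a"}); // instantiation 1: one-field tuple
    extendAndHash(.{ "a", "bc" }); // instantiation 2: two-field tuple
    extendAndHash(.{ "a", "b", "c" }); // instantiation 3: three-field tuple
}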
> Let's say we decided to eliminate these functions and forced every call site to directly use the hasher. I don't see code bloat as a major problem with that. Do you think this function introduces more bloat than inlining the usage of the hasher?

Yes, it does. As I said above, it creates a new version of the function for nearly all our usages, which bloats the binary and makes inlining the hasher implementation itself nearly impossible (LLVM pretty much always refuses to do it because of how many call sites there are). Using the hasher directly would also cause this problem in most cases, which is specifically why I don't do that. I would honestly prefer it, but since we are looking to maximize performance, having a single entry point (or a couple in our case; we can have specialized ones for PoH, which I was working on a while ago), which is then unrolled properly a single time, is optimal.

> This is the annoyance I'd like to avoid. I'd rather think about it only the one time that I write this function instead of every time I use it.

You want to be thinking about it each time. Hashing is an expensive operation, and there are a lot of hashes that need to be done over the course of the validator's run. Increasing friction at those places is a good thing.

I believe this sort of mindset is one of the main driving factors behind our current performance problems. We are trying to hide everything away behind abstractions: cloning everything, refcounting everything, allocating everything. This leads to the death-by-a-thousand-cuts situation that we are currently in.

We need to start thinking about the performance of every little thing. The validator itself is not a particularly complex piece of software; however, it has a lot of moving parts and nuances. All of them are small and seemingly inexpensive, and they all need to happen. Adding them all up leads to a lot of performance loss.
This is a WIP PR where I focus on improving performance around SHA-256-related things. The first commit reorganizes the API a bit to make it easier to streamline the hashing later on. Single hashes do not need to go through an incremental API and can instead be pipelined when done in a loop (something to be implemented). And when doing mixins with just two inputs of a known size, we can pipeline that as well, performing much faster hashing than what we currently achieve.
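For the two-inputs-of-known-size mixin mentioned above, a minimal sketch of the shape such an entry point could take (names and layout are assumptions, not the PR's code):

const std = @import("std");
const Sha256 = std.crypto.hash.sha2.Sha256;

// Hypothetical fixed-shape mixin: both inputs are 32 bytes, so the whole
// 64-byte message has a statically known size, which a pipelined or
// vectorized SHA-256 implementation could later specialize on instead of
// going through the incremental update API.
fn mixin(a: *const [32]u8, b: *const [32]u8) [32]u8 {
    var buf: [64]u8 = undefined;
    @memcpy(buf[0..32], a);
    @memcpy(buf[32..64], b);
    var out: [32]u8 = undefined;
    Sha256.hash(&buf, &out, .{});
    return out;
}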