Make author and committer date roundtrip #1935
base: main
Conversation
Force-pushed from d190af2 to 47db249.
In terms of performance, the micro benchmark result looks good.

For the macro benchmark, the wall time of the entire run is similar or slightly improved, but one step regressed (see the numbers below).

So if I read this correctly, traversing the graph went from 10.83s down to 10.55s, but estimating hours went from 28ms to 190ms. I would read this as an acceptable trade-off.
Thanks a lot for making this happen, and apologies for the late response.
I did play with `ein t hours` and profiled it. There it showed that the reason for the slowdown seems to be the unicode-space splitting. Doing this…
```diff
diff --git a/gix-actor/src/lib.rs b/gix-actor/src/lib.rs
index 67ce39504..23b9a3f2f 100644
--- a/gix-actor/src/lib.rs
+++ b/gix-actor/src/lib.rs
@@ -98,8 +98,9 @@ pub struct SignatureRef<'a> {
 impl SignatureRef<'_> {
     /// Seconds since unix epoch from the time
     pub fn seconds(&self) -> gix_date::SecondsSinceUnixEpoch {
+        use winnow::stream::AsChar;
         self.time
-            .split(u8::is_ascii_whitespace)
+            .split(|b| b.is_space())
             .next()
             .and_then(|i| i.to_str().ok())
             .and_then(|s| s.parse().ok())
```
…seemed not to make a significant difference, even though I also didn't see it in the profile anymore.

This probably doesn't matter though, as I also realized that turning `time: gix_date::Time` into a `BString` feels wrong, besides being a worst-case change for the downstream, which at this point we have to consider.

I also doubt that this will be it: there is already some special-casing of user names and emails, and flexible parsing of whitespace, which also wouldn't be restored correctly.
Hence I'd think that if it was parsed, there should be some sort of `raw: Option<BString>` field that contains the entire line. Only then will round-tripping work for sure, at least for the signature field.

Doing so will open up the gates to hell when there are mutable fields and a `raw` field that go out of sync, when present.
This makes me question the whole approach, and I'd prefer to say that once the user has a non-`*Ref` version of something, it won't round-trip. Said differently, once the user has potentially changed an object, it will be written 'cleanly', without extra spaces and whatever else was leniently parsed.
This makes me want to understand the use-case more. When does one decode and re-encode an object and expect it to be the same byte for byte? During FSCK? Fair, but then it would actually be interesting that it's non-conforming, and one could show a diff with the lines that don't match anymore.

Is it cherry-picking? Then I'd understand why we are specifically interested in the signatures, even though I wonder if there is much value in writing it out verbatim rather than a 'cleaned' version with non-significant bytes removed.
While I understand in theory, in practice, I think this may be enough for the signature field.
Just to be sure I understand your suggestion precisely, are you advocating for having […]? I see two potential issues with this:
In Mononoke, we implemented a "Git server" that's not backed by the file system, but by a "blobstore" (conceptually, db storage of bytes). It speaks the Git protocol. There are two places we get Git bytes:
It's a choice that may not be absolutely set in stone, but we tend to use the owned version of […].

Anyway, to get back to the issue at hand, I think we should maintain round-tripping between […].

FYI, I'm currently tackling the long tail of repos that have caused issues (failure for us to import them into Mononoke or to clone them from Mononoke). Out of >7000 repos, only ~30 couldn't be imported yet, and with the other PRs I made and this one, there is a single issue left that I'm aware of and that hasn't been addressed yet (a strange way to encode the PGP signature; PR soon). So at least experimentally, we're not too far from handling every edge case. Once all this code is upstream, I'll make sure we can clone each repo again, to catch any regression or omission.
`gix_date::Time` can be parsed in two ways:

* From raw bytes (in the data format Git adheres to)
* From the date format used in the gix config

Add utility constructors for these two cases. When parsing from raw bytes, also support a format like `<sign><HH><MM><SS>`, since this is a situation that happens in the wild and that Git can handle.
By storing the raw bytes from Git instead of parsing them into a `gix_date::Time` from the start, we are able to roundtrip between Git and gix-actor even with creatively formatted Git data. Also add a test case that shows a time offset seen in the wild which would have failed to round-trip without this change.
Use the committer date and author date that are now backed by bytes and interpret these bytes into a `gix_date::Time` on demand.
Force-pushed from 47db249 to 4701f25.
This push was just rebasing and applying the suggestion above.
Sorry for the late reply, I fell off a cliff there, unrelated to this issue which, admittedly, is among the big 'small' problems to solve. However, I think it will also help to define a boundary for what round-tripping means, and to what extent that's possible. Just as a model I have in my head, we have […].

If we break downstream to support more cases of round-tripping, I'd want to do it so that it's done for good. Looking at […]. However, […].
It's true, even though I think it's more subtle as the difference is already in the types themselves.
They live in the original byte buffer.
I see, it's really valuable to see where the requirement is coming from, and I get why it currently is done as it is.
Thanks, that's cool :)! And it's even open-source last time I checked.
What I don't understand is why objects are decoded, without mutating them, only to be re-serialized in order to obtain their hash.

I agree, and right now it seems that […].
I agree.
I am excited about this! For this PR, I think the problem is that it feels too wrong to have […].

I did my best, but apologize if some of what I wrote is off or incoherent, I had like 10 interruptions 😅👧.
This makes me hopeful. Maybe you can spike a case of Mononoke parallelism which doesn't pass […].
No worries at all. I really appreciate the time you are taking to very thoroughly weigh our options and to explain the wider vision so I can fumble in the dark a little less :) Let me suggest three possible ways forward, and I will follow your lead on which one to implement.

Option 1: […]
Thanks for the analysis and for deriving options! Some of these I implicitly discarded, but it's great to see them written out. Bringing up the […]. Thus, let's go ahead with Option 3. I'd still expect that […]. Lastly, if you wanted to truly gift […].

Thanks again, it's my pleasure to work with you.
Store the bytes in raw Git format in the Signature, allowing round-tripping even for creatively formatted Git data.

Whenever we need to interpret the time, parse it on demand.

Note that the most common need for interpreting the time is limited to its `seconds` part. We expose a way to parse the `seconds` without parsing the rest, to avoid unnecessary costs.

Fixes issue 1923.