Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make author and committer date roundtrip #1935

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pierrechevalier83
Copy link
Contributor

Store the bytes in raw Git format in the Signature, allowing round-tripping even for creatively formatted git data.

Whenever we need to interpret the time, parse it on demand.

Note that the most common need for interpreting the time is limited to its seconds part.
We expose a way to parse the seconds without parsing the rest to avoid unnecessary costs.

Fixes issue 1923

@pierrechevalier83 pierrechevalier83 force-pushed the fix_1923 branch 5 times, most recently from d190af2 to 47db249 Compare April 6, 2025 13:11
@pierrechevalier83
Copy link
Contributor Author

In terms of performance, the micro benchmark result looks good:

cargo bench
   Compiling gix-date v0.9.4 (/home/pierrec/Documents/code/gitoxide/gix-date)
   Compiling gix-actor v0.34.0 (/home/pierrec/Documents/code/gitoxide/gix-actor)
   Compiling gix-object v0.48.0 (/home/pierrec/Documents/code/gitoxide/gix-object)
   Compiling gix-pack v0.58.0 (/home/pierrec/Documents/code/gitoxide/gix-pack)
   Compiling gix-odb v0.68.0 (/home/pierrec/Documents/code/gitoxide/gix-odb)
    Finished `bench` profile [optimized] target(s) in 16.02s
     Running unittests src/lib.rs (/home/pierrec/Documents/code/gitoxide/target/release/deps/gix_object-c0f068d08025b61a)

running 9 tests
test commit::message::body::test_parse_trailer::extra_whitespace_before_token_or_value_is_error ... ignored
test commit::message::body::test_parse_trailer::simple_newline ... ignored
test commit::message::body::test_parse_trailer::simple_newline_windows ... ignored
test commit::message::body::test_parse_trailer::simple_non_ascii_no_newline ... ignored
test commit::message::body::test_parse_trailer::with_lots_of_whitespace_newline ... ignored
test data::tests::size_of_object ... ignored
test tag::write::tests::validated_name::invalid::leading_dash ... ignored
test tag::write::tests::validated_name::invalid::only_dash ... ignored
test tag::write::tests::validated_name::valid::version ... ignored

test result: ok. 0 passed; 0 failed; 9 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/decode_objects.rs (/home/pierrec/Documents/code/gitoxide/target/release/deps/decode_objects-7eb5f76b9d6d7b1a)
CommitRef(sig)          time:   [946.96 ns 948.28 ns 949.84 ns]
                        change: [-14.862% -14.276% -13.739%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

CommitRefIter(sig)      time:   [1.0171 µs 1.0186 µs 1.0202 µs]
                        change: [-16.571% -15.811% -15.067%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

TagRef(sig)             time:   [266.43 ns 266.68 ns 266.94 ns]
                        change: [-25.598% -25.104% -24.627%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

TagRefIter(sig)         time:   [249.97 ns 250.41 ns 250.84 ns]
                        change: [-22.893% -22.645% -22.422%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

TreeRef()               time:   [125.78 ns 125.91 ns 126.07 ns]
                        change: [-6.6320% -6.0969% -5.5535%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

TreeRefIter()           time:   [58.774 ns 58.887 ns 59.007 ns]
                        change: [-2.2053% -1.9213% -1.6651%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

     Running benches/edit_tree.rs (/home/pierrec/Documents/code/gitoxide/target/release/deps/edit_tree-5b7b3aa270061123)
editor/small tree (empty -> full -> empty)
                        time:   [3.4845 µs 3.4954 µs 3.5145 µs]
                        thrpt:  [2.8454 Melem/s 2.8609 Melem/s 2.8699 Melem/s]
                 change:
                        time:   [-3.8161% -3.5141% -3.2192%] (p = 0.00 < 0.05)
                        thrpt:  [+3.3263% +3.6421% +3.9675%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high severe
editor/deeply nested tree (empty -> full -> empty)
                        time:   [10.048 µs 10.053 µs 10.057 µs]
                        thrpt:  [4.5739 Melem/s 4.5758 Melem/s 4.5778 Melem/s]
                 change:
                        time:   [-8.9378% -8.0382% -6.8045%] (p = 0.00 < 0.05)
                        thrpt:  [+7.3013% +8.7408% +9.8150%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

cursor/small tree (empty -> full -> empty)
                        time:   [3.6144 µs 3.6167 µs 3.6192 µs]
                        thrpt:  [2.7631 Melem/s 2.7650 Melem/s 2.7667 Melem/s]
                 change:
                        time:   [-1.3927% -1.1702% -0.9816%] (p = 0.00 < 0.05)
                        thrpt:  [+0.9914% +1.1841% +1.4124%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
cursor/deeply nested tree (empty -> full -> empty)
                        time:   [3.8258 µs 3.8282 µs 3.8308 µs]
                        thrpt:  [12.008 Melem/s 12.016 Melem/s 12.024 Melem/s]
                 change:
                        time:   [-0.5253% -0.4351% -0.3505%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3518% +0.4370% +0.5281%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

For the macro benchmark, the wall time of the entire things is similar/slightly improved, but estimate-hours slowed down a lot if I read the results correctly:

⚡ ~/Documents/code/gitoxide/target/release/ein t hours
 14:32:04 traverse commit graph done 1.4M commits in 10.83s (124.7K commits/s)
 14:32:05        estimate-hours Extracted and organized data from 1351062 commits in 28.093709ms (48091264 commits/s)
total hours: 1281489.00
total 8h days: 160186.12
total commits = 1351062
total authors: 35615
total unique authors: 27286 (23.39% duplication)
  • On this branch:
⚡ ~/Documents/code/gitoxide/target/release/ein t hours
 14:33:57 traverse commit graph done 1.4M commits in 10.55s (128.0K commits/s)
 14:33:59        estimate-hours Extracted and organized data from 1351062 commits in 190.271614ms (7100702 commits/s)
total hours: 1281489.00
total 8h days: 160186.12
total commits = 1351062
total authors: 35615
total unique authors: 27286 (23.39% duplication)

So if I read this correctly, traversing the graph went from 10.83s down to 10.55s but estimate hours went from 28ms to 190ms.

I would read this as an acceptable trade-off.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for making this happen, and apologies for the late response.

I did play with ein t hours and profiled it. There it showed that the reason for the slowdown seems to be the unicode-space splitting. Doing this…

diff --git a/gix-actor/src/lib.rs b/gix-actor/src/lib.rs
index 67ce39504..23b9a3f2f 100644
--- a/gix-actor/src/lib.rs
+++ b/gix-actor/src/lib.rs
@@ -98,8 +98,9 @@ pub struct SignatureRef<'a> {
 impl SignatureRef<'_> {
     /// Seconds since unix epoch from the time
     pub fn seconds(&self) -> gix_date::SecondsSinceUnixEpoch {
+        use winnow::stream::AsChar;
         self.time
-            .split(u8::is_ascii_whitespace)
+            .split(|b| b.is_space())
             .next()
             .and_then(|i| i.to_str().ok())
             .and_then(|s| s.parse().ok())

Seemed to not make a significant difference, even though I also didn't see it in the profile anymore.

This probably doesn't matter though as I also realized that turning time: gix_date::Time into a BString feels wrong, besides being a worst-case change for the downstream which at this point we have to consider.

I also doubt that this will be it, i.e. there is already some special casing of user names and emails, and flexible parsing of whitespace, which also wouldn't be restored correctly.
Hence I'd think that if it was parsed, there should be some sort of raw: Option<BString> field that contains the entire line. Only then round-tripping will work for sure, at least with the signature field.
Doing so will open up gates to hell when there is mutable fields and a raw field that go out of sync, when present.

This makes me question the whole approach, and I'd prefer to say that once the user has a non *Ref version of something, it won't round-trip. Or said differently, once the user has potentially changed an object, it will be written 'cleanly' without extra spaces and whatever else was leniently parsed.

This makes me want to understand the use-case more. When does one decode and re-encode an object, and expect it to be the same byte for byte? During FSCK? Fair, but then it should actually be interesting that it's non-conforming and one could show a diff with the lines that don't match anymore.
Is it cherry-picking? Then I'd understand why we are specifically interested in the signatures, even though I wonder if there is much value in not writing it out verbatim, and not a 'cleaned' version, with non-significant bytes removed.

@pierrechevalier83
Copy link
Contributor Author

I also realized that turning time: gix_date::Time into a BString feels wrong, besides being a worst-case change for the downstream which at this point we have to consider.

I also doubt that this will be it, i.e. there is already some special casing of user names and emails, and flexible parsing of whitespace, which also wouldn't be restored correctly.
Hence I'd think that if it was parsed, there should be some sort of raw: Option field that contains the entire line. Only then round-tripping will work for sure, at least with the signature field.
Doing so will open up gates to hell when there is mutable fields and a raw field that go out of sync, when present.

While I understand in theory, in practice, I think this may be enough for the signature field.
In the > 7000 repos we host, we currently have only two exceptions to what gitoxide can handle without this change:

  • The date being of form: "+123400"
  • The date being of form: "+123456" (non-zero seconds)
    So it looks like we should be OK with 3 BStrings.

This makes me question the whole approach, and I'd prefer to say that once the user has a non *Ref version of something, it won't round-trip. Or said differently, once the user has potentially changed an object, it will be written 'cleanly' without extra spaces and whatever else was leniently parsed.

Just to be sure I understand your suggestion precisely, are you advocating for having ObjectRef round-trip with Git, but Object not round-trip with ObjectRef as a preferable situation?

I see two potential issues with this:

  • It feels unintuitive to loose the equivalence between the owned and the non-owned version. The contract being that they can round trip, and the details of ownership being a pure performance consideration makes more sense to me as a user.
  • In terms of ownership, say Object has a sane typed representation of time but ObjectRef needs a BStr. Where do the BStr's owned bytes live?

This makes me want to understand the use-case more. When does one decode and re-encode an object, and expect it to be the same byte for byte? During FSCK? Fair, but then it should actually be interesting that it's non-conforming and one could show a diff with the lines that don't match anymore.

In Mononoke, we implemented a "Git server" that's not backed by the file system, but by a "blobstore" (conceptually, db storage of bytes). It speaks the Git protocol. There are two places we get Git bytes:

  • From http via the Git protocol when the Git client communicates with us
  • We store the Git representation of objects in the blobstore
    For everything else internally, we use gix-object types to represent the actual structure and decide what to do.

It's a choice that may not be absolutely set-in-stone, but we tend to use the owned version of gix-object types by default mainly for convenience, to avoid having to worry about lifetimes when we do largely parallel things. Arguably, we could store the raw bytes and the ref objects in the same struct and stick it all in an Arc whenever we need to do parallel things, but that's not currently what we do.
Our code is not quite as performance sensitive as something like a Git client as we can make trade-offs like throwing hardware at it and the architecture of Mononoke tends to focus on bounding the worst-case computations and mem usage (by sharding unbounded things like files) and trading off storage for compute by pre-computing everything that may be needed at runtime (for instance, the blame asynchronously outside of the write path), so most of what the mononoke_git_server does is fetch bytes from storage and stream them to clients at a high level.
We also clearly make trade-offs of the kind: better anything now than the best thing later as the current system that is backed by Git is much more of a hassle to maintain than even a naive implementation of the Mononoke Git server (and we can optimize once we hit bottlenecks in prod).
For us, blobstore storage is indexed by the Git hash, so if the hash of the gix-object that we are manipulating doesn't match the hash from Git (the one the client asks us about), we get issues.

Anyway, to get back to the issue at hand, I think we should maintain round-tripping between Object and ObjectRef if possible.
We could do it in a stricter way than with BString BStr, for instance by only adding an optional seconds field to gix-date::Time; but it feels like a decision that we may regret if at some point we find another slight variation on how Git may encode the time.
A mix of Object using Time and ObjectRef using BStr would only work with a backing of an array of bytes and that doesn't feel great to me, but I guess it's an option.
Or there is the way in this PR, which I think is practical.
For one Option<BString> for the entire signature, it may turn out to be necessary ultimately, but I haven't seen evidence for it in practice, so I'd rather close that bridge if and when we get there.

FYI, I'm currently tackling the long tail of repos that have caused issues (failure for us to import them in Mononoke or to clone them from Mononoke) out of >7000 repos, only ~30 couldn't be imported yet, and with the other PRs I made and this one, there is a single issue left that I'm aware of and hasn't been adressed yet (a strange way to encode the PGP signature. PR soon).

So at least experimentally, we're not too far from handling every edge case. Once all this code is upstream, I'll make sure we can clone each repo again, to catch any regression or omission.

Pierre Chevalier and others added 4 commits April 8, 2025 14:12
`gix_date::Time` can be parsed in two ways:
* From raw bytes (in the data format Git adheres too)
* From the date format used in gix config

Add utility constructors for these two cases.

When parsing from raw bytes, also support a format like `<sign><HH><MM><SS>`
since this is a situation that happens in the wild and that Git can handle.
By storing the raw bytes from Git instead of parsing them into a
`gix_date::Time` from the start, we are able to roundtrip between Git
and gix-actor even with creatively formatted Git data.

Also add a test case that shows a time offset seen in the wild which
would have failed to round-trip without this change.
Use the committer date and author date that are now backed by bytes and
interpret these bytes into a `gix_date::Time` on demand.
@pierrechevalier83
Copy link
Contributor Author

This push was just rebasing and applying the suggestion above:

-            .split(u8::is_ascii_whitespace)
+            .split(|b| b.is_space())

@Byron
Copy link
Member

Byron commented Apr 11, 2025

Sorry for the late reply, I feel off a cliff there, unrelated to this issue which, admittedly, is among the big 'small' problems to solve. However, I think it will also help to define a boundary for what round-tripping means, and to what extend that's possible.

Just as a model I have in my head, we have

  • decompressed object buffer
    • a &[u8] of bytes that represent the object, it's what is returned by the object database
    • It's also what is stored in gix::Object, it's entirely unparsed.
  • decoded object (*Ref)
    • These are the gix_object::Object variants, where everything is parsed into fields. These fields still refer to the original bytes buffer.
    • it goes as far as having references to the original hex version of object ids for instance, merely as matter of convenience.

If we break downstream to support more cases of round-tripping, I'd want to do it so that it's done for good.

Looking at SignatureRef I also think that time: &BStr would be the way to go, then downstream would call time() to parse it on demand, similar to how it's already done everywhere else.

However, Signature would still contain the fully parsed types.

Just to be sure I understand your suggestion precisely, are you advocating for having ObjectRef round-trip with Git, but Object not round-trip with ObjectRef as a preferable situation?

I see two potential issues with this:

  • It feels unintuitive to loose the equivalence between the owned and the non-owned version. The contract being that they can round trip, and the details of ownership being a pure performance consideration makes more sense to me as a user.

It's true, even though I think it's more subtle as the difference is already in the types themselves. CommitRef for instance uses a BStr for the tree header field, but Commit uses ObjectId. They are equivalent, but the BStr can technically represent anything you would want. Practically, when writing, there is validation, but in case of SignatureRef there most characters are actually allowed.
Probably not a strong argument, but in short, it's already the case and I think it's natural (enough).

  • In terms of ownership, say Object has a sane typed representation of time but ObjectRef needs a BStr. Where do the BStr's owned bytes live?

They live in the original byte buffer.

It's a choice that may not be absolutely set-in-stone, but we tend to use the owned version of gix-object types by default mainly for convenience, to avoid having to worry about lifetimes when we do largely parallel things. Arguably, we could store the raw bytes and the ref objects in the same struct and stick it all in an Arc whenever we need to do parallel things, but that's not currently what we do.

I see, it's really valuable to see where the requirement is coming from, and I get why it currently is done as it is.
ein t hours has a similar problem, as it has to do parallel computation on commits. The way it does it is to produce byte buffers of commits, which are sent verbatim to the consumers where they are decoded. It's very natural, too, and there are no lifetime problems.
Of course, no matter what, if there would be modifications to an object, it will be written as sanitized version at least if one passes it through an owned decoded object.

In Mononoke, we implemented a "Git server" that's not backed by the file system, but by a "blobstore" (conceptually, db storage of bytes). It speaks the Git protocol. There are two places we get Git bytes:

  • From http via the Git protocol when the Git client communicates with us
  • We store the Git representation of objects in the blobstore
    For everything else internally, we use gix-object types to represent the actual structure and decide what to do.

It's a choice that may not be absolutely set-in-stone, but we tend to use the owned version of gix-object types by default mainly for convenience, to avoid having to worry about lifetimes when we do largely parallel things. Arguably, we could store the raw bytes and the ref objects in the same struct and stick it all in an Arc whenever we need to do parallel things, but that's not currently what we do. Our code is not quite as performance sensitive as something like a Git client as we can make trade-offs like throwing hardware at it and the architecture of Mononoke tends to focus on bounding the worst-case computations and mem usage (by sharding unbounded things like files) and trading off storage for compute by pre-computing everything that may be needed at runtime (for instance, the blame asynchronously outside of the write path), so most of what the mononoke_git_server does is fetch bytes from storage and stream them to clients at a high level. We also clearly make trade-offs of the kind: better anything now than the best thing later as the current system that is backed by Git is much more of a hassle to maintain than even a naive implementation of the Mononoke Git server (and we can optimize once we hit bottlenecks in prod).

Thanks, that's cool :)! And it's even open-source last time I checked.

For us, blobstore storage is indexed by the Git hash, so if the hash of the gix-object that we are manipulating doesn't match the hash from Git (the one the client asks us about), we get issues.

Here I don't understand is why objects are decoded, without mutating them, to then be re-serialized in order to obtain their hash.
In my thinking, when getting a mutable/owned version of an object, one wants to mutate it, so getting a different hash and writing clean values are both expected.

We could do it in a stricter way than with BString BStr, for instance by only adding an optional seconds field to gix-date::Time; but it feels like a decision that we may regret if at some point we find another slight variation on how Git may encode the time.

I agree, and right now it seems that SignatureRef would be fine with time being a BStr to be able to capture everything.

For one Option<BString> for the entire signature, it may turn out to be necessary ultimately, but I haven't seen evidence for it in practice, so I'd rather close that bridge if and when we get there.

I agree.

FYI, I'm currently tackling the long tail of repos that have caused issues (failure for us to import them in Mononoke or to clone them from Mononoke) out of >7000 repos, only ~30 couldn't be imported yet, and with the other PRs I made and this one, there is a single issue left that I'm aware of and hasn't been adressed yet (a strange way to encode the PGP signature. PR soon).

So at least experimentally, we're not too far from handling every edge case. Once all this code is upstream, I'll make sure we can clone each repo again, to catch any regression or omission.

I am excited about this!

For this PR, I think the problem is that it feels too wrong to have time: BString in Signature, but it's OK to have in SignatureRef. Doing may require you to change some parts of the server implementation which pass around owned versions of these types, but I think that changing it will not make it less clear overall. In the simplest case, one would hand full byte buffers to the consumer, which decodes the object there.

I did my best but apologize if some of what I wrote is off or incoherent, I had like 10 interruptions 😅👧.

A mix of Object using Time and ObjectRef using BStr would only work with a backing of an array of bytes and that doesn't feel great to me, but I guess it's an option.

This makes me hopeful. Maybe you can spike a case of mononoke parallelism which doesn't pass Object across threads, but plain buffers - an owned version of gix_object::Data<'buf> should be all you need.

@pierrechevalier83
Copy link
Contributor Author

Sorry for the late reply, I feel off a cliff there, unrelated to this issue which, admittedly, is among the big 'small' problems to solve.

No worries at all. I really appreciate the time you are taking to very thoroughly weigh our options and to explain the wider vision so I can fumble in the dark a little less :)
I think we are getting to a common understanding of this problem space.

Let me suggest 3 possible way forwards, and I will follow your lead on which one to implement:

Option 1: gix_date::Time is made to roundtrip and backs both SignatureRef and Signature.

This would consist of making an enum: OffsetFormat with maybe HHMM and HHMMSS variants and stick it in gix-date::Time. It would be analogous to the Sign enum which currently exists to distinguish between +0000 and -0000 offsets.
There may be a tiny performance loss compared to main, but I think it would be acceptable for the improved "correctness".

Option 2: Heterogeneous Signature with roundtrip

We implement the fix described in Option 1, which applies to Signature, but we back SignatureRef with a &BStr for "future-proofing" of handling new weird cases that may show up. We get roundtripping today, but no guarantee of roundtripping without further modification for the owned type in case we hit more exotic scenarios in the wild.

Option 3: Heterogenous Signature without roundtrip

  • SignatureRef keeps having a &BStr and .time() for parsing on demand as in this PR so far
  • Signature goes back to having a gix-date::Time which is unmodified

In all cases, Mononoke would need to tend towards a state where we use the Ref version of objects backed by the raw bytes, but options 1 and 2 would have the advantage of giving us more flexibility on when exactly to make that transition as we would get round-tripping initially, so both would be preferable from my point of view.
From my understanding of the vision for gitoxide that you've laid out above and the analogy with Commit and ObjectId, I think that options 2 and 3 would both fit well within the wider vision.

What do you think? Which one should I implement?

@Byron
Copy link
Member

Byron commented Apr 13, 2025

Thanks for the analysis and for deriving options! Some of these I implicitly discarded, but it's great to see them written out.

Bringing up the sign field in gix_date::Time is great because I think it's a mistake, created in a time when the purpose of *Ref and the meaning for round-tripping weren't sketched out yet. The idea is that the fully-owned versions would always serialize to a 'standard' version of an object, whereas buffer-backed decoded objects should round-trip. Right now, this most definitely isn't the case for some variants of name/emails (and there are tests for these which are sourced from the real world), but dealing with this can be done once there is demand.

Thus, let's go ahead with Option 3. I'd still expect that SignatureRef::time() is still infallible if the object parsed in the first place, but it seems there is no way to know since a user could set myref.time = "foo" and call time() on it. I will go with your intuition for that one, which should also be relevant for SignatureRef::to_owned().

Lastly, if you wanted to truly gift gitoxide, I'd love it if Time::sign could go away in favor of round-tripping through SignatureRef - I am sure Git has a choice on the sign it uses for 0 and the Time serialization would just stick to that.

Thanks again, it's my pleasure to work with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants