Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status of Tree.cache and TreeModifier is unclear #1771

Open
EliahKagan opened this issue Dec 13, 2023 · 6 comments
Open

Status of Tree.cache and TreeModifier is unclear #1771

EliahKagan opened this issue Dec 13, 2023 · 6 comments

Comments

@EliahKagan
Copy link
Contributor

EliahKagan commented Dec 13, 2023

Note: This bug seems either mitigated or fixed in #1799, and the writeup here does not reflect those changes.

In b295c13, Tree.cache was documented in CHANGES.rst as removed, based on the discovery in #369 that it could not serve its originally intended purpose. But Tree.cache, and its supporting code such as the TreeModifier class, were not removed. They remain part of the public interface, are not indicated as deprecated, and nothing in any docstrings counsels against their use.

I am unsure, based on this, if there is any reasonable use of them. My guess is that they may have been retained so that code that calls them without relying on the effect would not be broken, but I am not sure. If that is the case, it could be documented, DeprecationWarnings could be raised, and some of the non-public code may be possible to remove.

For example, even if TreeModifier.set_done needs to perform the sort, and do so with exactly the current effect, both the git_cmp and merge_sort functions could probably be eliminated entirely, since they were introduced to do something that was cumbersome (at least at the time, supporting Python 2 and 3) with the list.sort method. But this is no longer the case ever since 4f1110b. If I understand, this change made it no longer sort longer sequences first:

-    return len_b - len_a
+    return len_a - len_b
@Byron
Copy link
Member

Byron commented Dec 13, 2023

Thanks for bringing this up.

Since I don't have any knowledge anymore about this implementation except that it is most likely incorrect and should rather not be used for anything serious, what's left to discuss, I presume, is if code should be deprecated so it can eventually be removed.

Maybe it would be worth trying to find it if it is usable for something after all (are there no tests for this?), or one could pre-emptively deprecate it to see if an outcry follows.

In general, I'd always be for removing code, especially if its effectiveness is unclear, but I'd also not want to risk breakage in case we are missing something.

As always, I'd rely on your expertise and analysis, with a strong slant towards following your suggestions.

@EliahKagan
Copy link
Contributor Author

Maybe it would be worth trying to find it if it is usable for something after all (are there no tests for this?)

Currently there are no tests, as they were removed in b295c13, which possibly was also intended to remove the code under test, I am not sure. Some of that test code could potentially be brought back in some capacity (or just experimented with on a feature branch to try and figure out what, if anything, Tree.cache and TreeModifier might be used for). I don't know that they were able to pass: they appear to have been expanded as part of a change made to verify that Tree.cache and TreeModifier couldn't achieve their once hoped-for goal.

This is not necessarily the most important aspect of the situation, but if people are using this, and using it with a lot of data, then a possible benefit of replacing the hand-rolled mergesort implementation with a call to list.sort (which should be feasible without complicated code or big overhead, ever since the behavior of sorting longer names before their prefixes was abandoned) is that list.sort in CPython is Timsort, which is highly adaptive: sorting data that are already almost sorted takes only linear time. The overhead of re-sorting seems to have been a concern in guiding usage:

def set_done(self) -> "TreeModifier":
"""Call this method once you are done modifying the tree information.
This may be called several times, but be aware that each call will cause
a sort operation.
:return self:
"""
merge_sort(self._cache, git_cmp)
return self

Anyway, I don't propose that any such change be rushed into.

@EliahKagan
Copy link
Contributor Author

In b295c13, this was added to the changelog:

  • CRITICAL: Tree.cache was removed without replacement. It is technically impossible to change individual trees and expect their serialization results to be consistent with what git expects. Instead, use the IndexFile facilities to adjust the content of the staging area, and write it out to the respective tree objects using IndexFile.write_tree() instead.

That commit was referenced in closing #369. Recently, work picked back up on this in #1799, which changed the sorting to use a custom key selector function that performs the comparison in a different way. Since then, this issue is out of date, but I am unsure if it is fully fixed.

I think the the point for this issue, and for improving documentation of TreeModifier and what advice should be given about whether and when to use it, is:

It is technically impossible to change individual trees and expect their serialization results to be consistent with what git expects.

What is the reason that was the case, and is it not still the case?

@Byron
Copy link
Member

Byron commented Jan 24, 2024

Unfortunately, I don't know and assume the git commit log also doesn't clarify it. I don't believe it's impossible either, but also don't believe that these facilities should be used - I think they can't be trusted and using them may proliferate subtle or (hopefully) less subtle breakage in git data structures when written to the object database.

If they were to be fixed, I'd probably focus on what GitPython was originally meant to be, that is a wrapper around the git executable, and implement all slightly more complex functionality by invoking git instead. That way it's probably slower particularly for many small trees, but at least it will definitely be correct.

With all that said, it's probably still unclear what to do here, apologies :/.

@ghost
Copy link

ghost commented Mar 15, 2024

I've read the previous codes and docs before 1.0.2 - Fixes
But I'm still wondering what usage you expect originally meant to be in TreeModifier
mayberepo.head.commit.tree[0].cache.add(sha, mode, name)
or repo.head.commit.tree[0].cache.add([file_name]) just like repo.index.add([file_name]) but it's in subtree
can you provide a simple usage you expect originally ?

@Byron
Copy link
Member

Byron commented Mar 15, 2024

I am sorry, my memory has eroded over the last decade. However, the code in question should have tests which might make matters clearer. If there are no tests, then I think it's best to deprecate and discourage usage entirely, as I also think it's not really worth spending time on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants