Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more checks for the validity of refnames #1672

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 48 additions & 2 deletions git/refs/symbolic.py
Original file line number Diff line number Diff line change
Expand Up @@ -161,15 +161,61 @@ def dereference_recursive(cls, repo: "Repo", ref_path: Union[PathLike, None]) ->
return hexsha
# END recursive dereferencing

@staticmethod
def _check_ref_name_valid(ref_path: PathLike) -> None:
# Based on the rules described in https://git-scm.com/docs/git-check-ref-format/#_description
previous: Union[str, None] = None
one_before_previous: Union[str, None] = None
for c in str(ref_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid converting to str? I assume this tries to decode ref_path with the current string encoding, which changes depending on the interpreter or user configuration and generally causes a lot of trouble.

Copy link
Contributor

@EliahKagan EliahKagan Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless this PR worsens that problem in some way, which I believe it does not, I would recommend it be fixed separately and later. The code this is replacing already had:

if ".." in str(ref_path):
raise ValueError(f"Invalid reference '{ref_path}'")

But actually even that neither introduced nor exacerbated the problem. From the commit prior to #1644 being merged:

@classmethod
def _get_ref_info_helper(
cls, repo: "Repo", ref_path: Union[PathLike, None]
) -> Union[Tuple[str, None], Tuple[None, str]]:
"""Return: (str(sha), str(target_ref_path)) if available, the sha the file at
rela_path points to, or None. target_ref_path is the reference we
point to, or None"""
tokens: Union[None, List[str], Tuple[str, str]] = None
repodir = _git_dir(repo, ref_path)
try:
with open(os.path.join(repodir, str(ref_path)), "rt", encoding="UTF-8") as fp:

Note how str(ref_path) was passed to os.path.join, which when given strs returns a str, thus a str was being passed to open. Note also that, while this str call was actually redundant (os.path.join accepts path-like objects since Python 3.6), even it was not the cause of str and not bytes being used. The annotation on ref_path is Union[PathLike, None], where PathLike is:

PathLike = Union[str, "os.PathLike[str]"]

Where both alternatives--str and os.PathLike[str]--represent text that has already been decoded.

So unless I'm missing something--which I admit I could be--I don't think it makes conceptual sense to do anything about that in this pull request. Furthermore, unless the judgment that CVE-2023-41040 was a security vulnerability was mistaken, or something about the variation explicated in #1644 (comment) is less exploitable, it seems to me that this pull request is fixing a vulnerability. Assuming that is the case, then I think this should avoid attempting to make far-reaching changes beyond those that pertain to the vulnerability, and that although reviewing these changes for correctness should not be rushed, other kinds of delays should be mostly avoided. With good regression tests included, as seems to be the case, the code could be improved on later in other ways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the thorough assessment, I wholeheartedly agree.

The 'how to handle paths correctly' issue is definitely one of the big breaking points in GitPython, but maybe, for other reasons, this wasn't ever a problem here.

Knowing this is on your radar, maybe one day there will be a solution to it. gitoxide already solves this problem, but it's easier when you have an actual type system and a standard library that makes you aware every step of the way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ultimate goal to support both str-based and bytes-based ref names and paths?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is correctness, and it's vital that one doesn't try to decode paths to fit some interpreter-controlled encoding. Paths are paths, and if you are lucky, they can be turned into bytes. On Unix, that's always possible and a no-op, but on windows it may require a conversion. It's just the question how these things are supposed to work in python.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this relate (conceptually, I mean) to the issue in rust-lang/rust#12056?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A great find :) - yes, that's absolutely related. gitoxide internally handles git-paths as bundles of bytes without known encoding, and just like git, it assumes at least ASCII. Conversions do happen but they all go through gix-path to have a central place for it.

Doing something like it would be needed here as well, even though I argue that before that happens universally, there should be some clear definition of what GitPython is supposed to be.

When I took it over by contributing massively, just like you do now, I needed more control for the use-case I had in mind, and started implementing all these sloppy pure-python components that don't even get the basics right. With that I turned GitPython into some strange hybrid which I think didn't do it any good besides maybe being a little faster for some usecases. After all, manipulating an index in memory has advantages, but there are also other ways to do it while relying on git entirely.

Maybe this is thinking a step too far, but I strongly believe that the true benefit of GitPython is to be able to call git in a simple manner and to be compliant naturally due to using git directly. This should be its identity.

But then again, it's worth recognizing that changing the various pure-python implementations to us git under the hood probably isn't possible in a non-breaking way.

Another avenue would be to try and get the APIs to use types that don't suffer from encoding/decoding issues related to Paths, and then one day make the jump to replacing the pure-python implementations with the python bindings of gitoxide.

if c in " ~^:?*[\\":
raise ValueError(
f"Invalid reference '{ref_path}': references cannot contain spaces, tildes (~), carets (^),"
f" colons (:), question marks (?), asterisks (*), open brackets ([) or backslashes (\\)"
)
elif c == ".":
if previous is None or previous == "/":
raise ValueError(
f"Invalid reference '{ref_path}': references cannot start with a period (.) or contain '/.'"
)
elif previous == ".":
raise ValueError(f"Invalid reference '{ref_path}': references cannot contain '..'")
elif c == "/":
if previous == "/":
raise ValueError(f"Invalid reference '{ref_path}': references cannot contain '//'")
elif previous is None:
raise ValueError(
f"Invalid reference '{ref_path}': references cannot start with forward slashes '/'"
)
elif c == "{" and previous == "@":
raise ValueError(f"Invalid reference '{ref_path}': references cannot contain '@{{'")
elif ord(c) < 32 or ord(c) == 127:
raise ValueError(f"Invalid reference '{ref_path}': references cannot contain ASCII control characters")

one_before_previous = previous
previous = c

if previous == ".":
raise ValueError(f"Invalid reference '{ref_path}': references cannot end with a period (.)")
elif previous == "/":
raise ValueError(f"Invalid reference '{ref_path}': references cannot end with a forward slash (/)")
elif previous == "@" and one_before_previous is None:
raise ValueError(f"Invalid reference '{ref_path}': references cannot be '@'")
elif any([component.endswith(".lock") for component in str(ref_path).split("/")]):
raise ValueError(
f"Invalid reference '{ref_path}': references cannot have slash-separated components that end with"
f" '.lock'"
)

@classmethod
def _get_ref_info_helper(
cls, repo: "Repo", ref_path: Union[PathLike, None]
) -> Union[Tuple[str, None], Tuple[None, str]]:
"""Return: (str(sha), str(target_ref_path)) if available, the sha the file at
rela_path points to, or None. target_ref_path is the reference we
point to, or None"""
if ".." in str(ref_path):
raise ValueError(f"Invalid reference '{ref_path}'")
if ref_path:
cls._check_ref_name_valid(ref_path)

tokens: Union[None, List[str], Tuple[str, str]] = None
repodir = _git_dir(repo, ref_path)
try:
Expand Down
36 changes: 36 additions & 0 deletions test/test_refs.py
Original file line number Diff line number Diff line change
Expand Up @@ -631,3 +631,39 @@ def test_refs_outside_repo(self):
ref_file.flush()
ref_file_name = Path(ref_file.name).name
self.assertRaises(BadName, self.rorepo.commit, f"../../{ref_file_name}")

def test_validity_ref_names(self):
check_ref = SymbolicReference._check_ref_name_valid
# Based on the rules specified in https://git-scm.com/docs/git-check-ref-format/#_description
# Rule 1
self.assertRaises(ValueError, check_ref, ".ref/begins/with/dot")
self.assertRaises(ValueError, check_ref, "ref/component/.begins/with/dot")
self.assertRaises(ValueError, check_ref, "ref/ends/with/a.lock")
self.assertRaises(ValueError, check_ref, "ref/component/ends.lock/with/period_lock")
# Rule 2
check_ref("valid_one_level_refname")
# Rule 3
self.assertRaises(ValueError, check_ref, "ref/contains/../double/period")
# Rule 4
for c in " ~^:":
self.assertRaises(ValueError, check_ref, f"ref/contains/invalid{c}/character")
for code in range(0, 32):
self.assertRaises(ValueError, check_ref, f"ref/contains/invalid{chr(code)}/ASCII/control_character")
self.assertRaises(ValueError, check_ref, f"ref/contains/invalid{chr(127)}/ASCII/control_character")
# Rule 5
for c in "*?[":
self.assertRaises(ValueError, check_ref, f"ref/contains/invalid{c}/character")
# Rule 6
self.assertRaises(ValueError, check_ref, "/ref/begins/with/slash")
self.assertRaises(ValueError, check_ref, "ref/ends/with/slash/")
self.assertRaises(ValueError, check_ref, "ref/contains//double/slash/")
# Rule 7
self.assertRaises(ValueError, check_ref, "ref/ends/with/dot.")
# Rule 8
self.assertRaises(ValueError, check_ref, "ref/contains@{/at_brace")
# Rule 9
self.assertRaises(ValueError, check_ref, "@")
# Rule 10
self.assertRaises(ValueError, check_ref, "ref/contain\\s/backslash")
# Valid reference name should not raise
check_ref("valid/ref/name")
Loading