Repo Integrity Mismatch #236

henrikplate · 2023-05-11T09:10:47Z

This check closely resembles what we attempted a few years back: comparing the (Python) files in a PyPI package with the corresponding files in the source code repo. More specifically, we tried to identify individual lines and checked whether they contained suspicious Python calls.

However, my takeaway from our experiments was that there are many differences, which makes such checks very noisy.

From the paper "LastPyMile: identifying the discrepancy between sources and packages": "Figure 5 shows that 65% of artifacts and 22% of files present in PyPI have changes with respect to the source code repository."

Would it be possible to share your feedback on the check's precision?

Cheers, Henrik

PS: You can find the PDF also on Google Scholar.

christophetd · 2023-05-11T10:27:33Z

Hello, thanks for the great question!

We did find the check noisy at first, which is why we only take into account a narrower, more opinionated set of cases:

Exclude some file extensions: https://github.com/DataDog/guarddog/blob/main/guarddog/analyzer/metadata/pypi/repository_integrity_mismatch.py#L133
Only flag files that exist both on GitHub and in the package tarball but whose hashes differ
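
For illustration, the two filters above could be sketched roughly as follows. This is not GuardDog's actual implementation: the extension list, the function names, and the assumption that both the repository and the package tarball are already unpacked into local directories are all hypothetical.

```python
import hashlib
from pathlib import Path

# Hypothetical exclusion list; the real one is in the GuardDog source linked above.
EXCLUDED_EXTENSIONS = {".md", ".rst", ".txt", ".cfg", ".toml"}


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mismatched_files(repo_dir: Path, package_dir: Path) -> list[str]:
    """List relative paths present in both trees whose contents differ."""
    mismatches = []
    for pkg_file in sorted(package_dir.rglob("*")):
        # Skip directories and excluded extensions.
        if not pkg_file.is_file() or pkg_file.suffix in EXCLUDED_EXTENSIONS:
            continue
        rel = pkg_file.relative_to(package_dir)
        repo_file = repo_dir / rel
        # Files that exist only in the package are ignored on purpose.
        if not repo_file.is_file():
            continue
        if sha256_of(pkg_file) != sha256_of(repo_file):
            mismatches.append(rel.as_posix())
    return mismatches
```

Ignoring files that exist on only one side trades recall for precision: generated files and build artifacts no longer raise findings, but neither does a malicious file added only to the package.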

@vdeturckheim was the original implementer, in case he wants to give more context. Overall we acknowledge that this check is a heuristic and by no means perfect, but your feedback/thoughts are welcome!

henrikplate · 2023-05-11T15:11:19Z

It would make sense to combine the checks you already have to further reduce noise. For example, you could run your semgrep rules (maybe even more relaxed ones) only on those files that differ between package and repo. Using the line number info from the semgrep results, you could then keep only those findings that concern code existing solely in the package.
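
A rough sketch of that line-level filtering, assuming each finding has already been reduced to a dict with hypothetical `start_line` and `end_line` keys (1-based, inclusive), and that the repo and package versions of a file are available as strings. It uses `difflib` from the standard library to work out which package lines have no counterpart in the repo:

```python
import difflib


def package_only_lines(repo_text: str, package_text: str) -> set[int]:
    """Return 1-based line numbers of package lines that are absent from the repo version."""
    repo_lines = repo_text.splitlines()
    pkg_lines = package_text.splitlines()
    matcher = difflib.SequenceMatcher(a=repo_lines, b=pkg_lines, autojunk=False)
    only_in_package: set[int] = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark package lines with no equal repo line.
        if tag in ("insert", "replace"):
            only_in_package.update(range(j1 + 1, j2 + 1))
    return only_in_package


def filter_findings(findings: list[dict], repo_text: str, package_text: str) -> list[dict]:
    """Keep only findings that overlap at least one package-only line."""
    changed = package_only_lines(repo_text, package_text)
    return [
        f
        for f in findings
        if any(n in changed for n in range(f["start_line"], f["end_line"] + 1))
    ]
```

A finding that merely touches unchanged code shared with the repo is dropped, so only matches on code introduced in the package remain.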

christophetd added the question Further information is requested label May 11, 2023
