This check closely resembles what we attempted a few years back: comparing the (Python) files in a PyPI package with the corresponding files in the source code repo. More specifically, we tried to identify individual lines and checked whether they contained suspicious Python calls.
However, my takeaway from our experiments was that there are many differences between package and repo, which makes such checks very noisy.
Only flag files that are on GitHub and in the package tarball but don't have the same hash
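For illustration, the rule above could look roughly like the sketch below. This is not the actual implementation of the check: the function name `flag_mismatched` and the path-to-bytes dictionaries are hypothetical, and SHA-256 is just an assumed choice of hash.

```python
import hashlib


def flag_mismatched(repo_files: dict[str, bytes], pkg_files: dict[str, bytes]) -> list[str]:
    """Return paths present in BOTH inputs whose contents hash differently.

    Both arguments map a relative path to the raw file contents
    (a hypothetical shape chosen for this sketch). Files that exist on
    only one side are deliberately not flagged.
    """
    flagged = []
    # Only consider paths that exist on GitHub and in the package tarball.
    for path in sorted(repo_files.keys() & pkg_files.keys()):
        repo_hash = hashlib.sha256(repo_files[path]).hexdigest()
        pkg_hash = hashlib.sha256(pkg_files[path]).hexdigest()
        if repo_hash != pkg_hash:
            flagged.append(path)
    return flagged
```

Note that a file shipped only in the tarball (for example generated metadata) is ignored by this rule, which is one way it reduces noise.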
@vdeturckheim was the original implementer, in case he wants to give more context. Overall, we acknowledge that this check is a heuristic and by no means perfect, but your feedback and thoughts are welcome!
It would make sense to combine the checks you already have to further reduce noise. For example, you could run your semgrep rules (maybe even more relaxed ones) only on those files that differ between package and repo. Using the line number info from the semgrep results, you could then keep only those findings that concern code existing only in the package.
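A minimal sketch of that filtering step, under some assumptions: findings are simplified here to `(start_line, end_line, rule_id)` tuples rather than semgrep's real output format, the helper names are made up, and the package-only lines are computed with Python's `difflib`.

```python
import difflib


def package_only_lines(repo_text: str, pkg_text: str) -> set[int]:
    """Return 1-based line numbers of the package file that were added or
    changed relative to the repo version of the same file."""
    repo_lines = repo_text.splitlines()
    pkg_lines = pkg_text.splitlines()
    matcher = difflib.SequenceMatcher(None, repo_lines, pkg_lines, autojunk=False)
    added: set[int] = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" cover lines that exist only in the package.
        if tag in ("insert", "replace"):
            added.update(range(j1 + 1, j2 + 1))
    return added


def filter_findings(findings, pkg_only: set[int]):
    """Keep findings whose line range overlaps a package-only line.

    `findings` is a list of (start_line, end_line, rule_id) tuples,
    a hypothetical simplification of the scanner output.
    """
    return [
        f for f in findings
        if any(line in pkg_only for line in range(f[0], f[1] + 1))
    ]
```

With this, a finding on a line that is identical in the repo gets dropped, while a finding on an injected line is kept.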
From the paper LastPyMile: identifying the discrepancy between sources and packages: "Figure 5 shows that 65% of artifacts and 22% of files present in PyPI have changes with respect to the source code repository."
Would it be possible to share your feedback on the check's precision?
Cheers, Henrik
PS: You can also find the PDF on Google Scholar.