Add Submodules Fuzz Target #1919

DaveLak · 2024-05-30T03:08:16Z

Fuzz Introspector heuristics suggest the Submodule API code represent "optimal analysis targets" that should yield a meaningful increase in code coverage. The changes here introduce a first pass at implementing a fuzz harness that cover the primary APIs/methods related to Submodules. Of particular interest to me is the Submodule.config_writer() coverage.

Please note however, there is likely plenty of room for improvement in this harness in terms of both code coverage as well as performance; the latter of which will see significant benefit from a well curated seed corpus of .gitmodules file like inputs. The ParsingError raised by the fuzzer without a good seed corpus hinders test efficacy significantly.

I have a draft PR up with a seed corpus here: gitpython-developers/qa-assets#5

Fuzz Introspector heuristics suggest the Submodule API code represent "optimal analysis targets" that should yield a meaningful increase in code coverage. The changes here introduce a first pass at implementing a fuzz harness that cover the primary APIs/methods related to Submodules. Of particular interest to me is the `Submodule.config_writer()` coverage. Please note however, there is likely plenty of room for improvement in this harness in terms of both code coverage as well as performance; the latter of which will see significant benefit from a well curated seed corpus of `.gitmodules` file like inputs. The `ParsingError` raised by the fuzzer without a good seed corpus hinders test efficacy significantly.

Byron

Thanks a lot for taking such great care!

Part of me thinks that the submodule implementation is so riddled with inaccuracies and and incorrectness that fuzzing it seems like a waste. The fuzzer can only try to find unexpected exceptions, and maybe that's a small win, but at what cost?

Part of that feeling also stems for the incredible sluggishness of Python in general, so any fuzzing feels wasteful. But that's besides the point I suppose, apologies for the ramblings.

DaveLak · 2024-05-30T20:04:23Z

Please don't apologize, and definitely do not hesitate to reject or push back on any of my PRs! (especially considering that my last few PRs came out of the blue without prior discussion about whether they're even wanted -- sorry about that 😅)

Part of me thinks that the submodule implementation is so riddled with inaccuracies and and incorrectness that fuzzing it seems like a waste. The fuzzer can only try to find unexpected exceptions, and maybe that's a small win, but at what cost?
Part of that feeling also stems for the incredible sluggishness of Python in general, so any fuzzing feels wasteful.

I think your points are perfectly reasonable. Here is how I've been thinking of the value in fuzzing GitPython:

Hopelessly broken or not, GitPython is widely used, so I tend to believe (perhaps too optimistically) that even small wins which result in improved stability can have an outsized impact for some n of users over time. Doing the wrong thing right is better than doing it wrong + unexpected behavior someone will eventually need to debug. That said, I fully respect and understand if you don't think the value justifies the CPU cycles, maintenance burden, or any other reason for that matter.
Similar to the above, I believe continuous fuzzing is well-positioned to identify regressions that traditional unit tests may miss. Even without a CI action run on PRs, the accumulated corpus in ClusterFuzz should be able to identify unexpected exceptions reasonably quickly after they're introduced. But, of course, that is just my hypothesis, which has yet to be validated.
Finally, perhaps from a somewhat naive perspective, I believe that incorporating well-documented and effective fuzzing into GitPython has the potential to benefit the wider Python community. Now, I fully recognize how presumptuous that sounds but here me out. Historically, fuzzing has been used by security experts on lower-level languages with great success, but a lack of easy to use tooling, terse documentation, and few quality examples to emulate has hindered wider adoption in higher-level languages like Python. So I think integrating fuzz testing into a widely-used and complex Python project like GitPython, paired with some quality documentation, can lower the barrier for entry for folks that may be seeking to learn more or just happen to stumble across it (like I did lol.)

But that's besides the point I suppose, apologies for the ramblings.

I think everything you said is very much on-point regarding any of the fuzzing work in this repo. Moreover, I really appreciate hearing your thoughts, so thanks!

In case it isn't clear, I won't be offended if you feel the juice isn't worth the squeeze, and would rather me hold off on any non-maintenance type fuzzing work. Frankly, if you decided you'd rather it all removed ASAP, I'd help remove it. I've learned a lot about Git, Python, fuzzing, and more working on these, so I wouldn't consider it a wasted effort even if the changes never made it to PR, So thanks, @Byron, for the support along the way! 🙂

And now, it's my turn to apologize for the ramblings 😅

Byron · 2024-05-31T05:16:54Z

(especially considering that my last few PRs came out of the blue without prior discussion about whether they're even wanted -- sorry about that 😅)

That's perfectly alright - no need to make it more complicated, just do what you think is right, you are driving this.

That said, I fully respect and understand if you don't think the value justifies the CPU cycles, maintenance burden, or any other reason for that matter.

I am also clearly biased and think that everybody should use gitoxide, so it helps to see your reasoning to unbias me a bit. After all, GitPython has it's value and given its usage, maybe it's probably a good idea to invest in any measure that can make it a little better, maybe particularly due to its many flaws.

Finally, perhaps from a somewhat naive perspective, I believe that incorporating well-documented and effective fuzzing into GitPython has the potential to benefit the wider Python community.

Spreading fuzzing as a technique through GitPython is a great thought and I am fully behind that - if nothing else comes out of it, more Python projects might adopt it which could be a net-win. And even if not, people learn how to use a fuzzer which will help in any programming environment eventually.

In case it isn't clear, I won't be offended if you feel the juice isn't worth the squeeze, and would rather me hold off on any non-maintenance type fuzzing work. Frankly, if you decided you'd rather it all removed ASAP, I'd help remove it. I've learned a lot about Git, Python, fuzzing, and more working on these, so I wouldn't consider it a wasted effort even if the changes never made it to PR, So thanks, @Byron, for the support along the way! 🙂

I'd never do that, and don't feel that way at all. But I do admit that I'd love to see you eventually move to gitoxide - you do tremendous work here and even if it's just for spreading good fuzzing setups, gitoxide could certainly be a good vessel for that.

Eventually.

No pressure :D.

PS: There I'd definitely have more opinions on what to fuzz as well, which might make it more interesting for you.

DaveLak force-pushed the improve-fuzzer-coverage branch from fa52ea2 to 6d52bdb Compare May 30, 2024 05:08

DaveLak mentioned this pull request May 30, 2024

Add initial corpus and dictionary for the fuzz_submodule target gitpython-developers/qa-assets#5

Merged

Byron approved these changes May 30, 2024

View reviewed changes

Byron merged commit 2345c1a into gitpython-developers:main May 30, 2024
26 checks passed

DaveLak deleted the improve-fuzzer-coverage branch May 30, 2024 14:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Submodules Fuzz Target #1919

Add Submodules Fuzz Target #1919

DaveLak commented May 30, 2024

Byron left a comment

DaveLak commented May 30, 2024

Byron commented May 31, 2024

Add Submodules Fuzz Target #1919

Add Submodules Fuzz Target #1919

Conversation

DaveLak commented May 30, 2024

Byron left a comment

Choose a reason for hiding this comment

DaveLak commented May 30, 2024

Byron commented May 31, 2024