Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex stable_hash fix #78

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

rasmushenningsson
Copy link
Contributor

Description

Regex is a mutable struct (for technical reasons), with a Ptr to a compiled regex, which for hash versions < 4 caused Regex objects to hash differently each time.

It seems to be solved for hash version 4, and this PR adds unit tests to guard against regressions.

See also discussion in #66.

Benchmarks

Not applicable.

Copy link

codecov bot commented Oct 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.27%. Comparing base (25d4636) to head (61b5b6e).

❗ There is a different number of reports uploaded between BASE (25d4636) and HEAD (61b5b6e). Click for more details.

HEAD has 11 uploads less than BASE
Flag BASE (25d4636) HEAD (61b5b6e)
12 1
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #78       +/-   ##
===========================================
- Coverage   93.61%   70.27%   -23.35%     
===========================================
  Files           7        7               
  Lines         673      666        -7     
===========================================
- Hits          630      468      -162     
- Misses         43      198      +155     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@haberdashPI haberdashPI self-requested a review October 25, 2024 15:11
Copy link
Member

@haberdashPI haberdashPI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

🤔 I'm a little worried these tests could still pass even if the internal ptr data gets hashed (e.g. if it just so happens the implementation or regex has changed recently, such that in this situation the ptr has the same value).

Have you run the tests for HashVerion <= 3 to verify that they're broken? I think it would be good to include some @test_broken tests for those versions.

@rasmushenningsson
Copy link
Contributor Author

I've added tests for versions <= 3.

I've also added a test that the Ptr is indeed different. So subsequent tests can work under the assumption that the pointers are different. I couldn't think of any better way of doing it.

Also added tests for Regex options for completeness.

@haberdashPI
Copy link
Member

haberdashPI commented Oct 30, 2024

This looks good!

It looks like there are some issues with a recent Pluto verison, unrelated to your PR. I want to make sure that gets fixed and verify that all is well with older versions of Julia. (For StableHashTraits.jl > 2.0 and greater I think it makes to support only >= 1.10, but I don't want to break old hash versions if I don't have to). I can look into that and make the updates in the next day or so.

@rasmushenningsson
Copy link
Contributor Author

There are also some issues with my Regex tests for Julia <= 1.9. I'll look into those.

@rasmushenningsson
Copy link
Contributor Author

The failures are actually due to Regex being a StructTypes.StringType and that the printing of Regex changed between Julia versions 1.9 and 1.10.
See #66 for a more detailed description.

Add unit tests for Regex (HashVersion{4} only).
More explicit tests to make it easier to understand how the Regex internals affect the hash.
This is done to get stable Regex hashes across Julia versions.

Update unit test reference hashes.
@rasmushenningsson rasmushenningsson changed the title Regex unit tests Regex stable_hash fix Nov 13, 2024
@rasmushenningsson
Copy link
Contributor Author

rasmushenningsson commented Nov 13, 2024

I've updated this PR to include a transformer for Regex.
This is breaking in the sense that Regexes now hashes differently, but it can be argued that the old behavior was a bug.

Currently tests on 1.6 do not pass.
In Julia 1.6, the default regex compile flags were set differently. Unfortunately, using the regex compile flags from 1.7- gives an error message on 1.6 when trying constructing the regex object:

julia> Regex("regex", Base.PCRE.UTF | Base.PCRE.MATCH_INVALID_UTF | Base.PCRE.ALT_BSUX | Base.PCRE.UCP, Base.DEFAULT_MATCH_OPTS)
ERROR: ArgumentError: invalid regex compile options: 67764226

I think it's OK that the hash is different on 1.6. The regex compile flags are different after all, so in theory we could get different matches when using the regex. Do you agree?

If you do agree, I can disable the tests on Julia 1.6. I don't know how to mark them as broken, since it's using the @test_reference macro.

Copy link
Member

@haberdashPI haberdashPI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK that the hash is different on 1.6. The regex compile flags are different after all, so in theory we could get different matches when using the regex. Do you agree?

I think making this work for 1.6 shouldn't be too bad. We just want something like this, right?

hash_compile_options = re.compile_options & (PRCRE.CASELESS | PCRE.MULTILINE | PCRE.DOTALL | PRCE.EXTENDED)

hash_match_options = re.match_options & ~PCRE.NO_UTF_CHECK

There are some risks with this approach: it can still be broken by future changes to the Regex implementation. But we already run that risk if e.g. a field name of the Regex struct changes.

##### Regex
#####

transform_type(::Type{Regex}) = "Base.Regex"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this transform is necessary. The default transform_type would hash "Regex". In general we are trying to avoid the use of module prefixes in the type hash, because the parent module is often considered an implementation detail and can change in non-breaking releases. Might as well stay consistent here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants