This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

token_set_ratio Degenerate Case #325

Open
rogerrohrbach opened this issue Oct 13, 2021 · 0 comments

Referring to the description of token_set_ratio in the original blog post: if every token of STRING1 also appears in STRING2, so that the SORTED_INTERSECTION equals STRING1's token set and is a strict subset of STRING2's, the result ratio will be 100. E.g.,

fuzz.token_set_ratio("Deep Learning", "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2")

yields 100. This is patently incorrect, and does not uphold the purported intuition ("because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar").

Looking at fuzz._token_set, we see that it returns

max(
    [
        ratio_func(sorted_sect, combined_1to2),
        ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
)
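
To make the degenerate case concrete, here is a minimal sketch of how the three strings end up being built. It assumes a simplified tokenizer (lowercase, split on whitespace) rather than the library's own preprocessing, and `build_token_strings` is a hypothetical name used only for illustration.

```python
def build_token_strings(s1, s2):
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())

    # Shared tokens, then the tokens unique to each side, each sorted.
    sorted_sect = " ".join(sorted(tokens1 & tokens2))
    sorted_1to2 = " ".join(sorted(tokens1 - tokens2))
    sorted_2to1 = " ".join(sorted(tokens2 - tokens1))

    # strip() removes the trailing space left when a remainder is empty.
    combined_1to2 = (sorted_sect + " " + sorted_1to2).strip()
    combined_2to1 = (sorted_sect + " " + sorted_2to1).strip()
    return sorted_sect, combined_1to2, combined_2to1


sect, c12, c21 = build_token_strings(
    "Deep Learning",
    "Machine Learning and Deep Learning with Python",
)
print(repr(sect), repr(c12), repr(c21))
print(sect == c12)
```

Because the remainder of the first string is empty, `sorted_sect` and `combined_1to2` are the same string, so the first comparison in the `max` is a string against itself and scores 100 regardless of how much extra material the second string contains.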

It appears the assumption is that the string remainder will never be empty. Perhaps something like this is more appropriate:

max(
    [
        0 if sorted_sect == combined_1to2 else ratio_func(sorted_sect, combined_1to2),
        0 if sorted_sect == combined_2to1 else ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
)
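
A rough way to compare the two variants side by side is sketched below. It is not the library's code: `ratio` is a stand-in built on the standard library's `difflib.SequenceMatcher`, the tokenizer is simplified to lowercase plus whitespace splitting, and `token_strings`, `score_current`, and `score_proposed` are hypothetical names.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in for ratio_func (assumption): 0-100 similarity via difflib.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_strings(s1, s2):
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    sect = " ".join(sorted(t1 & t2))
    c12 = (sect + " " + " ".join(sorted(t1 - t2))).strip()
    c21 = (sect + " " + " ".join(sorted(t2 - t1))).strip()
    return sect, c12, c21


def score_current(s1, s2):
    # Mirrors the max(...) quoted from fuzz._token_set above.
    sect, c12, c21 = token_strings(s1, s2)
    return max(ratio(sect, c12), ratio(sect, c21), ratio(c12, c21))


def score_proposed(s1, s2):
    # Mirrors the suggested change: ignore a comparison whose remainder is empty.
    sect, c12, c21 = token_strings(s1, s2)
    return max(
        0 if sect == c12 else ratio(sect, c12),
        0 if sect == c21 else ratio(sect, c21),
        ratio(c12, c21),
    )


short = "Deep Learning"
long_title = "Machine Learning and Deep Learning with Python"
print(score_current(short, long_title), score_proposed(short, long_title))
print(score_proposed(short, "learning deep"))
```

Under these assumptions the current variant scores the short/long pair at 100, while the proposed variant falls back to comparing the two combined strings and scores lower. Two strings with identical token sets still score 100 under the proposal, because the third comparison is unaffected.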