Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick Question #389

Closed
kd10041 opened this issue Jun 26, 2024 · 4 comments
Closed

Quick Question #389

kd10041 opened this issue Jun 26, 2024 · 4 comments
Labels
documentation Improvements or additions to documentation question Further information is requested

Comments

@kd10041
Copy link

kd10041 commented Jun 26, 2024

Where can I see the implementation of .partial_ratio() ? Can you let me know the logic which is utilized for this method.
Thanks in advance!

@kd10041
Copy link
Author

kd10041 commented Jun 26, 2024

I am new to python!
So I went through the codebase of rapidfuzz. So basically the implementation of partial is divided to 2 parts

1. short needle implementation length<=64

Here a sliding window of length=min(len(s1),len(s2)) is used. and fuzz.ratio() is calculated on the all the alignments possible.
for example: I took this example from this issue

s1='real'
s2='barcelona'
fuzz.ratio(s1,s2)

So the best alignment here is window size is equals to 4 here.

r_eal
rce_l 

This requires two operations output is 1-2/(4+4) = 75 which is the exact output as given by rapidfuzz.

2. long needle implementation length>64

This is similar to as implemented by fuzzywuzzy. The logic here is find the best alignment from shorter string to the Longest common substring of longer string. and find similarity score using fuzz.ratio() from the needle to longest common substring.

Can anyone give example clear this part up?
Any help would be appreciated regarding this.

@maxbachmann
Copy link
Member

Oh the documentation is simply outdated. In the past I did use two implementations since I didn't have a way to make the implementation for long needles reasonably fast. However this did mean that the implementation for longer needles was similar to whats done in fuzzywuzzy, which doesn't always provide the correct results.

I have since found a better way to filter out impossible results and so I use the "correct" implementation both for short and long needles. You will still notice a drop in performance once the needle has more than 64 characters though.

From a user perspective it's simply a sliding window where the substring taken from the longer string has a length of.

  • <= the length of the needle if it starts/ends at the start/end of the longer string
  • the length of the needle if it's somewhere in the middle of the longer string

The pure Python fallback implementation is:

def _partial_ratio_short_needle(s1, s2, score_cutoff):

The C++ implementation is https://github.com/rapidfuzz/rapidfuzz-cpp/blob/10426d24cd7479df0fe8c78b17877e756e1c3cd5/rapidfuzz/fuzz_impl.hpp#L68

The actual implementation doesn't actually check all alignments since it can use knowledge about the maximum distance change per shift of the sliding window to filter out some comparisons.

@kd10041
Copy link
Author

kd10041 commented Jun 27, 2024

Thank you @maxbachmann for clear explanation.
Do you plan on updating the documentation anytime soon?

@maxbachmann maxbachmann added documentation Improvements or additions to documentation question Further information is requested labels Jun 27, 2024
@maxbachmann
Copy link
Member

Yes I will probably fix the docs at some point this week

@kd10041 kd10041 closed this as completed Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants