-
Notifications
You must be signed in to change notification settings - Fork 3
Add fuzzy matching #103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Add fuzzy matching #103
Conversation
- Config use_fuzzy_match and fuzzy_match_threshold
|
@iamsims please do follow the branch naming convention we have such as Thanks. |
akd/tools/source_validator.py
Outdated
|
|
||
| # if ( | ||
| # whitelisted_title in source_title | ||
| # or source_title in whitelisted_title | ||
| # ): | ||
| # # Check if it's a meaningful match (not just common words) | ||
| # if len(whitelisted_title) > 10 or len(source_title) > 10: | ||
| # return True, category_name, 0.8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this remove older logic no?
Instead of having destructive change, can we have constructive change. That is: first check if fuzzy is enabled. If not enabled, then use the older logic. Don't replace current logic because not sure fuzzy will work 100% of the time.
I'd suggest something like
if self.config.use_fuzzy_match:
....<your code>
...return
continue with whatever logic we have previously. This should fix this comment.
akd/tools/source_validator.py
Outdated
|
|
||
|
|
||
| if self.config.use_fuzzy_match: | ||
| fuzzy_score_set = token_set_ratio(source_title, whitelisted_title) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have the fuzzy match function name as configurable? That is fuzzy_fn_name or something and get the function based on that name. It will give more configurability on what fuzzy matching algo to apply
|
@iamsims ALso let's add source validation test cases as well Thanks. |
- Configure the type of fuzzy match function - Fallback to the original algorithm for matching if fuzzy match is disabled
|
@iamsims What's the test coverage now? Could you run |
akd/tools/source_validator.py
Outdated
| token_set = "token_set" | ||
| token_sort = "token_sort" | ||
| ratio = "ratio" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have enum all caps lock like TOKEN_SET etc....
TOKEN_SET = "token_set"
...
akd/tools/source_validator.py
Outdated
| scorer_map = { | ||
| "token_set": token_set_ratio, | ||
| "token_sort": token_sort_ratio, | ||
| "ratio": ratio, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this could be class-level attribute?
class SourceValidator(...):
...
_scorer_map = dict(token_set = token_set_ratio, ...)
...
def __init__(self, ...):
...- Change the cases of enum constants to CAPS - Make class variable of scorer_map instead of function variable
|
@iamsims is this PR still relevant? |
Fuzzy match uses fuzzy token set ratio, in default fuzzy logic is set to false and the fuzzy threshold for match is 87.