Conversation

@iamsims
Collaborator

@iamsims iamsims commented Aug 5, 2025

  • Add config options use_fuzzy_match and fuzzy_match_threshold

Fuzzy matching uses the token set ratio. By default, fuzzy matching is disabled (use_fuzzy_match is false) and the match threshold (fuzzy_match_threshold) is 87.
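
For readers unfamiliar with the scoring, here is a minimal sketch of how a token-set score and the 87 threshold could interact. It is an illustration only: `MatchConfig`, `token_set_ratio_sketch`, and `is_fuzzy_match` are hypothetical names, and the scorer is a standard-library (`difflib`) stand-in, not the `token_set_ratio` function this PR imports.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MatchConfig:
    # Hypothetical config holder; the defaults mirror the PR description.
    use_fuzzy_match: bool = False
    fuzzy_match_threshold: float = 87.0


def _ratio(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale (difflib-based)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_ratio_sketch(a: str, b: str) -> float:
    """Rough token-set similarity: compare the shared tokens against each
    string's shared-plus-unique tokens and keep the best score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    shared = " ".join(sorted(tokens_a & tokens_b))
    full_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    full_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(_ratio(shared, full_a), _ratio(shared, full_b), _ratio(full_a, full_b))


def is_fuzzy_match(a: str, b: str, config: MatchConfig) -> bool:
    """True only when fuzzy matching is enabled and the score clears the threshold."""
    if not config.use_fuzzy_match:
        return False
    return token_set_ratio_sketch(a, b) >= config.fuzzy_match_threshold
```

In this sketch, word order and case do not affect the score, and a title whose tokens are a subset of the other title's tokens scores 100, so the threshold mostly guards against titles that share only some words.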

- Config use_fuzzy_match and fuzzy_match_threshold
@NISH1001
Collaborator

NISH1001 commented Aug 6, 2025

@iamsims please follow our branch naming convention, e.g. feature/, bugfix/, hotfix/, refactor/, etc.

Thanks.

Comment on lines 348 to 355

# if (
#     whitelisted_title in source_title
#     or source_title in whitelisted_title
# ):
#     # Check if it's a meaningful match (not just common words)
#     if len(whitelisted_title) > 10 or len(source_title) > 10:
#         return True, category_name, 0.8
Collaborator

I think this removes the older logic, no?

Instead of a destructive change, can we make a constructive one? That is: first check whether fuzzy matching is enabled. If it isn't, use the older logic. Don't replace the current logic, because we can't be sure fuzzy matching will work 100% of the time.

I'd suggest something like

if self.config.use_fuzzy_match:
  ....<your code>
  ...return

and then continue with whatever logic we had previously. That should address this comment.
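
To make the suggestion concrete, here is a minimal, self-contained sketch of the guarded early-return pattern. The function name `titles_match`, its parameters, and the two-value return are assumptions for illustration (the real method also returns a category name), and `difflib` stands in for the fuzzy scorer.

```python
from difflib import SequenceMatcher


def titles_match(source_title, whitelisted_title, use_fuzzy_match=False, threshold=87.0):
    """Return (is_match, confidence) for a pair of titles."""
    if use_fuzzy_match:
        # New path: runs only when explicitly enabled, and returns on its own.
        score = 100.0 * SequenceMatcher(None, source_title, whitelisted_title).ratio()
        return score >= threshold, score / 100.0

    # Older logic, left untouched: substring containment plus a length
    # check so that very short titles do not match by accident.
    if whitelisted_title in source_title or source_title in whitelisted_title:
        if len(whitelisted_title) > 10 or len(source_title) > 10:
            return True, 0.8
    return False, 0.0
```

With the flag off, the function behaves exactly as the older substring check did, so enabling fuzzy matching becomes an opt-in change rather than a replacement.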



if self.config.use_fuzzy_match:
    fuzzy_score_set = token_set_ratio(source_title, whitelisted_title)
Collaborator

Can we make the fuzzy match function configurable by name? That is, add fuzzy_fn_name or something similar and look up the function based on that name. It would give more control over which fuzzy matching algorithm to apply.
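
One way this could look, as a rough sketch: a plain dict maps the configured name to a scorer, and an unknown name fails loudly. `SCORERS` and `get_scorer` are hypothetical names, and the two scorers are `difflib`-based stand-ins rather than the library functions used in the PR.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Plain character-level similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a, b):
    """Similarity after sorting each string's whitespace-separated tokens."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


# Hypothetical registry: configured name -> scoring function.
SCORERS = {"ratio": ratio, "token_sort": token_sort_ratio}


def get_scorer(fuzzy_fn_name):
    """Look up a scorer by its configured name, rejecting unknown names."""
    try:
        return SCORERS[fuzzy_fn_name]
    except KeyError:
        raise ValueError(
            f"Unknown fuzzy_fn_name {fuzzy_fn_name!r}; expected one of {sorted(SCORERS)}"
        ) from None
```

Raising on an unknown name, instead of silently falling back to a default, surfaces config typos early.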

@NISH1001
Collaborator

NISH1001 commented Aug 6, 2025

@iamsims Also, let's add source validation test cases, e.g. in tests/tools/test_source_validator or something like that.

Thanks.

iamsims added 2 commits August 6, 2025 21:43
- Configure the type of fuzzy match function
- Fallback to the original algorithm for matching if fuzzy match is disabled
@iamsims iamsims requested a review from NISH1001 August 7, 2025 02:45
@NISH1001
Collaborator

NISH1001 commented Aug 7, 2025

@iamsims What's the test coverage now?

Could you run python -m pytest --cov=akd --cov-report=term-missing tests/ and post the coverage results here? Thanks.

Comment on lines 32 to 34
token_set = "token_set"
token_sort = "token_sort"
ratio = "ratio"
Collaborator

@NISH1001 NISH1001 Aug 7, 2025

Let's make the enum members all caps, like TOKEN_SET etc.

TOKEN_SET = "token_set"
...
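
Spelled out in full, the renamed enum might look like the sketch below. The class name `FuzzyMatchType` and the `str` mixin are assumptions; only the three member values come from the diff under review.

```python
from enum import Enum


class FuzzyMatchType(str, Enum):
    # Member names in caps; values stay lowercase so existing config
    # strings such as "token_set" keep working.
    TOKEN_SET = "token_set"
    TOKEN_SORT = "token_sort"
    RATIO = "ratio"
```

Because lookup by value is unchanged, `FuzzyMatchType("token_set")` still resolves to `FuzzyMatchType.TOKEN_SET`, so only code that refers to the members by name needs updating.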

Comment on lines 333 to 337
scorer_map = {
    "token_set": token_set_ratio,
    "token_sort": token_sort_ratio,
    "ratio": ratio,
}
Collaborator

Maybe this could be a class-level attribute?

class SourceValidator(...):
  ...
  _scorer_map = dict(token_set = token_set_ratio, ...)
  ...
  def __init__(self, ...):
    ...
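
A slightly fuller sketch of the class-level variant, for illustration only. `SourceValidatorSketch`, its constructor argument, and the `score` method are hypothetical, and the scorers are `difflib`-based stand-ins for the library functions.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Plain character-level similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a, b):
    """Similarity after sorting each string's whitespace-separated tokens."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


class SourceValidatorSketch:
    # Built once when the class is defined and shared by every instance,
    # instead of being rebuilt on each call to the matching method.
    _scorer_map = {"ratio": ratio, "token_sort": token_sort_ratio}

    def __init__(self, fuzzy_fn_name="ratio"):
        # Resolve the scorer once, at construction time.
        self._scorer = self._scorer_map[fuzzy_fn_name]

    def score(self, a, b):
        return self._scorer(a, b)
```

Since the functions live inside a dict and the chosen one is stored as an instance attribute, they are not turned into bound methods and can be called with just the two strings.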

- Change the cases of enum constants to CAPS
- Make class variable of scorer_map instead of function variable
@NISH1001
Copy link
Collaborator

NISH1001 commented Sep 3, 2025

@iamsims is this PR still relevant?

3 participants