PhraseMatcher Incorrectly Matching String Twice with Single Char Difference #5300

This is correctly showing one match for each pattern. I think the confusion is related to the tokenization. The default English tokenizer tokenizes 'lgmd1g' into two tokens:

['lgmd1', 'g']

When the PhraseMatcher searches the document nlp('lgmd1g'), the first pattern matches the first token alone (the span doc[0:1] in the first match), and the second pattern matches a phrase of two tokens (the span doc[0:2] in the second match). The two matches therefore come from two different patterns, not from one pattern matching twice.
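
To make the token-level behaviour concrete, here is a minimal pure-Python sketch of token-aligned phrase matching. It is not spaCy's implementation and does not call spaCy. The helper `find_phrase_matches`, the pattern names `PATTERN_A` and `PATTERN_B`, and the pattern token lists are hypothetical, chosen only to reproduce the [0:1] and [0:2] slices described above. The token list is the one shown above for 'lgmd1g'.

```python
def find_phrase_matches(doc_tokens, patterns):
    """Return (pattern_name, start, end) for every token-aligned match.

    A pattern matches only where its whole token sequence equals a
    contiguous slice of the document's tokens.
    """
    matches = []
    for name, pattern_tokens in patterns.items():
        n = len(pattern_tokens)
        for start in range(len(doc_tokens) - n + 1):
            if doc_tokens[start:start + n] == pattern_tokens:
                matches.append((name, start, start + n))
    return matches


# Tokens reported above for the text 'lgmd1g'.
doc_tokens = ["lgmd1", "g"]

# Assumed patterns: one that tokenizes to a single token,
# one that tokenizes to the same two tokens as the document.
patterns = {
    "PATTERN_A": ["lgmd1"],
    "PATTERN_B": ["lgmd1", "g"],
}

matches = find_phrase_matches(doc_tokens, patterns)
for name, start, end in matches:
    print(name, (start, end), doc_tokens[start:end])
```

Under these assumptions the sketch reports one match per pattern, with overlapping spans (0, 1) and (0, 2). In a real pipeline, printing `[t.text for t in nlp('lgmd1g')]` is a quick way to check how the tokenizer actually split the text before debugging the matcher.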

Answer selected by ines
Labels
feat / matcher Feature: Token, phrase and dependency matcher
This discussion was converted from issue #5300 on December 11, 2020 00:24.