PhraseMatcher Incorrectly Matching String Twice with Single Char Difference #5300
I am creating a PhraseMatcher with different rules. I came across an instance where rules with different matching patterns matched the same string. I am wondering if it's a hash collision?
How to reproduce the behaviour
import spacy
nlp = spacy.load("en_core_web_md", vectors=False)
matcher = spacy.matcher.PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc('lgmd1')]
matcher.add(str('0'), None, *patterns)
patterns = [nlp.make_doc('lgmd1g')]
matcher.add(str('1'), None, *patterns)
print(matcher(nlp('lgmd1g')))
Produces: [(746762829127501960, 0, 1), (5533571732986600803, 0, 2)]
Your Environment
Replies: 1 comment
This is correctly showing one match for each pattern. I think the confusion is related to the tokenization. The default English tokenizer tokenizes 'lgmd1g' into two tokens: 'lgmd1' and 'g'. When the PhraseMatcher searches the document nlp('lgmd1g'), the first pattern matches the first token (the span with document slice [0:1] in the first match) and the second pattern matches a phrase containing two tokens ([0:2] in the second match).
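To make the two spans concrete, here is a toy model of token-level phrase matching in plain Python. It is only a sketch, not spaCy's implementation: `phrase_matches` is a hypothetical helper, the labels are the plain strings '0' and '1' rather than the hashed IDs spaCy returns, and the token list ['lgmd1', 'g'] is assumed from the match spans reported above.

```python
from typing import Dict, List, Tuple


def phrase_matches(
    doc_tokens: List[str], patterns: Dict[str, List[str]]
) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) for every token span equal to a pattern.

    Hypothetical helper for illustration; spaCy's PhraseMatcher is
    implemented differently but reports spans in the same token units.
    """
    matches = []
    for label, pattern in patterns.items():
        n = len(pattern)
        # Slide a window of the pattern's length over the document tokens.
        for start in range(len(doc_tokens) - n + 1):
            if doc_tokens[start:start + n] == pattern:
                matches.append((label, start, start + n))
    return matches


# Assumed tokenization of 'lgmd1g' (inferred from the reported spans).
doc_tokens = ["lgmd1", "g"]

# Each pattern is itself a token sequence, just like nlp.make_doc(...) output.
patterns = {
    "0": ["lgmd1"],       # pattern text 'lgmd1'  -> one token
    "1": ["lgmd1", "g"],  # pattern text 'lgmd1g' -> two tokens
}

matches = phrase_matches(doc_tokens, patterns)
print(matches)  # [('0', 0, 1), ('1', 0, 2)]
```

Both patterns match because the comparison happens token by token, not on the raw string: pattern '0' equals the first token alone, while pattern '1' equals the two-token sequence. No hash collision is involved.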