Commit 167790c

ngram tokenizer split on whitespace (#74)
1 parent 203ee2e commit 167790c

1 file changed, +1 -4 lines changed


analyzers/ngrams/main.py

Lines changed: 1 addition & 4 deletions
@@ -102,10 +102,7 @@ def get_ngram_rows(ngrams_by_id: dict[str, int]):
 
 def tokenize(input: str) -> list[str]:
     """Generate words from input string."""
-
-    output = re.split(r"\W+", input.lower())
-    output = [value for value in output if "http" not in value]
-    return output
+    return re.split(" +", input.lower())
 
 
 def ngrams(tokens: list[str], min: int, max: int):
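
To make the behavioral change in this diff concrete, here is a minimal sketch that runs the removed and added bodies of `tokenize` side by side. The function names `old_tokenize`, `new_tokenize`, and `whitespace_tokenize` and the sample strings are hypothetical and do not appear in the repository. `whitespace_tokenize` is not part of this commit; it is included only for comparison, because the pattern `" +"` matches the space character alone rather than all whitespace.

```python
import re


def old_tokenize(text: str) -> list[str]:
    """Removed version: split on runs of non-word characters, drop tokens containing "http"."""
    output = re.split(r"\W+", text.lower())
    return [value for value in output if "http" not in value]


def new_tokenize(text: str) -> list[str]:
    """Added version: split on runs of the space character only."""
    return re.split(" +", text.lower())


def whitespace_tokenize(text: str) -> list[str]:
    """Hypothetical alternative (an assumption, not in the commit): split on any whitespace."""
    return text.lower().split()


sample = "Hello, world! See https://example.com now"

print(old_tokenize(sample))
# ['hello', 'world', 'see', 'example', 'com', 'now']

print(new_tokenize(sample))
# ['hello,', 'world!', 'see', 'https://example.com', 'now']

print(new_tokenize("a\tb  c"))
# ['a\tb', 'c']

print(whitespace_tokenize("a\tb  c"))
# ['a', 'b', 'c']
```

With the added version, punctuation stays attached to tokens and URLs survive as single tokens, while tabs and newlines no longer separate tokens. Input that begins with a space also yields an empty first token, for example `new_tokenize(" a")` returns `['', 'a']`.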
