Positive Tokenization? #13383

dave-richards · 2024-03-15T22:24:04Z

dave-richards
Mar 15, 2024

I am new to NLU and spaCy, but I have been reading the docs and doing some testing. I would like to implement a custom tokenizer for Biblical Greek. My reading of the tokenizer docs is that the customizations are "negative", i.e. a token is not a whitespace character, and it's not a prefix, a suffix, or an infix. Everything else is a valid token. I would like to work the other way around: define exactly what counts as a token and continues down the pipeline, and skip over everything that does not. Is my understanding correct, and is it possible to invert the logic to work as I would like?

svlandeg · 2024-03-19T10:43:12Z

svlandeg
Mar 19, 2024
Maintainer

Hi!

Just to be sure: are you aware that we support "Ancient Greek" with the language tag grc? Does this not work sufficiently well for you?
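
For readers wondering what the "positive" approach from the question could look like, here is a minimal sketch in plain Python. It is not taken from spaCy or from this thread. It assumes that a token is a maximal run of Greek letters and that everything else (punctuation, digits, Latin text, whitespace) is skipped. The names is_greek_letter and positive_tokenize are hypothetical, chosen only for illustration.

```python
import unicodedata


def is_greek_letter(ch: str) -> bool:
    # Assumption: a character belongs to a token only if it is alphabetic
    # and its Unicode name starts with "GREEK".
    return ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")


def positive_tokenize(text: str) -> list[str]:
    # Normalize to NFC so letters with diacritics become single
    # precomposed code points where Unicode provides them.
    text = unicodedata.normalize("NFC", text)
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        if is_greek_letter(ch):
            current.append(ch)
        elif current:
            # A non-token character closes the current token and is dropped.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(positive_tokenize("λόγος, καὶ 12 abc θεόν."))
```

To use something like this in a pipeline, spaCy documents that nlp.tokenizer can be replaced by any callable that takes a text and returns a Doc, and that a Doc can be constructed from a list of words. Note that dropping characters this way means the Doc no longer reproduces the original text exactly, and the elision apostrophe would be skipped under the assumption above.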
