I need to perform a lot of retokenization before running a training pipeline, but from the documentation I cannot tell whether that is possible and, if so, how to specify it in the config file.
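To make the question concrete: from what I can glean from the docs, I would expect the config to look roughly like the sketch below, where `retok_postprocess` is a hypothetical custom component of mine (sketched further down) and `tok2vec`/`ner` just stand in for the rest of the pipeline; I can't confirm that this is the intended way:

```ini
[nlp]
lang = "ar"
pipeline = ["retok_postprocess", "tok2vec", "ner"]

[components.retok_postprocess]
factory = "retok_postprocess"
```

and that training would then be invoked as `python -m spacy train config.cfg --code functions.py`, with `functions.py` (a name I made up) containing the component registration.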
In #5921, @svlandeg showed how to deal with a similar issue for the case at hand (i.e. training NER) without adding a custom component; as a result, she didn't explicitly answer either the original question or mine.
As I explained in issue #13248 and, more extensively, in discussion #7146, I'm struggling to develop a viable tokenizer for the Arabic language. To do that, I think I need both to extend the data (the configuration files) in the current tokenizer implementation and to add a considerable amount of post-processing.
In the past, I implemented the post-processing in Cython and began to get significantly improved results from the `debug data` and `train` commands. Then I installed spaCy from source, but in that setup I wasn't able to integrate my Cython code with the spaCy codebase; more precisely, I couldn't import `tokenizer.Tokenizer` and `vocab.Vocab`.
Now, I suspect that being able to put a component just after spaCy's `Tokenizer` in the training pipeline (and in the production pipeline) would be much cleaner and probably more efficient.
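For illustration, here is a minimal sketch of the kind of component I have in mind, built only on the stock `@Language.component` and `Doc.retokenize` APIs; the hyphen-merging rule is just a placeholder for my actual Arabic post-processing:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("retok_postprocess")
def retok_postprocess(doc: Doc) -> Doc:
    # Placeholder rule: merge "A-B" triples written without internal
    # whitespace into a single token; my real Arabic rules would go here.
    spans = []
    i = 0
    while i < len(doc) - 2:
        if (doc[i + 1].text == "-"
                and not doc[i].whitespace_
                and not doc[i + 1].whitespace_):
            spans.append(doc[i : i + 3])
            i += 3  # jump past the span so merges never overlap
        else:
            i += 1
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

# Outside of a config-driven setup, the component can be placed first,
# i.e. it runs right after tokenization:
nlp = spacy.blank("ar")
nlp.add_pipe("retok_postprocess", first=True)
```

If this is indeed the recommended approach, my remaining question is how to wire it into the training config, as sketched above.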
Could somebody answer my question and/or suggest a solution for my problem? Thanks in advance!