Pass additional arguments through Language.__call__/pipe to Language.make_doc #9077
There's a definite need for this kind of functionality. We're currently planning to do this slightly differently by letting you just pass
There is also always the option to do this with a custom tokenizer so you can have a single pipeline call instead of multiple steps, but it's much more of a hassle if you want to extend the built-in
The draft PR: #9069
-
I've created a `Language` subclass to extend the `make_doc` method with some additional logic. `make_doc` is normally invoked by `Language.__call__` and `Language.pipe` to create a `Doc` from text.

To accomplish my goals, I need to pass additional information to `make_doc`. However, there isn't a way to do this via `__call__` or `pipe`. `Language.pipe` does have the notion of a context, but this is simply yielded to the caller.

I propose that `__call__` and `make_doc` be modified to accept additional args and/or keyword args, that `__call__` pass the additional args/kwargs to `make_doc`, and that `pipe` pass the context per text to `make_doc`.

Example Code
The current signature for `Language.__call__` and the invocation of `make_doc` is:

I've extended this and `make_doc` in my custom `Language` subclass to accept and pass additional arguments and keyword arguments.

For `Language.pipe`, the context associated with each text could be passed to `make_doc`. I haven't prototyped this because it would involve a bigger change, and there are extra considerations for the multiprocessing pipeline.

Use Case
One of my use cases for this is custom affix handling. I have some text preprocessing that currently strips away affixes before the text goes into a spaCy language pipeline. The affixes are different for each piece of text, so they can't be generalized into an affix pattern for the tokenizer. Here are some made-up examples.
Because the affixes are unique (and not nearly as simple as this in the actual data!), I want to pass them to `make_doc` and store the data on a `Doc` extension. That way, I can have a custom component read the affix, retokenize the `Doc`, and perform other operations on the affixes.

Rationale
The main reason additional arguments to `make_doc` are needed is that custom pipeline components need additional information about the text, such as the affix in the above example, on a per-text basis. Setting the information on the `Doc` returned by `__call__`/`pipe` is too late, because the `Doc` has already passed through the language pipeline. The `make_doc` method is the ideal place to set additional information because it constructs the `Doc` that is passed through the language pipeline, so additional information can be set and the `Doc` otherwise customized before pipeline components run.

At the moment, I have this working in my project. But it's not ideal, because I have to copy the entire implementation of `Language.__call__` just to change the call to `make_doc`.
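
The proposed flow can be sketched without spaCy itself. The following is a minimal, hypothetical stand-in, not spaCy's implementation: `ToyDoc`, `ToyLanguage`, the `split_affix` component, and the `prefix` keyword are all invented for illustration, and spreading the `pipe` context into keyword arguments is just one assumed way of passing it on. It only shows the idea that `__call__` and `pipe` hand per-text data to `make_doc`, so the data is on the document before any component runs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: text, tokens, and per-document data."""
    text: str
    tokens: List[str] = field(default_factory=list)
    user_data: Dict[str, Any] = field(default_factory=dict)


Component = Callable[[ToyDoc], ToyDoc]


class ToyLanguage:
    """Hypothetical stand-in for Language, reduced to make_doc, __call__, pipe."""

    def __init__(self, components: Iterable[Component] = ()) -> None:
        self.components = list(components)

    def make_doc(self, text: str, **kwargs: Any) -> ToyDoc:
        # Whitespace tokenization; extra keyword arguments are stored on the
        # document before any component sees it.
        return ToyDoc(text=text, tokens=text.split(), user_data=dict(kwargs))

    def __call__(self, text: str, **kwargs: Any) -> ToyDoc:
        # Proposed change: forward the caller's extra arguments to make_doc.
        doc = self.make_doc(text, **kwargs)
        for component in self.components:
            doc = component(doc)
        return doc

    def pipe(
        self, items: Iterable[Tuple[str, Dict[str, Any]]]
    ) -> Iterator[Tuple[ToyDoc, Dict[str, Any]]]:
        # Proposed change: each text's context also reaches make_doc
        # (assumed here to be a dict of keyword arguments), and is still
        # yielded back to the caller alongside the document.
        for text, context in items:
            yield self(text, **context), context


def split_affix(doc: ToyDoc) -> ToyDoc:
    """Hypothetical component: split a per-text prefix off the first token."""
    prefix = doc.user_data.get("prefix")
    if prefix and doc.tokens:
        first = doc.tokens[0]
        if first.startswith(prefix) and first != prefix:
            doc.tokens[:1] = [prefix, first[len(prefix):]]
    return doc


nlp = ToyLanguage([split_affix])
doc = nlp("xqHello world", prefix="xq")
print(doc.tokens)  # ['xq', 'Hello', 'world']
```

In a real project the same shape would presumably live in a `Language` subclass, with the per-text data stored on a `Doc` extension attribute rather than a plain dict.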