
Takes too long to parse doc results #18

Open
Joselinejamy opened this issue Aug 12, 2019 · 4 comments

Comments

@Joselinejamy

Hello,
It takes too long to parse the doc object, i.e. to iterate over the sentences and the tokens in them. Is that expected?

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

for line in lines:
    doc = nlp.pipe([line])

The above code takes only a few milliseconds (apart from initialisation) to run over 500 sentences,

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

for line in lines:
    doc = nlp.pipe([line])
    token_details = []

    for sents in doc:
        for tok in sents:
            token_details.append([tok.text, tok.lemma_, tok.pos_])

while this takes almost a minute (apart from initialisation) to run over the same 500 sentences.

P.S.: I've put nlp.pipe() inside the for loop intentionally, to get all the tokens for one line even if it gets segmented into multiple sentences.

@honnibal
Member

honnibal commented Aug 12, 2019

@Joselinejamy nlp.pipe() is a generator, so you're not actually executing the parser in the first block. I think that's why it seems faster: it's not actually doing the work. To make sure the parse is completed, you'll need something like:

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

word_count = 0
for doc in nlp.pipe(lines):
    word_count += len(doc)
print(word_count)

The main efficiency problem we have at the moment is that we don't have support for batching the predictions and returning a Doc object per item. We'd gladly accept a PR for this.
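In the meantime, a rough workaround sketch (not an official batching API, and assuming the stanfordnlp tokenizer treats blank lines as sentence breaks, which may not hold exactly) is to push all the text through the pipeline in a single call, so the neural models execute once rather than once per line:

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

# One pipeline invocation for all lines instead of one per line; note the
# sentence boundaries in the combined Doc come from stanfordnlp and may
# not line up exactly with the original `lines`.
doc = nlp("\n\n".join(lines))
token_details = [[tok.text, tok.lemma_, tok.pos_] for tok in doc]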

@Joselinejamy
Author

@honnibal Thank you for the instant response. But when I ran the code below with just spaCy's model, it took relatively little time, around 5 seconds.

import time
import spacy

start = time.time()
spacy_nlp = spacy.load('en')
for line in lines:
    doc = spacy_nlp.pipe([line])
    token_details = []

    for sent in doc:
        for tok in sent:
            token_details.append([tok.text, tok.lemma_, tok.pos_])

print("Time taken : %f " % (time.time() - start))

As per the documentation,

If language data for the given language is available in spaCy, the respective language class will be used as the base for the nlp object – for example, English()

So if the same English object is used, why is it taking so much more time? Or has my understanding diverged from what is intended?

@diegollarrull

Hi, I'm also seeing a drastic performance decrease when using stanza. For comparison, here's a project I'm working on where I'm running a number of different parsers over more than 6000 sentences. You can see that running CoreNLP 3 + CoreNLP 4 + spaCy takes roughly one eighth of the time of running CoreNLP 3 + CoreNLP 4 + Stanza through spacy_stanza.

[Screenshot: timing comparison of the parser configurations, taken 2020-06-20]

Could this be GPU-related as well? These tests were run on a CPU, not a GPU.
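For reference, stanza's Pipeline takes a use_gpu argument (on by default when CUDA is detected), so comparing against a GPU run would just be:

import stanza

# use_gpu is a stanza.Pipeline argument; it defaults to True when a CUDA
# device is available, so the CPU-only runs above were already the
# fallback path.
snlp = stanza.Pipeline(lang="en", use_gpu=True)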

@adrianeboyd
Contributor

The stanza models are just much slower than the typical spaCy core models. spacy-stanza is just a wrapper that hooks stanza into the tokenizer part of the spaCy pipeline, so it looks like the pipeline components are the same as in a plain English() model, but underneath the tokenizers are different. You can see:

import spacy
import stanza
import spacy_stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="en")
nlp_stanza = StanzaLanguage(snlp)

nlp_spacy = spacy.blank("en")  # equivalent to English()

# both are the same type of Language pipeline
assert isinstance(nlp_stanza, spacy.language.Language)
assert isinstance(nlp_spacy, spacy.language.Language)

# both pipe_names are [] (no components beyond the tokenizer)
assert nlp_stanza.pipe_names == nlp_spacy.pipe_names

# however the tokenizers are completely different, and the
# spacy_stanza "tokenizer" is doing all the time-consuming stanza processing
assert isinstance(nlp_stanza.tokenizer, spacy_stanza.language.Tokenizer)
assert isinstance(nlp_spacy.tokenizer, spacy.tokenizer.Tokenizer)

And as Matt said above, there's no good batching solution for stanza at the moment, so the speed difference between nlp_spacy.pipe() and the spacy-stanza pipeline will be even larger.
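A minimal timing sketch to see the gap directly, reusing nlp_spacy and nlp_stanza from the snippet above (assumes `lines` is a list of sentence strings; absolute numbers will vary with hardware):

import time

# Iterating over .pipe() forces the work for both pipelines, so this is a
# fair comparison, unlike the first block in the original report.
for name, pipeline in [("spacy", nlp_spacy), ("stanza", nlp_stanza)]:
    start = time.time()
    n_tokens = sum(len(doc) for doc in pipeline.pipe(lines))
    print("%s: %d tokens in %.2fs" % (name, n_tokens, time.time() - start))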
