Pipeline Speed for 'et' 15GB = 55 Days #979

Answered by hermanpetrov
hermanpetrov asked this question in Q&A

My error was pointed out to me recently: I was calling the nlp pipeline once per sentence instead of once per chunk.
A for loop was feeding sentences to the pipeline one by one rather than passing in the whole chunk.

The new script runs much faster: 15 GB now takes about 7.5 days, and 1.6 GB about 16 hours.

Reading with f.read() instead of f.readline() also improved the situation; I would say the new estimate for 15 GB of data is more realistic.

Furthermore, my current data has sentences separated by \n\n, which improved throughput to 37,000 it/s, a massive improvement over the 300 it/s I was getting for 1,000 rows of sentences.

with open('Data/data' + str(counter) + '.csv', 'r', encoding='utf-8') as f:
    inputText = f.read().rstrip()
    …

Answer selected by hermanpetrov