Seeking advice on creating a PDF to text extraction pipeline component #7549

SamEdwardes · 2021-03-24T15:58:07Z

I have been playing around with the idea of making a pipeline component that can support extracting text from a PDF. There are a few reasons I would like to do this:

I would like to annotate each token so I know which PDF page it came from.
I think it would be cool to be able to call a pipeline that takes a PDF file, extracts the text, and then performs all of the great NLP tasks already built into spaCy.

For reference, here is some pseudo code describing how I think you could use this:

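import spacy
from spacypdfreader import PDFreader  # pseudo code: the proposed component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("pdfreader", first=True)

# Pass a path to a PDF file instead of raw text
path_to_pdf_file = "~/data/document.pdf"
doc = nlp(path_to_pdf_file)

# Each token would record the page it was extracted from
print(doc[0]._.page_number)    # 1
print(doc[999]._.page_number)  # 3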

I have a few questions as a starting point:

Is it a good idea to use a pipeline component for this? It breaks the convention that a custom pipeline component should receive a Doc and return a Doc.
If not a pipeline component, I can imagine a function that receives a path to a PDF and returns a Doc. However, how can I then make it extensible so that it can be used with other pipelines? Maybe a function that takes a path to a PDF and an nlp object as parameters?
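
For illustration, here is a minimal sketch of the function idea from the last question. Everything named here is an assumption rather than an existing API: `read_pages`, `join_pages`, `page_number_at`, and `pdf_to_doc` are hypothetical helpers, and pypdf is just one possible choice of PDF extractor. The sketch joins the page texts into one string, remembers the character offset where each page starts, and uses each token's `idx` to look up its page.

```python
from bisect import bisect_right
from pathlib import Path


def read_pages(pdf_path):
    """Return one text string per page (pypdf is an assumed extractor)."""
    from pypdf import PdfReader  # assumption: any PDF library could be used

    reader = PdfReader(str(Path(pdf_path).expanduser()))
    return [page.extract_text() or "" for page in reader.pages]


def join_pages(pages, sep="\n\n"):
    """Join page texts; also return the character offset where each page starts."""
    starts = []
    offset = 0
    for page in pages:
        starts.append(offset)
        offset += len(page) + len(sep)
    return sep.join(pages), starts


def page_number_at(starts, char_offset):
    """Map a character offset in the joined text to a 1-based page number."""
    return bisect_right(starts, char_offset)


def pdf_to_doc(pdf_path, nlp, page_reader=read_pages):
    """Run any spaCy pipeline over a PDF and tag each token with its page."""
    # Imported here so the offset helpers above need only the standard library.
    from spacy.tokens import Token

    if not Token.has_extension("page_number"):
        Token.set_extension("page_number", default=None)

    text, starts = join_pages(page_reader(pdf_path))
    doc = nlp(text)
    for token in doc:
        # token.idx is the token's character offset in the joined text
        token._.page_number = page_number_at(starts, token.idx)
    return doc
```

Because `nlp` is a parameter, a call such as `pdf_to_doc("~/data/document.pdf", nlp)` would work with `spacy.load("en_core_web_sm")` or any other pipeline, and the Doc-in, Doc-out convention for components stays untouched. Whitespace tokens created by the page separator are attributed to the preceding page in this sketch.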
