Crawl PDF files and scanned documents #161

tpolo777 · 2018-04-21T14:35:09Z

I have run into the problem of crawling deep-web PDF files and the scanned documents inside them. For instance, I have a number of PDFs (all on one scientific topic) that could be used as seeds, but I'm not sure about the harvest quality. I can convert my PDFs into text files, but the project's documentation says nothing about crawling PDFs or using text files as seeds. Is the ACHE crawler capable of crawling deep-web scientific data? If not, could you recommend some more advanced crawlers?

aecio · 2018-04-27T18:53:43Z

Hi @tpolo777, apologies for the late response.

ACHE is able to download PDF files when it finds links to them, but it won't necessarily prioritize downloading such files. More specifically, all current ACHE page classifier implementations assume that pages are HTML and won't extract any text from PDF files, so ACHE has no way to identify that a PDF is relevant. Creating a new page classifier that can handle PDFs (identify the PDF MIME type and extract the text) would be enough to make it work.
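
To make the "identify the PDF MIME type" step concrete, here is a minimal sketch using only the Java standard library. `PdfDetector` and its methods are hypothetical names, not part of ACHE. Checking the `%PDF-` file signature in addition to the `Content-Type` header is an extra assumption, added for servers that report a generic type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper (not an ACHE class): decides whether a fetched response is a PDF. */
final class PdfDetector {

    // Every PDF file begins with this ASCII signature.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfDetector() {}

    /** True when the Content-Type header names application/pdf (parameters are ignored). */
    static boolean hasPdfContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }

    /** True when the body starts with the "%PDF-" signature. */
    static boolean hasPdfMagic(byte[] body) {
        if (body == null || body.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (body[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Assumption: either signal is enough to treat the response as a PDF. */
    static boolean isPdf(String contentType, byte[] body) {
        return hasPdfContentType(contentType) || hasPdfMagic(body);
    }
}
```

A PDF-aware classifier could call `PdfDetector.isPdf(...)` first and only then hand the bytes to a text extractor. The extraction itself needs a PDF parsing library and is not shown here.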

I'm also not aware of any web crawler that crawls PDFs out-of-the-box.

cslovell · 2018-10-31T17:47:46Z

I wonder how easy or hard it would be to pass all content through a Tika parser that always returns HTML?

aecio · 2018-11-03T21:26:12Z

I believe it shouldn't be very hard.
HTML parsing is done here: https://github.com/ViDA-NYU/ache/blob/17577ccc9a43121f722843ce914ab02f0538be41/src/main/java/focusedCrawler/crawler/async/FetchedResultHandler.java#L48-L62
For PDFs, we would basically need to detect PDF MIME types, try to parse them, and store the parsed data.
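
The three steps (detect, parse, store) could be sketched roughly as follows. Everything here is hypothetical and stands apart from ACHE's actual `FetchedResultHandler`: `TextExtractor`, `PdfAwareHandler`, and the in-memory map are illustrative stand-ins, and a real extractor would delegate to a PDF parsing library such as the Tika parser mentioned above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical extractor interface; a real one would wrap a PDF parsing library. */
interface TextExtractor {
    String extract(byte[] content) throws Exception;
}

/** Hypothetical handler (not ACHE code): detects PDFs, parses them, and stores the text. */
final class PdfAwareHandler {

    private final TextExtractor pdfExtractor;
    // Stand-in for the crawler's real storage layer.
    private final Map<String, String> textByUrl = new HashMap<>();

    PdfAwareHandler(TextExtractor pdfExtractor) {
        this.pdfExtractor = pdfExtractor;
    }

    /** Returns true only if the response was a PDF and its text was stored. */
    boolean handle(String url, String contentType, byte[] body) {
        // Step 1: detect the PDF MIME type; other types stay on the existing HTML path.
        if (!isPdf(contentType)) {
            return false;
        }
        try {
            // Step 2: try to parse. Step 3: store the parsed data.
            textByUrl.put(url, pdfExtractor.extract(body));
            return true;
        } catch (Exception e) {
            // A broken or encrypted PDF should not stop the crawl, so skip it.
            return false;
        }
    }

    Optional<String> storedText(String url) {
        return Optional.ofNullable(textByUrl.get(url));
    }

    private static boolean isPdf(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }
}
```

Keeping the extractor behind an interface means the parsing library can be swapped without touching the dispatch logic.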

aecio added the new-feature label May 14, 2018
