Crawl PDF files and scanned documents #161

Open
tpolo777 opened this issue Apr 21, 2018 · 3 comments

Comments

@tpolo777

tpolo777 commented Apr 21, 2018

I have run into a problem crawling deep-web PDF files and the scanned documents inside them. For instance, I have a number of PDFs (all on one scientific topic) that could be used as seeds, but I'm not sure about the harvest quality. I can convert my PDFs to text files, but the project's documentation says nothing about crawling PDFs or using text files as seeds. Is the ACHE crawler capable of crawling deep-web scientific data? If not, could you recommend some more advanced crawlers?

@aecio
Member

aecio commented Apr 27, 2018

Hi @tpolo777, apologies for the late response.

ACHE is able to download PDF files when it finds links to them, but it won't necessarily prioritize downloading such files. More specifically, all current ACHE page classifier implementations assume that pages are HTML pages and won't extract any text from PDF files, so the crawler has no way to identify that a PDF is relevant. Creating a new page classifier that is able to handle PDFs (identify the PDF mime-type and extract its text) would be enough to make it work.
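As a rough illustration of the "identify the PDF mime-type" step, here is a minimal sketch that uses only the Java standard library. The class `PdfSniffer` and its methods are hypothetical names, not part of ACHE; the sketch assumes the classifier has access to the `Content-Type` header value and the raw response bytes. It checks the header first and falls back to the `%PDF-` signature that PDF files start with, since servers sometimes send a generic content type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not an ACHE class.
class PdfSniffer {

    // PDF files begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** True if the Content-Type header value names the PDF media type. */
    static boolean hasPdfContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=..." before comparing.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }

    /** True if the body starts with the "%PDF-" signature. */
    static boolean hasPdfMagic(byte[] body) {
        if (body == null || body.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (body[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Header check first, then a fallback on the leading bytes. */
    static boolean isPdf(String contentType, byte[] body) {
        return hasPdfContentType(contentType) || hasPdfMagic(body);
    }
}
```

Text extraction itself would still need a PDF parsing library; this sketch only covers detection.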

I'm also not aware of any web crawler that crawls PDFs out-of-the-box.

@cslovell

I wonder how easy or hard it would be to pass all content through a Tika parser that always returns HTML?
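To make the idea concrete, here is a minimal sketch of a normalization step that always hands HTML to the rest of the pipeline. It deliberately does not call Tika: the extractor is a plain `Function<byte[], String>` that a real setup could back with a parsing library. The class `HtmlNormalizer` and its methods are hypothetical names, and the sketch assumes HTML bodies are UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch; these names are illustrative, not ACHE or Tika APIs.
class HtmlNormalizer {

    // Pluggable extractor for non-HTML bodies (assumption: supplied by the caller).
    private final Function<byte[], String> textExtractor;

    HtmlNormalizer(Function<byte[], String> textExtractor) {
        this.textExtractor = textExtractor;
    }

    /** Returns HTML for any input: HTML passes through, other content is wrapped. */
    String toHtml(String contentType, byte[] body) {
        boolean isHtml = contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("text/html");
        if (isHtml) {
            // Assumption: the page is UTF-8 encoded.
            return new String(body, StandardCharsets.UTF_8);
        }
        String text = textExtractor.apply(body);
        return "<html><body><pre>" + escape(text) + "</pre></body></html>";
    }

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With a step like this in front of them, classifiers that assume HTML input would not need to change.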

@aecio
Member

aecio commented Nov 3, 2018

I believe it shouldn't be very hard.
HTML parsing is done here: https://github.com/ViDA-NYU/ache/blob/17577ccc9a43121f722843ce914ab02f0538be41/src/main/java/focusedCrawler/crawler/async/FetchedResultHandler.java#L48-L62
For PDFs, we would basically need to detect PDF mime types, try to parse them, and store the parsed data.
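The three steps (detect, try to parse, store) could be sketched roughly as follows. This is not the code in `FetchedResultHandler`; the class `PdfAwareHandler`, the `PdfTextExtractor` hook, and the in-memory map are all hypothetical stand-ins, and a real version would plug in a PDF parsing library and the crawler's own storage.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the detect / parse / store flow; not ACHE code.
class PdfAwareHandler {

    /** Hypothetical extraction hook; a real one would call a PDF parsing library. */
    interface PdfTextExtractor {
        String extract(byte[] pdfBytes) throws Exception;
    }

    private final PdfTextExtractor extractor;
    // Stand-in for the crawler's storage: URL -> extracted text.
    private final Map<String, String> parsedTextByUrl = new LinkedHashMap<>();

    PdfAwareHandler(PdfTextExtractor extractor) {
        this.extractor = extractor;
    }

    /** Returns true only if the response was a PDF and its text was stored. */
    boolean handle(String url, String contentType, byte[] body) {
        // Step 1: detect the PDF mime type; anything else is left to the HTML path.
        if (!isPdfMimeType(contentType)) {
            return false;
        }
        try {
            // Step 2: try to parse.
            String text = extractor.extract(body);
            // Step 3: store the parsed data.
            parsedTextByUrl.put(url, text);
            return true;
        } catch (Exception e) {
            // Parsing failed (for example, a corrupt file): skip this document.
            return false;
        }
    }

    Optional<String> parsedText(String url) {
        return Optional.ofNullable(parsedTextByUrl.get(url));
    }

    static boolean isPdfMimeType(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/pdf");
    }
}
```

Catching the parse failure keeps one bad file from stopping the crawl, which is the "try to parse" part of the description above.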
