I'm facing the problem of crawling deep-web PDF files and scanned documents. For instance, I have a number of PDFs (all belonging to one scientific topic) that could be used as seeds, but I'm not sure of the harvest quality. I can convert my PDFs into text files, but the project's documentation says nothing about crawling PDFs or using text files as seeds. Is the ACHE crawler capable of crawling deep-web scientific data? If not, could you recommend some more advanced crawlers?
ACHE is able to download PDF files when it finds links to them, but it won't necessarily prioritize downloading such files. More specifically, all current ACHE page classifier implementations assume that pages are HTML and won't extract any text from PDF files, so the crawler has no way to identify that a PDF is relevant. Creating a new page classifier that can handle PDFs (detect the PDF MIME type and extract its text) would be enough to make it work.
I'm also not aware of any web crawler that crawls PDFs out-of-the-box.
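For anyone who wants to try this, here is a minimal sketch of the PDF-handling step, assuming Apache PDFBox (2.x) is on the classpath. It is not ACHE's actual page classifier interface (the class name `PdfTextExtractor` and the wiring are hypothetical); it only illustrates the MIME-type check and text extraction that a custom classifier would need before scoring the text with whatever relevance model is already in use for HTML pages.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/**
 * Standalone helper that detects PDF content and extracts its text so that
 * a relevance classifier (e.g. one trained on text from seed PDFs) can score it.
 * The classifier interface and scoring model are not shown; this covers only
 * the MIME-type check and text extraction step mentioned above.
 */
public class PdfTextExtractor {

    /** Returns true if the response looks like a PDF (by header or magic bytes). */
    public static boolean isPdf(String contentTypeHeader, byte[] body) {
        if (contentTypeHeader != null
                && contentTypeHeader.toLowerCase().contains("application/pdf")) {
            return true;
        }
        // Fall back to the "%PDF-" magic bytes at the start of the file.
        return body != null && body.length >= 5
                && body[0] == '%' && body[1] == 'P' && body[2] == 'D'
                && body[3] == 'F' && body[4] == '-';
    }

    /** Extracts plain text from the PDF bytes using the PDFBox 2.x API. */
    public static String extractText(byte[] body) throws IOException {
        try (PDDocument document = PDDocument.load(new ByteArrayInputStream(body))) {
            return new PDFTextStripper().getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] body = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(args[0]));
        if (isPdf("application/pdf", body)) {
            String text = extractText(body);
            // The extracted text could now be passed to a text classifier
            // (e.g. the same model used for HTML pages) to decide relevance.
            System.out.println(text.substring(0, Math.min(500, text.length())));
        }
    }
}
```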