FIX #2617 Cheerio Web Crawler doesn't work with large sites #2678

Open
wants to merge 1 commit into
base: main

Conversation

ahmosman
Contributor

Hi,

This PR fixes issue #2617. There were two problems.

First, cheerioLoader sometimes fails to download a page's content and returns undefined, so I now fall back to an empty array when that happens.

Second, CheerioWebBaseLoader doesn’t support loading PDF files. Loading a PDF takes a long time and the content comes back encoded, so I believe it shouldn’t be loaded as a Document. I added a condition that skips PDF files.
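As a rough illustration of the PDF condition described above, here is a minimal sketch. The helper names `isPdfLink` and `withoutPdfLinks` are hypothetical and are not taken from this PR's diff; the sketch also assumes a link counts as a PDF when its path ends in `.pdf`, which will miss PDFs served from URLs without that extension.

```typescript
// Hypothetical helpers for illustration only; not the code from this PR.

// Treat a link as a PDF when its path (ignoring any query string or
// fragment) ends in ".pdf", case-insensitively.
function isPdfLink(link: string): boolean {
    const path = link.split(/[?#]/)[0]
    return /\.pdf$/i.test(path)
}

// Drop PDF links before they are handed to the page loader.
function withoutPdfLinks(links: string[]): string[] {
    return links.filter((link) => !isPdfLink(link))
}
```

Filtering the link list up front keeps the crawl loop unchanged; checking the response `Content-Type` header would be a stricter alternative, at the cost of one request per link.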

I’ve tried it on the same Chatflow as in the issue.

@HenryHengZJ
Contributor

Do you have an example site that we can test before and after this PR?

@ahmosman
Contributor Author

ahmosman commented Jun 19, 2024

Sure, I tested on this site: https://www.cupraofficial.pl. It takes about 10 minutes to scrape everything and upload it to Postgres VectorDB.

         }
         if (process.env.DEBUG === 'true') options.logger.info(`Finish ${relativeLinksMethod}`)
     } else if (selectedLinks && selectedLinks.length > 0) {
         if (process.env.DEBUG === 'true')
             options.logger.info(`pages: ${JSON.stringify(selectedLinks)}, length: ${selectedLinks.length}`)
         for (const page of selectedLinks.slice(0, limit)) {
-            docs.push(...(await cheerioLoader(page)))
+            docs.push(...((await cheerioLoader(page)) || []))
Contributor

Instead of adding the || [] fallback here, I think we should just modify the cheerioLoader function to return [] in its catch (err) block.
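A minimal sketch of that suggestion, assuming the loader wraps its fetch in a try/catch. `Doc`, `loadPageSafely`, `loadAll`, and the injected `load` callback are hypothetical stand-ins, not the project's actual code:

```typescript
// Minimal stand-in for the document type; the real one comes from the
// loader library and is only assumed to have this shape.
interface Doc {
    pageContent: string
    metadata: Record<string, unknown>
}

// Hypothetical wrapper: `load` stands in for whatever fetches one page.
// On failure it logs the error and returns [] instead of undefined.
async function loadPageSafely(url: string, load: (url: string) => Promise<Doc[]>): Promise<Doc[]> {
    try {
        return await load(url)
    } catch (err) {
        console.error(`Failed to load ${url}:`, err)
        return []
    }
}

// With the loader always returning an array, the call site can spread the
// result directly, with no fallback at each call.
async function loadAll(urls: string[], load: (url: string) => Promise<Doc[]>): Promise<Doc[]> {
    const docs: Doc[] = []
    for (const url of urls) {
        docs.push(...(await loadPageSafely(url, load)))
    }
    return docs
}
```

Putting the fallback in one place means every caller of the loader gets an array, rather than each call site having to remember the guard.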
