Feature: Support for crawling dynamic javascript heavy site #10

indrajithi · 2024-06-14T16:13:57Z

Description:

Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.

Objectives:

Enable the crawler to fetch and parse content from JavaScript-heavy sites.
Use a headless browser to render JavaScript content. (explore playwright-python)
Ensure compatibility with the existing crawler structure and options.
Maintain the ability to switch between the default fetching method and the headless browser.
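
To make the last objective concrete, here is a minimal sketch of how the switch could look. All names in it (`CrawlOptions`, `use_headless`, `fetch_default`, `fetch_headless`, `select_fetcher`) are hypothetical and not part of the existing crawler; the two fetch functions are stand-ins that only return placeholder HTML.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout: nothing here exists in the crawler yet.


@dataclass
class CrawlOptions:
    # Off by default, so the existing fetching behaviour is unchanged.
    use_headless: bool = False


def fetch_default(url: str) -> str:
    """Stand-in for the crawler's current plain HTTP fetch."""
    return f"<html><!-- static fetch of {url} --></html>"


def fetch_headless(url: str) -> str:
    """Stand-in for a headless-browser render (for example via Playwright)."""
    return f"<html><!-- rendered fetch of {url} --></html>"


def select_fetcher(options: CrawlOptions) -> Callable[[str], str]:
    """Pick the fetch function once, so the rest of the crawl loop stays the same."""
    return fetch_headless if options.use_headless else fetch_default
```

Keeping the choice behind a single option means callers that never set `use_headless` keep getting the default fetcher.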

Design Considerations:

Single Headless Browser Instance:
- Use a single instance of a headless browser to handle multiple asynchronous requests, reducing resource consumption.
Concurrency Management:
- Utilize asyncio and a semaphore to manage concurrent requests within the same browser context.
- Integrate the asynchronous fetching logic with our existing web crawler structure.
Error Handling:
- Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
- Fall back to default fetching mode when there is an error with the headless browser. (keep the user informed)
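
The design considerations above could be sketched roughly as follows. This is only an assumption about how the pieces might fit together, not the agreed implementation: `fetch_with_fallback`, `crawl`, `crawl_with_playwright`, and `max_concurrency` are hypothetical names, and the `fallback` coroutine stands in for whatever the crawler's default fetch turns out to be. The Playwright part uses the `playwright.async_api` package and is imported lazily, so the concurrency and fallback logic can be exercised without a browser installed.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, List

logger = logging.getLogger("crawler")

# A fetcher is any coroutine function that maps a URL to HTML text.
Fetch = Callable[[str], Awaitable[str]]


async def fetch_with_fallback(
    url: str, render: Fetch, fallback: Fetch, sem: asyncio.Semaphore
) -> str:
    """Try the headless render under the semaphore; on error use the default fetch."""
    async with sem:
        try:
            return await render(url)
        except Exception as exc:
            # Keep the user informed about the fallback.
            logger.warning(
                "Headless fetch failed for %s (%s); using default fetch", url, exc
            )
    # The fallback runs after the semaphore slot has been released.
    return await fallback(url)


async def crawl(
    urls: List[str], render: Fetch, fallback: Fetch, max_concurrency: int = 5
) -> Dict[str, str]:
    """Fetch all URLs concurrently, with at most max_concurrency renders at once."""
    sem = asyncio.Semaphore(max_concurrency)
    pages = await asyncio.gather(
        *(fetch_with_fallback(u, render, fallback, sem) for u in urls)
    )
    return dict(zip(urls, pages))


async def crawl_with_playwright(
    urls: List[str], fallback: Fetch, max_concurrency: int = 5
) -> Dict[str, str]:
    """Render pages with one shared browser and one shared browser context."""
    # Lazy import: only needed when headless mode is actually used.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            context = await browser.new_context()

            async def render(url: str) -> str:
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle")
                    return await page.content()
                finally:
                    await page.close()  # one page per request, always closed

            return await crawl(urls, render, fallback, max_concurrency)
        finally:
            await browser.close()  # no zombie browsers
```

A call would look like `asyncio.run(crawl_with_playwright(urls, fallback=my_default_fetch))`, where `my_default_fetch` is a placeholder for an async wrapper around the existing fetch. Whether `networkidle` is the right wait condition is an open question for the spec.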

indrajithi · 2024-06-16T12:29:26Z

Blocked by #17

Mews · 2024-06-17T19:43:41Z

What does blocked mean? I'd like to work on this but do you think I shouldn't?

indrajithi · 2024-06-17T19:51:58Z

@Mews You can work on this. I wanted to complete #17 before picking this up, and I have merged the MR for that. Although there are a few more things to be done for #17, I believe this issue can be unblocked.

Since this is going to be a relatively big story, let us first discuss the approach and spec out the requirements and acceptance criteria.

Mews · 2024-06-17T19:59:56Z

Alright, makes sense. I'll work on the v1 milestone too then.
I'll pick this up when v1 gets released 👍

indrajithi · 2024-06-17T20:15:08Z

You can create issues for the things in #24 that you find interesting and pick them up. Meanwhile, I will spec out some details in this issue. Also, I think we should have this in v1.

indrajithi added the enhancement (New feature or request) label Jun 15, 2024

indrajithi added the blocked label Jun 16, 2024

indrajithi added this to the First major release v.1.0.0 milestone Jun 16, 2024

indrajithi removed the blocked label Jun 17, 2024

indrajithi assigned Mews Jun 17, 2024

indrajithi mentioned this issue Jun 17, 2024

First Major Release v1.0.0 #24

Open

25 tasks
