Add scrapling as a parser #1392

@D4Vinci

Hi, I like your project, and I see you are using BeautifulSoup and Parsel as parsers. There is a newer, faster parser you could add: Scrapling.

Scrapling has several new features, including a custom version of Camoufox that is more stable than Camoufox's Python interface. It also has its own parser, built on lxml like Parsel. Unlike Parsel, though, it is not limited to selecting elements by CSS/XPath selectors: it can also select elements by their text content, using lateral search or regular expressions, and it has a find function similar to the one in BS, but more powerful and several times faster.

It also provides a way to build self-healing spiders that adapt to website design changes without AI, and a method to find elements similar to ones already found, like AutoScraper, but much faster and better at it.

If these features are not convincing enough, here are two benchmarks from the documentation that compare it to other libraries:

Text Extraction Speed Test (5000 nested elements)

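| # | Library | Time (ms) | vs Scrapling |
|---|---------|-----------|--------------|
| 1 | Scrapling | 1.92 | 1.0x |
| 2 | Parsel/Scrapy | 1.99 | 1.036x |
| 3 | Raw Lxml | 2.33 | 1.214x |
| 4 | PyQuery | 20.61 | ~11x |
| 5 | Selectolax | 80.65 | ~42x |
| 6 | BS4 with Lxml | 1283.21 | ~668x |
| 7 | MechanicalSoup | 1304.57 | ~679x |
| 8 | BS4 with html5lib | 3331.96 | ~1735x |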

Element Similarity & Text Search Performance

| Library | Time (ms) | vs Scrapling |
|---------|-----------|--------------|
| Scrapling | 1.87 | 1.0x |
| AutoScraper | 10.24 | 5.476x |
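
For anyone checking the numbers: the "vs Scrapling" column appears to be each library's time divided by Scrapling's time in the same benchmark. Here is a minimal sketch that recomputes it from the timings quoted above, assuming that definition. `relative_slowdown` is a hypothetical helper name used only for illustration; it is not part of Scrapling or any other library.

```python
def relative_slowdown(times_ms, baseline):
    """Return each library's time as a multiple of the baseline's time.

    Assumption: the "vs Scrapling" column is time / baseline time.
    """
    base = times_ms[baseline]
    return {name: t / base for name, t in times_ms.items()}


# Timings (ms) copied from the two benchmark tables in this issue.
text_extraction_ms = {
    "Scrapling": 1.92,
    "Parsel/Scrapy": 1.99,
    "Raw Lxml": 2.33,
    "PyQuery": 20.61,
    "Selectolax": 80.65,
    "BS4 with Lxml": 1283.21,
    "MechanicalSoup": 1304.57,
    "BS4 with html5lib": 3331.96,
}
similarity_ms = {
    "Scrapling": 1.87,
    "AutoScraper": 10.24,
}

if __name__ == "__main__":
    for title, table in [
        ("Text extraction", text_extraction_ms),
        ("Similarity and text search", similarity_ms),
    ]:
        print(title)
        for name, ratio in relative_slowdown(table, "Scrapling").items():
            print(f"  {name:<20} {ratio:10.3f}x")
```

The ratios are kept as plain floats, so rounding is left to whoever prints them.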

It would be a solid addition to Crawlee as an extra. What do you think?
I'm the author of Scrapling, so if any changes are needed to make this happen, let me know.

Metadata

Labels: t-tooling (issues with this label are in the ownership of the tooling team)