Description
Hi, I like your project, and I see you are using BeautifulSoup and Parsel as parsers. There is a newer and faster parser you might want to consider: Scrapling.
Scrapling has several new features, including a custom version of Camoufox that is more stable than Camoufox's Python interface. It also has its own parser, built on lxml, similar to Parsel. Unlike Parsel, though, it is not limited to CSS/XPath selectors: it can also select elements by their text content, using literal text search or regular expressions, and it has a `find` function similar to the one BS has, but more powerful and several times faster.
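To make the idea of selecting elements by their text content concrete, here is a minimal sketch using only the Python standard library. It is not Scrapling's API or implementation: the `TextMatcher` class and the `select_by_text_regex` helper are hypothetical names made up for this illustration, and the sketch ignores void elements such as `<br>`.

```python
import re
from html.parser import HTMLParser


class TextMatcher(HTMLParser):
    """Collect (tag, text) pairs whose text node matches a regex.

    Hypothetical helper for illustration only, not part of Scrapling.
    """

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.stack = []    # tags that are currently open
        self.matches = []  # (tag, text) pairs that matched the pattern

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.pattern.search(text):
            self.matches.append((self.stack[-1], text))


def select_by_text_regex(html, pattern):
    """Return (tag, text) pairs for elements whose text matches `pattern`."""
    parser = TextMatcher(pattern)
    parser.feed(html)
    parser.close()
    return parser.matches


html = "<ul><li>Price: $10</li><li>Out of stock</li><li>Price: $25</li></ul>"
matches = select_by_text_regex(html, r"\$\d+")
print(matches)
```

With CSS/XPath alone you would have to select every `li` and filter afterwards; text-based selection expresses the filter directly.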
It also provides a way to build self-healing spiders that adapt to website design changes without AI, and a method to find elements similar to ones already found, like AutoScraper does, but much faster.
If these features are not convincing enough, here are two benchmarks from the documentation that compare it to other libraries:
Text Extraction Speed Test (5000 nested elements)
# | Library | Time (ms) | vs Scrapling |
---|---|---|---|
1 | Scrapling | 1.92 | 1.0x |
2 | Parsel/Scrapy | 1.99 | 1.036x |
3 | Raw Lxml | 2.33 | 1.214x |
4 | PyQuery | 20.61 | ~11x |
5 | Selectolax | 80.65 | ~42x |
6 | BS4 with Lxml | 1283.21 | ~668x |
7 | MechanicalSoup | 1304.57 | ~679x |
8 | BS4 with html5lib | 3331.96 | ~1735x |
Element Similarity & Text Search Performance
Library | Time (ms) | vs Scrapling |
---|---|---|
Scrapling | 1.87 | 1.0x |
AutoScraper | 10.24 | 5.476x |
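For reference, the "vs Scrapling" column in both tables is each library's time divided by Scrapling's time in the same benchmark. A short sketch of that arithmetic for a few rows (the `slowdown` helper is a hypothetical name used only here; the times are copied from the tables above):

```python
def slowdown(time_ms, baseline_ms):
    """How many times slower a library is than the baseline."""
    return time_ms / baseline_ms


# Text extraction benchmark, baseline: Scrapling at 1.92 ms
text_extraction = {"Parsel/Scrapy": 1.99, "Raw Lxml": 2.33, "Selectolax": 80.65}
text_ratios = {
    name: round(slowdown(t, 1.92), 3) for name, t in text_extraction.items()
}

# Element similarity benchmark, baseline: Scrapling at 1.87 ms
autoscraper_ratio = round(slowdown(10.24, 1.87), 3)

print(text_ratios)
print(autoscraper_ratio)
```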
It would be a solid addition to Crawlee to have it as an extra. What do you think?
I'm the author of Scrapling, so if any modifications are needed to make this happen, let me know.