Create scraping.md #395

fgregg · 2024-12-20T21:01:15Z

Scraping guidance

This records the scraping guidance discussed in #393.

hancush

Couple of line edits, but otherwise this looks good. Thanks for capturing it, @fgregg!

hancush · 2025-01-09T19:06:25Z

etl/scraping.md

+## Libraries
+DataMade prefers that web scrapers use the [`scrapy` framework](https://scrapy.org/). Here's what we appreciate about scrapy:
+
+1. Fast. `scrapy` wants to parallel, and so can pull a lot of data very quickly.


Missing a word here, maybe?

hancush · 2025-01-09T19:07:31Z

etl/scraping.md

+1. Fast. `scrapy` wants to parallel, and so can pull a lot of data very quickly.
+2. Opinionated. `scrapy` scrapers expect files to be organized in particular ways. This is good for reviewing PRs.
+3. Popular. `scrapy` is the most popular scraping framework, so you can find lots of Q&As and extensions on the internet.
+4. Extensible. If you need a scraper that can run some JavaScript, you can stay within the `scrapy` framework and use middleware like [`scrapy-playwright`](https://github.com/scrapy-plugins/scrapy-playwright). If you need IP rotation or more advanced anti-bot circumvention, there is a good migration path from a normal `scrapy` script to [Zyte](https://scrapy-zyte-api.readthedocs.io/en/latest/).


Nice, thanks for including this resource!
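
For anyone following the `scrapy-playwright` pointer in item 4, here is a minimal sketch of the settings that plugin documents for enabling it, assuming the package is installed alongside `scrapy`. The helper `playwright_meta` is a hypothetical name used only for illustration; it is not part of the plugin or of the proposed guidance.

```python
# Fragment for a scrapy project's settings.py.
# Assumes scrapy-playwright is installed; these keys come from its README.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_meta(extra=None):
    """Hypothetical helper: build request meta that opts a request into browser rendering."""
    meta = {"playwright": True}
    meta.update(extra or {})
    return meta
```

A spider would then yield something like `scrapy.Request(url, meta=playwright_meta())` for pages that need JavaScript, and plain requests everywhere else.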

hancush · 2025-01-09T19:08:40Z

etl/scraping.md

+2. Tolerant error handling. With most of the serial, `requests`-based scrapers DataMade has written, an exception brings everything to a grinding halt and the process exits with a non-zero exit code. This is sometimes annoying, but it makes clear that something has gone wrong, and it is easy to incorporate into longer data pipelines that should continue or stop based on exit codes (Makefiles and GitHub Actions are two examples). `scrapy` has a different philosophy: every request you make could fail and the process would still exit successfully. If you want a scrape to succeed only when there are no errors, you need to affirmatively change a setting to exit on the first error, and [you need to do something like this](https://github.com/scrapy/scrapy/issues/1231#issuecomment-102409470) to change the exit code if there is an error.
+3. Cache eviction. If you use `scrapy`'s caching mechanism and want to remove a small number of cached responses, it's tricky to find the right files and remove them.
+
+We can put together a cookie cutter to help address some of those downsides.


Can you open an issue for this?
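
As a starting point for that issue, here is a rough sketch of the exit-code mitigation described in item 2. It is one possible approach, not necessarily what the linked comment does. It assumes the `CLOSESPIDER_ERRORCOUNT` setting is the "exit on the first error" switch and that the `log_count/ERROR` and `finish_reason` stats are a reasonable signal of failure. The names `exit_code_from_stats` and `run_spider` are hypothetical.

```python
def exit_code_from_stats(stats):
    """Return 1 if the crawl logged errors or stopped early, else 0."""
    errors = stats.get("log_count/ERROR", 0)
    reason = stats.get("finish_reason", "finished")
    return 1 if errors or reason != "finished" else 0


def run_spider(spider_cls):
    """Run one spider to completion and return a shell-style exit code."""
    # Imported lazily so exit_code_from_stats can be used without scrapy installed.
    from scrapy.crawler import CrawlerProcess

    # Assumption: closing after the first spider error is the behaviour we want.
    process = CrawlerProcess(settings={"CLOSESPIDER_ERRORCOUNT": 1})
    crawler = process.create_crawler(spider_cls)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes
    return exit_code_from_stats(crawler.stats.get_stats())
```

A runner script could end with `sys.exit(run_spider(MySpider))` so that Makefiles and GitHub Actions see a non-zero exit code when something went wrong.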
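
For the cache eviction downside in item 3, a cookie cutter could ship a small helper along these lines. This is a sketch that assumes the default filesystem cache storage, where each cached response lives in its own directory under `<cache dir>/<spider name>/<two-character prefix>/<fingerprint>/` and carries a `pickled_meta` file with the request URL. It does not handle gzip-compressed caches, and `evict_cached` is a hypothetical name.

```python
import pickle
import shutil
from pathlib import Path


def evict_cached(cache_dir, should_evict):
    """Delete cached responses whose request URL satisfies should_evict.

    Returns the list of evicted URLs.
    """
    evicted = []
    # Assumed layout: <cache_dir>/<spider>/<prefix>/<fingerprint>/pickled_meta
    # Materialize the matches first so we don't delete while still globbing.
    for meta_path in sorted(Path(cache_dir).glob("*/*/*/pickled_meta")):
        # Only unpickle cache files you wrote yourself.
        with meta_path.open("rb") as f:
            meta = pickle.load(f)
        url = meta.get("url", "")
        if should_evict(url):
            shutil.rmtree(meta_path.parent)
            evicted.append(url)
    return evicted
```

For example, `evict_cached(".scrapy/httpcache", lambda url: "/search?" in url)` would drop only the cached search pages and leave everything else in place, assuming the cache lives in the default location.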

Create scraping.md

9fac04d

Scraping guidance

fgregg requested a review from hancush December 20, 2024 21:01

hancush requested changes Jan 9, 2025

fgregg and others added 2 commits January 9, 2025 14:31

Update scraping.md

36adff2

Apply suggestions from code review

b531c64

Co-authored-by: hannah cushman garland <[email protected]>

fgregg mentioned this pull request Jan 9, 2025

Create cookie cutter for some mitigations for scrapy downsides #396

Open
