From 4bc81654a10ce04bd1f74b61f339a2c90164ac1e Mon Sep 17 00:00:00 2001
From: Forest Gregg
Date: Fri, 13 Jan 2023 15:12:37 -0500
Subject: [PATCH] recommendation for adoption of github actions for scraping

---
 .../recommendation-of-adoption.md | 140 ++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 scraping/github-actions/recommendation-of-adoption.md

diff --git a/scraping/github-actions/recommendation-of-adoption.md b/scraping/github-actions/recommendation-of-adoption.md
new file mode 100644
index 0000000..420ba66
--- /dev/null
+++ b/scraping/github-actions/recommendation-of-adoption.md
@@ -0,0 +1,140 @@
+# Recommendation of Adoption: GitHub Actions Ongoing Scraping
+
+We recommend GitHub Actions as DataMade’s preferred scraping platform when ongoing scraping is required.
+
+
+## Proof of Concept and Pilot
+
+We have completed three projects that use GitHub Actions for scraping.
+
+* [Airline scraping](https://github.com/datamade/airline-points/)
+* [Florida Sex Offender Registry](https://github.com/datamade/florida-sex-offender-registry)
+* [Chicago Council Scrapers](https://github.com/datamade/chicago-council-scrapers/)
+
+
+## Prerequisite Skills
+
+GitHub Actions has its own YAML configuration syntax. Familiarity with this syntax is helpful, though there [is excellent documentation](https://docs.github.com/en/actions).
+
+
+## Maintenance and Cost Outlook
+
+GitHub Actions is a deeply integrated feature of GitHub, and it is not accessible from outside GitHub. Using GitHub Actions extensively will make it more difficult to switch to an alternative code repository in the future. GitLab has an equivalent feature, but the configuration syntax is not the same.
+
+It is unlikely that GitHub will stop supporting Actions, but it is possible that they will increase costs at some point in the future. GitHub Actions offers a way to use self-hosted runners, which mitigates that risk.
+For the airline scraping, we are currently using [cirun.io](https://cirun.io), a service that makes it easy to set up self-hosted action runners on various cloud platforms. We use cirun.io with Azure spot instances at a much lower cost than EC2.
+
+
+## Comparison to Existing Tools
+
+DataMade has used two other approaches for running ongoing scrapers: cron-scheduled processes on EC2 and Airflow. Here’s how those approaches compare to GitHub Actions.
+
+The only advantages that Airflow has over GitHub Actions are that it is easy to see the next scheduled run and that there is no limit on the length of time a job takes.
+
+The only advantage that EC2 has is that there is no limit on the length of time a job takes.
+
+| | EC2 Process | Airflow | GitHub Actions |
+|---|---|---|---|
+| view logs of a particular run | manually set up logging to disk; to view logs, we would SSH into the server and grep the logs | not possible? | integrated |
+| dashboard to see status of runs | no | yes | yes |
+| error logs | Sentry integration | Sentry integration | integrated |
+| deploy changes to scraper | CodeDeploy | CodeDeploy | git push |
+| manual dispatch | SSH into server | press-button | press-button |
+| see upcoming run | inspect crontab | integrated | inspect workflow file |
+| scheduling | crontab | Airflow Task definition | workflow syntax |
+| precise scheduling | yes | yes | can be delayed by a few minutes |
+| time limits on a scrape | no | no | 6 hours on GitHub-hosted runners and 24 hours on self-hosted |
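+
+As a rough illustration of the scheduling, manual dispatch, and time limit rows above, a minimal workflow sketch might look like the following. The file path, job name, cron expression, Python version, and the `scrape.py` script are hypothetical placeholders, not taken from any of the pilot projects.
+
+```yaml
+# .github/workflows/scrape.yml (hypothetical path and file name)
+name: scrape
+
+on:
+  schedule:
+    # Run daily at 06:00 UTC; scheduled runs can start a few minutes late.
+    - cron: "0 6 * * *"
+  # Adds a "Run workflow" button in the Actions tab for manual dispatch.
+  workflow_dispatch:
+
+jobs:
+  scrape:
+    runs-on: ubuntu-latest
+    # Stop the job shortly before the 6-hour limit on GitHub-hosted runners.
+    timeout-minutes: 350
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      # Hypothetical scraper entry point.
+      - run: python scrape.py
+```
+
+Setting `timeout-minutes` below the platform limit is one way to make the cutoff for an overlong scrape explicit in the workflow file.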