Challenge accepted!
This repo automatically runs the scraper every 15 minutes with a cron job set up through GitHub Actions. Thanks to this wonderful blog post from Jason Etcovitch, which provided most of the Actions automation setup I pulled from for the workflow.
The CSV filenames are structured as `{two-hour time window}-{scrape timestamp}.csv`.
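For illustration, here is a rough sketch of a helper that would produce filenames in that shape. The function name and the exact timestamp formatting are assumptions for the example, not the repo's actual code:

```python
from datetime import datetime

def csv_filename(window_start: datetime, window_end: datetime, scraped_at: datetime) -> str:
    """Build a filename shaped like '{two-hour time window}-{scrape timestamp}.csv'.
    (Hypothetical helper; the real naming code may differ.)"""
    window = f"{window_start:%Y-%m-%d_%H%M}-{window_end:%H%M}"
    return f"{window}-{scraped_at:%Y-%m-%d_%H%M%S}.csv"

# Example: an 08:00-10:00 window scraped at 08:28 on 2020-12-01
# -> "2020-12-01_0800-1000-2020-12-01_082800.csv"
```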
- 2020-11-30 21:07: Data now includes the time window as a column as well as the scrape time. The latest data is also stored in `latest.csv`. Corrected some data cleaning issues with characters being parsed incorrectly and newlines being included in the wait times column.
- 2020-11-30 21:14: Data moved to the `data` folder; a copy of `latest.csv` is included in the root dir.
- 2020-12-01 08:28: Added an `md5` hash of the PDF content to check whether the PDF is a new file. If it is not new, don't parse it or add a new CSV (see the hashing sketch after this list).
- 2020-12-01 11:20: Made the scraper more robust to changes in the PDF structure.
- 2020-12-05 3:00: Updated to account for the new PDF structure, including update times and increased resolution. Added Playwright to take a screenshot of the scraped PDF and include it in the docs for debugging (see the Playwright sketch after this list).
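
A minimal sketch of how the "only parse new PDFs" check could work, assuming the hash from the previous run is kept in a small text file. The file name `last_hash.txt` and the function name are assumptions, not the repo's actual layout:

```python
import hashlib

def pdf_is_new(pdf_bytes: bytes, last_hash_path: str = "last_hash.txt") -> bool:
    """Compare the md5 of the downloaded PDF against the hash stored from the
    previous run; only treat the PDF as new when the content changed.
    (Hypothetical helper for illustration.)"""
    new_hash = hashlib.md5(pdf_bytes).hexdigest()
    try:
        with open(last_hash_path) as f:
            old_hash = f.read().strip()
    except FileNotFoundError:
        old_hash = None
    if new_hash == old_hash:
        return False  # same PDF as last time; skip parsing and CSV output
    with open(last_hash_path, "w") as f:
        f.write(new_hash)
    return True
```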
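
And a rough sketch of the kind of Playwright helper that could capture the debugging screenshot, using the synchronous API. It assumes the scraped PDF (or a page displaying it) is reachable at a URL the browser can render; the function name, URL handling, and output path are hypothetical:

```python
from playwright.sync_api import sync_playwright

def screenshot_page(url: str, out_path: str = "docs/latest.png") -> None:
    """Render the page at `url` in headless Chromium and save a screenshot,
    so the scraped content can be eyeballed in the docs.
    (Sketch only; the actual workflow may drive Playwright differently.)"""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=out_path, full_page=True)
        browser.close()
```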