Skip to content

Webscraping ICC.Cricket.com and espncricinfo.com to scrape cricket data and storing it in AWS s3 and creating an RDS warehouse

Notifications You must be signed in to change notification settings

alphadatagamma/The-Cricket-Project---Advanced-Web-scraping-and-AWS-Cloud-Services

Repository files navigation

The Cricket project

Cricket is a religion in India and I am a big fan of the sport. A game with more than a Billion fans worldwide and a global industry worth more than 100s of Billions of dollars.

This love for the game and the determination to apply some my Data Science skills has led to the inception of this project.

Phase 1: Web scraping data from various sources like ICC.com ESPN Cricinfo, CricBuzz.com (Python, Beautiful Soup, RegEx, and AWS EC2)

Phase 2: Data Processing and Cleaning and storing as comma separated in AWS S3

Phase 3: Setting up ETL jobs to create Relational database using AWS Glue and AWS Redshift

Phase 1. Webscraping

Webscraping scripts for The Cricket Project

The following data sources are webscraped

  1. https://www.icc-cricket.com/ - Mens cricket Test, ODI and T-20 Team rankings
  2. https://www.icc-cricket.com/ - Mens cricket Test, ODI and T-20 Player rankings
  3. https://www.espncricinfo.com/ - Player id, name and country details for all capped players, and their capping date and opposition
  4. https://www.espncricinfo.com/ - For all the capped players(5740), we need details such as full name, their Date Of Birth, Place of Birth style of batting or bowling, whether the keep wickets, the role in the team(Bat/Bowl/All-rounder). Getting this data was particularly tricky as I had to extract each attribute separately, and combine it as a row of data for each player. The web scraping script runs for each player in the Capped players dataset to extract the attributes one by one. Doing it on the local machine takes a lot of time, so tried out various things : I tried using the Python
  • AWS EC2, AWS- Sagemaker, Google Colab, Data Bricks and noted the time for scraping 100 records. We will go with whatever wins. Also, will try to implement parallel processing to reduce runtime. - Multipurpose in python doen't seem to work
  • AWS sagemaker is the winner with 50 records in 20 seconds
  1. https://www.espncricinfo.com/ - Player stats - bowling and batting and fielding

About

Webscraping ICC.Cricket.com and espncricinfo.com to scrape cricket data and storing it in AWS s3 and creating an RDS warehouse

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published