This project automates the process of scraping product data from Amazon's website using Selenium and BeautifulSoup. The collected data is stored in HTML files and later extracted into a CSV file for further analysis.
- Web Scraping with Selenium: Automatically browses Amazon's search results and saves the HTML content of each product listing.
- Data Extraction with BeautifulSoup: Parses the saved HTML files to extract product titles, prices, and links.
- Data Storage: Saves the extracted data into a CSV file for easy access and analysis.
- Python 3.x
- Selenium
- BeautifulSoup
- pandas
- Microsoft Edge WebDriver
Install the necessary Python packages using pip:

    pip install selenium beautifulsoup4 pandas

Download the Microsoft Edge WebDriver that matches your browser version.
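
A quick way to confirm the install step worked is to check that each package can be found by Python. The snippet below is a minimal sketch and not part of the project; `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced here. Note that the `beautifulsoup4` package is imported as `bs4`.

```python
import importlib.util

# Import names can differ from pip names: beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = ["selenium", "bs4", "pandas"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, re-run the `pip install` command in the same environment you use to run the scripts.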
Run the `project.py` script to scrape data from Amazon and save it as HTML files:

    python project.py

This script will store the HTML files in the `data` directory.
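
For orientation, the scraping step could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `project.py`: the search URL pattern, the `search_url` and `save_search_pages` helpers, and the output file naming are all hypothetical, and the sketch saves each full results page rather than individual listings.

```python
from pathlib import Path
from urllib.parse import urlencode


def search_url(query, page):
    """Build a search URL (assumed pattern; the real script may differ)."""
    return "https://www.amazon.com/s?" + urlencode({"k": query, "page": page})


def save_search_pages(query, pages, out_dir="data"):
    """Open each results page in Edge and save its HTML source to out_dir."""
    # Imported here so search_url() can be used without Selenium installed.
    from selenium import webdriver

    Path(out_dir).mkdir(exist_ok=True)
    driver = webdriver.Edge()
    try:
        for page in pages:
            driver.get(search_url(query, page))
            target = Path(out_dir, f"{query}_{page}.html")  # hypothetical naming
            target.write_text(driver.page_source, encoding="utf-8")
    finally:
        # Always close the browser, even if a page fails to load.
        driver.quit()


if __name__ == "__main__":
    save_search_pages("laptop", range(1, 20))
```

Closing the driver in a `finally` block avoids leaving stray browser windows open when a request fails partway through the run.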
Run the `collect.py` script to parse the HTML files and extract the product information into a CSV file:

    python collect.py

The extracted data will be saved in a `data.csv` file in the project directory.
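
The extraction step can be sketched as follows. This is a hedged example rather than the actual `collect.py`: the CSS selectors are placeholders that assume a particular page markup (Amazon's HTML changes over time), and `extract_products` and `collect` are hypothetical helper names.

```python
from pathlib import Path


def extract_products(html):
    """Return a list of {title, price, link} dicts found in one HTML document."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Assumed markup: each product sits in a <div> carrying a data-asin attribute.
    for card in soup.select("div[data-asin]"):
        title = card.select_one("h2")
        price = card.select_one("span.a-price span.a-offscreen")
        link = card.select_one("a[href]")
        if title is None:
            continue  # skip containers that are not product listings
        rows.append({
            "title": title.get_text(strip=True),
            "price": price.get_text(strip=True) if price else "",
            "link": link["href"] if link else "",
        })
    return rows


def collect(data_dir="data", out_csv="data.csv"):
    """Parse every saved HTML file and write the combined rows to a CSV file."""
    import pandas as pd

    rows = []
    for path in sorted(Path(data_dir).glob("*.html")):
        rows.extend(extract_products(path.read_text(encoding="utf-8")))
    pd.DataFrame(rows, columns=["title", "price", "link"]).to_csv(out_csv, index=False)


if __name__ == "__main__":
    collect()
```

Listings without a price are kept with an empty price field, so a missing value does not drop the whole row.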
├── project.py # Script to scrape data from Amazon and save as HTML files
├── data/ # Directory to store HTML files
├── collect.py # Script to extract data from HTML files and save to CSV
├── data.csv # Output CSV file with extracted data
└── README.md # Project documentation (this file)
- The script currently scrapes data from the first 19 pages of Amazon search results. You can adjust the number of pages by modifying the `for` loop in `project.py`.
- If you want to run the script again, first delete the `data` folder and the CSV file.
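
Both notes can be illustrated with a short sketch. It assumes the page loop is a plain `range` over page numbers, which may not match the actual loop in `project.py`; `page_numbers` and `reset_outputs` are hypothetical helpers, and the default paths follow the project layout described in this README.

```python
import shutil
from pathlib import Path


def page_numbers(last_page=19):
    """Pages 1..last_page inclusive; change last_page to scrape more or fewer."""
    return range(1, last_page + 1)


def reset_outputs(data_dir="data", csv_file="data.csv"):
    """Delete the output folder and CSV from an earlier run, if they exist."""
    shutil.rmtree(data_dir, ignore_errors=True)
    Path(csv_file).unlink(missing_ok=True)  # missing_ok needs Python 3.8+


if __name__ == "__main__":
    reset_outputs()
    for page in page_numbers():
        print("would scrape page", page)
```

Calling `reset_outputs()` is safe even on a fresh checkout, since missing paths are ignored.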