Web scraping tool that extracts data from a list of URLs. The tool uses the Selenium WebDriver to scrape the data and save it in a JSON file. The tool also takes a screenshot of the URL and saves it as a PNG file.
git clone https://github.com/ronnuriel/Web-Browser-Module.git
cd Web-Browser-Module
Run Docker container linux or mac
./run.sh
Run Docker container windows
./run.ps1
if you have permission issues, run the following command
chmod +x run.sh
Run the application without Docker
python browser_module.py
.
├── Dockerfile # Dockerfile for containerizing the application.
├── README.md # This README file.
├── browser_module.py # Main Python script for web scraping.
├── input # Directory containing input files.
│ └── urls.input # Text file with URLs to scrape.
├── output # Output directory for scraped data.
│ └── url_X # Each subdirectory contains results for a URL.
│ ├── browse.json # JSON file with scraped data.
│ └── screenshot.png # Screenshot of the URL.
├── requirements.txt # Python dependencies.
├── run.sh # Shell script to run the application.
└── testMain.py # Unit tests for the application.
'
clean output directory
rm -rf output/*
python -m pytest testMain.py