This is a Python-based web scraper for extracting data from online newspapers. It has been tested on the Jugantor and Prothom Alo newspapers, with issues from 2012 to 2024, on a Windows 11 machine.
Before installing the scraper, ensure you have the following prerequisites:
- A Windows machine (tested on Windows 11)
- Python 3.12.0 installed on your system
- Anaconda for installing some packages
- NVIDIA CUDA Toolkit 12.4 for GPU accelerated processes
- Firefox installed as your browser. Make sure to install it in C:\Program Files\Mozilla Firefox\
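Some of the prerequisites above can be checked with a short script. This is a minimal sketch and not part of the repository: the helper name check_prerequisites is hypothetical, and the Firefox directory is the one named in the list above. It assumes that any Python 3.12.x release is acceptable, although only 3.12.0 is stated as tested.

```python
import sys
from pathlib import Path

# Install location named in the prerequisites list.
FIREFOX_DIR = Path(r"C:\Program Files\Mozilla Firefox")


def check_prerequisites(firefox_dir=FIREFOX_DIR):
    """Return a dict of simple pass/fail checks (hypothetical helper)."""
    return {
        # Assumption: any 3.12.x interpreter behaves like the tested 3.12.0.
        "python_3_12": sys.version_info[:2] == (3, 12),
        # The scraper was only tested on Windows.
        "windows": sys.platform == "win32",
        # Firefox is expected in this exact directory.
        "firefox_dir_exists": Path(firefox_dir).is_dir(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

The script only reports what it finds; it does not install or change anything.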
Only NVIDIA GPUs are supported for now, and only those listed on this page. If your graphics card has CUDA cores, you can proceed with the setup. If not, contact the developer.
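Once numba is installed (see the conda commands in the steps that follow), one way to check whether a CUDA-capable GPU is visible from Python is sketched here. The helper name cuda_status is hypothetical and is not part of the scraper. It relies only on Numba's cuda.is_available() and cuda.get_current_device().

```python
def cuda_status():
    """Report whether Numba can see a CUDA-capable GPU (hypothetical helper)."""
    try:
        from numba import cuda  # provided by `conda install numba`
    except ImportError:
        return "numba is not installed"
    if not cuda.is_available():
        return "no CUDA-capable GPU or driver detected"
    name = cuda.get_current_device().name
    # Some Numba versions return the device name as bytes.
    if isinstance(name, bytes):
        name = name.decode()
    return f"CUDA device found: {name}"


print(cuda_status())
```

If the script reports that no device was detected, updating the NVIDIA drivers is a reasonable first thing to try.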
- Make sure that the NVIDIA drivers are up to date.
- Add Anaconda to your PATH environment variable, then run the following commands in the command prompt:
conda install numba
conda install cudatoolkit
NOTE: If Anaconda has not been added to the environment, navigate to the Anaconda installation directory, locate the Scripts directory, and open the command prompt there.
- Download the Tesseract OCR executable from here.
- Install Tesseract OCR by following the installation instructions provided in the repository. Make sure to install it in C:\Program Files (x86)\Tesseract-OCR.
- Open a command prompt or Anaconda prompt.
- Navigate to the directory where you have cloned or downloaded the epaper-scraper repository.
- Create and activate a virtual environment (optional but recommended):
python -m venv venv
venv\Scripts\activate
- Install the required Python packages using pip:
pip install -r requirements.txt
- Test if Tesseract OCR is installed correctly by opening a Python prompt and running:
import pytesseract
print(pytesseract)
If you don't encounter any errors, the pytesseract package is installed successfully. The import alone does not confirm that the Tesseract executable itself can be found.
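A somewhat stronger check than the import is to ask pytesseract for the version of the Tesseract executable, which fails if the executable cannot be run. The following is a minimal sketch under stated assumptions: the helper names in_virtualenv and tesseract_status are hypothetical, and the executable is assumed to be named tesseract.exe inside the installation directory given above. Whether the scraper itself sets tesseract_cmd this way is not stated here.

```python
import sys
from pathlib import Path

# Install directory named in the installation steps; the file name is an assumption.
TESSERACT_EXE = Path(r"C:\Program Files (x86)\Tesseract-OCR") / "tesseract.exe"


def in_virtualenv():
    """True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


def tesseract_status(exe=TESSERACT_EXE):
    """Describe whether pytesseract can reach the Tesseract executable."""
    try:
        import pytesseract
    except ImportError:
        return "pytesseract is not installed"
    if Path(exe).is_file():
        # Point the wrapper at the executable explicitly instead of relying on PATH.
        pytesseract.pytesseract.tesseract_cmd = str(exe)
    try:
        version = pytesseract.get_tesseract_version()
    except pytesseract.TesseractNotFoundError:
        return "Tesseract executable not found"
    return f"Tesseract {version} is reachable"


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print(tesseract_status())
```

Run it from the same prompt in which the virtual environment was activated, so that the result reflects the packages installed by pip.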
There are two ways to use this software: With GUI and Without GUI.
To use the epaper-scraper With GUI, follow these steps:
- Run main.py from src, which will launch a desktop application like the following one:
- Navigate through the interface to use the supported capabilities of the software.
Note: The GUI lacks advanced features which are available in the "Without GUI" version. The interface is being constantly updated to implement these features.
To use the advanced features of epaper-scraper Without GUI, follow these steps:
- Run the start_firefox.bat file by double-clicking it. Alternatively, run the commands from cmd.txt. This will initialize a Firefox browser instance.
- Call functions and adjust parameters in the Python files in src, then run them. Example: python main.py
- The scraper will start extracting data from the specified newspaper website and save it to the specified output directory.
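The scraper's own function names and parameters are not listed in this README, so the following is only an illustrative sketch of how a date range and an output directory might be prepared before calling them. The names daily_dates and output_path, the newspaper key "jugantor", and the directory layout are all hypothetical.

```python
from datetime import date, timedelta
from pathlib import Path


def daily_dates(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def output_path(root, newspaper, day):
    """Build one directory per issue: root/newspaper/year/ISO date (hypothetical layout)."""
    return Path(root) / newspaper / str(day.year) / day.isoformat()


if __name__ == "__main__":
    # Three sample days; a full run over 2012 to 2024 would use a wider range.
    for day in daily_dates(date(2012, 1, 1), date(2012, 1, 3)):
        print(output_path("output", "jugantor", day))
```

Keeping the date loop separate from the path logic makes it easy to resume an interrupted run from a later start date.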