E-paper Scraper

This is a Python-based web scraper for extracting data from online newspapers. It has been tested on the Jugantor and Prothom Alo newspapers, with issues from 2012 to 2024, on a Windows 11 machine.

Installation

Prerequisites

Before installing the scraper, ensure you have the following prerequisites:

A Windows machine (Tested on Windows 11)
Python 3.12.0 installed on your system
Anaconda for installing some packages
NVIDIA CUDA Toolkit 12.4 for GPU accelerated processes
Firefox installed as your browser. Make sure to install it in C:\Program Files\Mozilla Firefox\
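
The checks above can be scripted before you start the installation. The following is a minimal sketch and not part of the repository: the function name `check_prerequisites` is hypothetical, the executable name `firefox.exe` inside the install directory is an assumption, and treating Python 3.12 as a minimum version (rather than an exact requirement) is also an assumption.

```python
import sys
from pathlib import Path

# Install directory taken from this README; the executable name is an assumption.
FIREFOX_EXE = Path(r"C:\Program Files\Mozilla Firefox") / "firefox.exe"


def check_prerequisites(firefox_exe=FIREFOX_EXE, required=(3, 12)):
    """Return a dict mapping each prerequisite name to True or False."""
    return {
        # Compare only (major, minor) against the required version.
        "python": sys.version_info[:2] >= required,
        # sys.platform is "win32" on all Windows versions, including 64-bit.
        "windows": sys.platform == "win32",
        "firefox": Path(firefox_exe).is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any entry reported as MISSING points to a prerequisite that should be installed or fixed first.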

Installation Steps

CUDA Toolkits

Only the NVIDIA GPUs listed on this page are supported for now. If your graphics card has CUDA cores, you can proceed with the setup. If not, contact the developer.

Make sure that the NVIDIA drivers are up to date.
Add Anaconda to your system's environment variables (PATH) and run the following commands in the command prompt.

conda install numba
conda install cudatoolkit

NOTE: If Anaconda is not added to the environment, navigate to the Anaconda installation directory, locate the Scripts directory, and open the command prompt there.
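
After the packages are installed, you may want to confirm that Numba can actually see a CUDA-capable GPU. The snippet below is a small sketch under that assumption; the helper name `cuda_status` is hypothetical and not part of this repository. It relies only on `numba.cuda.is_available()` and returns instead of raising when Numba or a usable GPU is missing.

```python
def cuda_status():
    """Return (numba_installed, cuda_available) without raising."""
    try:
        from numba import cuda
    except ImportError:
        # Numba itself is not installed in this environment.
        return False, False
    try:
        # True only if a supported GPU and a working driver are detected.
        return True, bool(cuda.is_available())
    except Exception:
        # Driver or toolkit problems are reported as "not available".
        return True, False


installed, available = cuda_status()
print(f"numba installed: {installed}, CUDA available: {available}")
```

If CUDA is reported as unavailable even though Numba is installed, recheck the driver version and the supported GPU list.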

Tesseract

Download the Tesseract OCR executable from here.
Install Tesseract OCR by following the installation instructions provided in the repository. Make sure to install it in C:\Program Files (x86)\Tesseract-OCR.
Open a command prompt or Anaconda prompt.
Navigate to the directory where you have cloned or downloaded the epaper-scraper repository.
Create and activate a virtual environment (optional but recommended):
```
python -m venv venv
venv\Scripts\activate
```
Install the required Python packages using pip:
```
pip install -r requirements.txt
```
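Test whether Tesseract OCR is installed correctly by opening a Python prompt and running:
```
import pytesseract
# Raises TesseractNotFoundError if the Tesseract executable cannot be found
print(pytesseract.get_tesseract_version())
```
If a version number is printed without errors, Tesseract OCR is installed successfully.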
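
If pytesseract cannot locate the Tesseract executable, for example because the install directory is not on PATH, it can be pointed at the executable explicitly. This is a minimal sketch, assuming the install directory given in this README and the executable name `tesseract.exe`; the helper name `configure_tesseract` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Install directory taken from this README; the executable name is an assumption.
TESSERACT_EXE = Path(r"C:\Program Files (x86)\Tesseract-OCR") / "tesseract.exe"


def configure_tesseract(exe=TESSERACT_EXE):
    """Point pytesseract at exe; return True only if both are present."""
    try:
        import pytesseract
    except ImportError:
        # The Python wrapper is missing; install it with pip first.
        return False
    if not Path(exe).is_file():
        # The Tesseract executable is not at the expected location.
        return False
    # pytesseract reads this module-level attribute when it invokes Tesseract.
    pytesseract.pytesseract.tesseract_cmd = str(exe)
    return True


if __name__ == "__main__":
    print("Tesseract configured:", configure_tesseract())
```

Setting the path in code keeps the setup independent of the PATH variable, at the cost of hard-coding a machine-specific location.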

Usage

There are two ways to use this software: with the GUI and without the GUI.

To use the epaper-scraper With GUI, follow these steps:

Run main.py from src, which launches a desktop application.

Use the interface to access the supported capabilities of the software.

Note: The GUI lacks advanced features which are available in the "Without GUI" version. The interface is being constantly updated to implement these features.

To use the advanced features of epaper-scraper Without GUI, follow these steps:

Double-click the start_firefox.bat file to run it. Alternatively, run the commands from cmd.txt. This will initialize a Firefox browser instance.
Call the functions you need and adjust their parameters in the Python files in src, then run them. Example:
```
python main.py
```
The scraper will start extracting data from the specified newspaper website and save it to the specified output directory.
