Guardian: Web Scraping and Article Extraction

The Guardian class is a Python utility that fetches the HTML content of a given URL and extracts the cleaned article text. It uses Selenium to load the page and BeautifulSoup to parse the HTML. It can be useful if you are building a news aggregator or a content analysis tool, or if you simply want the text of an article.

Installation

Make sure you have Python 3.x installed.
Install the required packages using pip:
```
pip install -r requirements.txt
```
Download geckodriver (the Firefox WebDriver) and add its location to your system PATH.
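
To confirm the last step worked, a quick check like the one below can help. It is a minimal sketch that uses only the Python standard library; the helper name `find_geckodriver` is hypothetical and is not part of this project.

```python
import shutil
from typing import Optional


def find_geckodriver(name: str = "geckodriver") -> Optional[str]:
    """Return the full path of the driver executable, or None if it is not on PATH.

    Hypothetical helper for illustration; it is not part of guardian.py.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_geckodriver()
    if path:
        print(f"geckodriver found at {path}")
    else:
        print("geckodriver was not found on PATH; add its directory to PATH as described above.")
```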

Usage

```python
from guardian import Guardian

# Create an instance of the Guardian class
guardian = Guardian()

# Example usage: Get article content from a URL
article_url = "https://www.theguardian.com/technology/2024/feb/29/former-crypto-director-banned-from-leaving-australia-after-blockchain-global-collapsed-owing-58m"
article_text = guardian.get(article_url)

if article_text:
    print("Article content:")
    print(article_text)
else:
    print("Error fetching article content. Please check the URL or try again later.")
```

Methods

`get(url: str) -> str`

Retrieves the cleaned article text from a given URL.

Args:
- url (str): The URL of the article.
Returns:
- str: The cleaned article text.
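
As a rough illustration of what "cleaned article text" can mean, the sketch below pulls the text out of `<p>` elements and normalizes whitespace. It deliberately uses the standard library's `html.parser` as a stand-in so that it runs without extra packages; the Guardian class itself uses BeautifulSoup, and its actual selectors and cleaning rules are not documented here. The names `ParagraphCollector` and `extract_paragraph_text` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside <p> elements (illustrative, not the project's parser)."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._current = []
        self.paragraphs = []

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:
                self._flush()  # previous <p> was never closed explicitly
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Text outside <p> (navigation, scripts, etc.) is ignored.
        if self._in_p:
            self._current.append(data)


def extract_paragraph_text(html: str) -> str:
    """Return paragraph text separated by blank lines."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if parser._in_p:
        parser._flush()  # document ended inside an unclosed <p>
    return "\n\n".join(parser.paragraphs)
```

Calling `extract_paragraph_text` on a page's HTML returns one string with a blank line between paragraphs, which matches the `str` return type described above.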

Error Handling

The Guardian class includes error handling for scenarios such as connection issues, missing article content, or invalid URLs. If an error occurs during fetching or parsing, appropriate error messages are returned.
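
Callers may still want their own guard around `get`. The sketch below is one possible approach, under two assumptions: that an empty or falsy result signals failure (as the usage example suggests), and that rejecting non-HTTP URLs up front is acceptable. The helpers `is_http_url` and `fetch_article` are hypothetical and not part of the Guardian class.

```python
from typing import Callable, Optional
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True if the string looks like an absolute http(s) URL."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_article(get: Callable[[str], Optional[str]], url: str) -> Optional[str]:
    """Call a fetcher such as Guardian().get and normalize failures to None.

    Assumption: a falsy return value means the article could not be extracted.
    """
    if not is_http_url(url):
        return None
    try:
        text = get(url)
    except Exception:
        # Deliberately broad for this sketch; narrow it to the exceptions
        # you actually expect in real code.
        return None
    return text or None
```

With the class from this project, this would be used as `fetch_article(Guardian().get, article_url)`, and the caller only has to check for `None`.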
