python-web-scraper

A ready-to-use web scraper with zero dependencies — just Python 3. No pip install, no virtual env, no setup. Clone and scrape.

Quick Start

git clone https://github.com/LuciferForge/python-web-scraper.git
cd python-web-scraper

# Scrape a page
python3 scraper.py https://example.com

# Extract all links
python3 scraper.py https://example.com -s "a[href]" -a href

# Extract all headings
python3 scraper.py https://example.com -s "h2"

# Get prices from a product page
python3 scraper.py https://example.com -s ".price"

# JSON output
python3 scraper.py https://example.com -s "h2" --json

# CSV output
python3 scraper.py https://example.com -s "a[href]" -a href --csv

Features

Zero dependencies: uses only the Python standard library
CSS selectors: tag, .class, #id, tag.class, tag[attr], tag[attr=value], [attr]
Multiple output formats: text, JSON, CSV
Built-in rate limiting: configurable delay between requests
Retry with backoff: automatic retries on failure
Rotating user agents: varies the User-Agent header, which helps avoid basic bot detection
Crawl mode: follow links with depth control
Batch scraping: pass a file of URLs
No config files: everything via command-line flags
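
For readers curious how "retry with backoff" can work using only the standard library, here is a minimal sketch. It is not the code in scraper.py: the function name `fetch_with_retry`, its parameters, and the default values are all assumptions made for illustration.

```python
import time
import urllib.request


def fetch_with_retry(url, retries=3, backoff=1.0, timeout=10.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying failed requests with exponential backoff.

    Hypothetical helper: names and defaults are illustrative and are
    not taken from scraper.py. `retries` must be 0 or greater.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError as exc:
            # URLError, HTTPError, and socket timeouts all derive from OSError.
            last_error = exc
            if attempt < retries:
                # Wait backoff * 1, 2, 4, ... seconds before the next try.
                sleep(backoff * (2 ** attempt))
    raise last_error
```

The `opener` and `sleep` parameters exist only so the function can be exercised without network access or real waiting.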

Usage

Basic Scraping

# Full page text
python3 scraper.py https://news.ycombinator.com

# All links
python3 scraper.py https://news.ycombinator.com -s "a[href]" -a href

# All images
python3 scraper.py https://example.com -s "img" -a src

# Elements by class
python3 scraper.py https://example.com -s ".article-title"

# Elements by ID
python3 scraper.py https://example.com -s "#main-content"

Output Formats

# JSON (pipe to jq, save to file, etc.)
python3 scraper.py https://example.com -s "h2" --json

# CSV (open in Excel, import to database, etc.)
python3 scraper.py https://example.com -s "a[href]" -a href --csv

# Plain text (default)
python3 scraper.py https://example.com -s "p"
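
The exact JSON and CSV layout that scraper.py emits is not described in this README. As a rough illustration of how a list of extracted strings can be serialized with only the standard library, here is a minimal sketch. `to_json`, `to_csv`, and the `value` column name are hypothetical.

```python
import csv
import io
import json


def to_json(items):
    """Serialize extracted strings as a JSON array (assumed layout)."""
    return json.dumps(items, indent=2, ensure_ascii=False)


def to_csv(items, header="value"):
    """Serialize extracted strings as single-column CSV (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])
    for item in items:
        # csv.writer quotes fields that contain commas or quotes.
        writer.writerow([item])
    return buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand keeps values that contain commas or quotes intact when the file is opened in a spreadsheet.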

Crawling (Follow Links)

# Follow links 1 level deep
python3 scraper.py https://example.com --follow

# Follow links 3 levels deep, max 100 pages
python3 scraper.py https://example.com --follow -d 3 --max-pages 100

# Crawl with 2 second delay between requests
python3 scraper.py https://example.com --follow --delay 2.0
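
Depth-limited crawling is commonly implemented as a breadth-first traversal. The sketch below shows one way depth, page-count, and delay limits can interact. It is an illustration under stated assumptions, not scraper.py's implementation: `crawl`, the `get_links` callback, and the defaults are hypothetical, and the sketch does not restrict itself to the starting domain (whether scraper.py does is not stated here).

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, get_links, max_depth=1, max_pages=50,
          delay=0.0, sleep=time.sleep):
    """Breadth-first crawl with depth and page limits.

    `get_links(url)` is a caller-supplied function (hypothetical) that
    fetches a page and returns the href values found on it.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited and delay > 0:
            sleep(delay)  # pause between consecutive requests
        links = get_links(url)
        visited.append(url)
        if depth >= max_depth:
            continue  # page is scraped, but its links are not followed
        for href in links:
            # Resolve relative links and drop #fragments to avoid duplicates.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

With `max_depth=1` the start page and the pages it links to are visited, which matches the "1 level deep" wording above.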

Batch Scraping

Create a file with URLs (one per line):

# urls.txt
https://example.com/page1
https://example.com/page2
https://example.com/page3

python3 scraper.py urls.txt -s "h1" --json
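
A sketch of how a URL file like the one above could be read. The helper names `parse_url_list` and `load_targets` are hypothetical. Skipping blank lines and lines starting with `#`, and checking whether the argument is an existing file, are assumptions of this sketch rather than documented scraper.py behavior.

```python
import os


def parse_url_list(text):
    """Return the URLs in `text`, one per line.

    Blank lines and lines starting with '#' are skipped (an assumption
    of this sketch).
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


def load_targets(argument):
    """Treat `argument` as a URL file if it exists on disk, else as a URL."""
    if os.path.isfile(argument):
        with open(argument, encoding="utf-8") as handle:
            return parse_url_list(handle.read())
    return [argument]
```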

CSS Selector Reference

| Selector | Matches | Example |
|---|---|---|
| `tag` | Elements by tag name | `h2`, `p`, `div` |
| `.class` | Elements by class | `.price`, `.title` |
| `#id` | Elements by ID | `#header`, `#content` |
| `tag.class` | Tag with class | `div.article`, `span.highlight` |
| `tag[attr]` | Tag with attribute | `a[href]`, `img[src]` |
| `tag[attr=val]` | Tag with attribute value | `input[type=text]`, `a[rel=nofollow]` |
| `[attr]` | Any element with attribute | `[data-id]`, `[role]` |
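
To make the selector forms above concrete, here is a minimal sketch of matching them against start tags with the standard library's `html.parser`. It is not scraper.py's `_matches()` function, whose signature is not shown in this README. `matches` and `AttributeCollector` are hypothetical names, and the sketch handles only the single-element forms listed in the table (no descendant or comma-separated selectors).

```python
import re
from html.parser import HTMLParser

# tag, #id, .class, and [attr] / [attr=value], each part optional.
_SELECTOR = re.compile(
    r"^(?P<tag>[a-zA-Z][\w-]*)?"
    r"(?:#(?P<id>[\w-]+))?"
    r"(?:\.(?P<cls>[\w-]+))?"
    r"(?:\[(?P<attr>[\w-]+)(?:=(?P<val>[^\]]*))?\])?$"
)


def matches(selector, tag, attrs):
    """Return True if an element (tag name plus attribute dict) matches."""
    selector = selector.strip()
    found = _SELECTOR.match(selector)
    if not selector or found is None:
        return False
    parts = found.groupdict()
    if parts["tag"] and parts["tag"].lower() != tag.lower():
        return False
    if parts["id"] and attrs.get("id") != parts["id"]:
        return False
    if parts["cls"] and parts["cls"] not in (attrs.get("class") or "").split():
        return False
    if parts["attr"]:
        if parts["attr"] not in attrs:
            return False
        wanted = parts["val"]
        if wanted is not None and attrs[parts["attr"]] != wanted.strip("\"'"):
            return False
    return True


class AttributeCollector(HTMLParser):
    """Collect one attribute from every start tag matching a selector."""

    def __init__(self, selector, attribute):
        super().__init__()
        self.selector = selector
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if matches(self.selector, tag, attr_map) and self.attribute in attr_map:
            self.values.append(attr_map[self.attribute])
```

For example, feeding a page to `AttributeCollector("a[href]", "href")` gathers link targets, roughly what `-s "a[href]" -a href` asks for.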

Common Recipes

# Extract email addresses published as mailto: links
python3 scraper.py https://example.com -s "a[href]" -a href | grep mailto:

# Get all image URLs as JSON
python3 scraper.py https://example.com -s "img" -a src --json

# Scrape product names and prices
python3 scraper.py https://store.example.com -s ".product-name" --json
python3 scraper.py https://store.example.com -s ".product-price" --json

# Extract table data
python3 scraper.py https://example.com -s "td" --csv

# Crawl a blog and extract all article titles
python3 scraper.py https://blog.example.com -s "h2.post-title" --follow -d 2 --json
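
Recipes like the mailto one can be taken further with a few lines of post-processing. The sketch below turns a list of href values into a de-duplicated list of email addresses. `emails_from_hrefs` is a hypothetical helper, and it assumes the scraper's default text output is one value per line, which the `grep` recipe above suggests but this README does not state.

```python
from urllib.parse import unquote


def emails_from_hrefs(hrefs):
    """Return unique email addresses found in mailto: links, in order."""
    found = []
    for href in hrefs:
        href = href.strip()
        if not href.lower().startswith("mailto:"):
            continue
        # Drop the scheme and any ?subject=... query part.
        recipients = href[len("mailto:"):].split("?", 1)[0]
        # A mailto link may list several comma-separated recipients.
        for address in unquote(recipients).split(","):
            address = address.strip()
            if address and address not in found:
                found.append(address)
    return found
```

In a small script, the function could be fed the lines of `sys.stdin` so that it sits at the end of a shell pipeline in place of `grep`.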

Customization

The scraper is a single file (scraper.py). Common modifications:

Add a selector: The _matches() function handles CSS matching — extend it for new patterns
Add an output format: Add a function like output_json() / output_csv()
Change retry behavior: Modify fetch() — adjust retries, delay, timeout
Add headers: Edit the headers dict in fetch() for cookies, auth tokens, etc.
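
As an illustration of the "Add headers" modification, here is a minimal sketch of building a request with a rotating User-Agent plus custom headers using `urllib.request`. It does not reproduce the body of `fetch()` in scraper.py. `build_request`, `USER_AGENTS`, and the User-Agent strings themselves are made up for this example.

```python
import random
import urllib.request

# Illustrative placeholder strings, not the list used by scraper.py.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
]


def build_request(url, extra_headers=None):
    """Build a Request with a randomly chosen User-Agent.

    `extra_headers` (for example cookies or an Authorization token)
    override the defaults when keys collide.
    """
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
    }
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen()` in place of a plain URL string.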

Requirements

Python 3.6+
No external packages

License

MIT
