SWECC Email Scraper

A Python CLI tool for analyzing email data in mbox format.

Features

  • 📧 Process mbox format email archives
  • 🔧 Unix-style pipeline architecture for flexible processing
  • 📊 Extensible framework for building analysis pipelines
  • Coming soon: More analysis processors...

Installation

From PyPI

pip install swecc-email-scraper

From Source

git clone https://github.com/swecc-uw/swecc-email-scraper.git
cd swecc-email-scraper
pip install -e ".[dev]"  # Install with development dependencies

# Run tests
pytest

Quick Start

The tool uses Unix pipes to compose commands. Each command does one thing and can be combined with others:

  1. Basic usage - get email stats with the example processor:
swecc-email-scraper read mailbox.mbox \
  | swecc-email-scraper stats \
  | swecc-email-scraper format -f json > results.json
  2. List available processors:
swecc-email-scraper list-processors
  3. List available output formats:
swecc-email-scraper list-formats

Command Reference

Read Command

Reads an mbox file and outputs email data as JSON:

swecc-email-scraper read input.mbox > emails.json
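
For a sense of what this step involves, here is a minimal sketch of mbox-to-JSON conversion using Python's standard `mailbox` module. It is not the tool's actual implementation: the helper name `mbox_to_records` and every output field other than `sender` (which the jq example in this README relies on) are assumptions made for illustration.

```python
import json
import mailbox
import sys


def mbox_to_records(path):
    """Return one plain dict per message in the mbox file at `path`.

    Hypothetical helper: the field names are assumptions, not the tool's schema.
    """
    records = []
    # create=False raises an error for a missing file instead of creating one.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            records.append({
                "sender": str(message.get("From", "")),
                "subject": str(message.get("Subject", "")),
                "date": str(message.get("Date", "")),
            })
    finally:
        box.close()
    return records


if __name__ == "__main__" and len(sys.argv) > 1:
    # Write the records to stdout so the output can be piped onward.
    json.dump(mbox_to_records(sys.argv[1]), sys.stdout, indent=2)
```

Missing headers become empty strings here so that downstream steps can always treat the fields as text.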

Stats Command

Processes email data from stdin and outputs statistics:

cat emails.json | swecc-email-scraper stats > stats.json
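
As an illustration of what a processor in this position does, the sketch below reads a JSON array from stdin and writes summary statistics to stdout. The statistics shown (message count, unique senders, top senders) and the names `compute_stats` and `main` are assumptions for illustration; the real `stats` output may look different.

```python
import json
import sys
from collections import Counter


def compute_stats(emails):
    """Summarize a list of email dicts (assumed schema with a 'sender' key)."""
    senders = Counter(e.get("sender") or "" for e in emails)
    return {
        "total_emails": len(emails),
        "unique_senders": len(senders),
        "top_senders": senders.most_common(3),
    }


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Read JSON in, write JSON out: the contract every pipeline stage shares.
    json.dump(compute_stats(json.load(stdin)), stdout, indent=2)
```

Calling `main()` at the end of a script and piping JSON into it would put this sketch in the same position as the `stats` command.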

Format Command

Formats JSON data using the specified formatter:

cat stats.json \
  | swecc-email-scraper format -f json \
  > formatted.json
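
One common way to support a `-f NAME` option is a lookup table from format names to functions. The sketch below shows that pattern only; it is not taken from the tool's source. The `FORMATTERS` table, `format_output`, and the `text` format are all invented for illustration (this README only shows `json`).

```python
import json


def format_json(data):
    return json.dumps(data, indent=2, sort_keys=True)


def format_text(data):
    # Hypothetical plain-text format: one "key: value" line per top-level key.
    return "\n".join(f"{key}: {value}" for key, value in sorted(data.items()))


# Hypothetical registry mapping a format name to its formatter function.
FORMATTERS = {"json": format_json, "text": format_text}


def format_output(data, name="json"):
    """Look up the named formatter and apply it to already-parsed JSON data."""
    try:
        formatter = FORMATTERS[name]
    except KeyError:
        raise ValueError(
            f"unknown format {name!r}; choose from {sorted(FORMATTERS)}"
        ) from None
    return formatter(data)
```

With a table like this, adding a format means adding one function and one dictionary entry.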

Pipeline Examples

  1. Basic email statistics to terminal:
swecc-email-scraper read inbox.mbox \
  | swecc-email-scraper stats \
  | swecc-email-scraper format
  2. Save analysis to a file:
swecc-email-scraper read inbox.mbox \
  | swecc-email-scraper stats \
  > analysis.json
  3. Process with custom formatting:
swecc-email-scraper read inbox.mbox \
  | swecc-email-scraper stats \
  | swecc-email-scraper format -f json \
  > analysis.json
  4. Use with Unix tools:
# Filter emails before analysis
swecc-email-scraper read inbox.mbox \
  | jq 'map(select(.sender | contains("important")))' \
  | swecc-email-scraper stats
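
Where jq is not available, the same filter step can be written as a small Python script placed in the pipeline. This sketch mirrors the jq expression above and assumes, as that expression does, that the `read` output is a JSON array of objects with a string `sender` field. The names `filter_by_sender` and `main` are hypothetical.

```python
import json
import sys


def filter_by_sender(emails, needle):
    """Keep only emails whose 'sender' field contains the given substring."""
    # A missing or null sender is treated as an empty string and filtered out.
    return [e for e in emails if needle in (e.get("sender") or "")]


def main(stdin=sys.stdin, stdout=sys.stdout, needle="important"):
    # JSON array in, filtered JSON array out, ready for the next stage.
    json.dump(filter_by_sender(json.load(stdin), needle), stdout)
```

Unlike the jq version, this sketch drops records with a missing sender rather than raising an error on them.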

Extending the Tool

The tool is designed to be easily extensible. See CONTRIBUTING.md for detailed information on:

  • Creating custom processors
  • Adding new output formats
  • Contributing to the project
  • Development setup and guidelines

Architecture

The tool uses a Unix pipeline architecture where:

  1. The read command converts mbox files to JSON email data
  2. Processor commands (like stats) transform or analyze the data
  3. The format command handles output formatting
  4. Standard Unix pipes (|) connect the components
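
The same staged design can be sketched in-process as plain functions that pass JSON-compatible data along. All three stage functions below are hypothetical stand-ins, not the tool's internals. The point is only that each stage consumes the previous stage's output, as the pipes do on the command line.

```python
import json
from functools import reduce


def read_stage(_):
    # Stand-in for `read`: returns sample records instead of parsing an mbox.
    return [{"sender": "a@example.com"}, {"sender": "b@example.com"}]


def stats_stage(emails):
    # Stand-in for a processor such as `stats`.
    return {"total_emails": len(emails)}


def format_stage(data):
    # Stand-in for `format -f json`.
    return json.dumps(data, indent=2)


def run_pipeline(stages, initial=None):
    """Feed each stage's output into the next, like a chain of Unix pipes."""
    return reduce(lambda data, stage: stage(data), stages, initial)


print(run_pipeline([read_stage, stats_stage, format_stage]))
```

Because every stage takes and returns plain data, a stage can be swapped or inserted without touching the others, which is the property the pipe-based design relies on.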

License

MIT License - See LICENSE file for details.

Acknowledgments

Developed as part of SWECC Labs at the University of Washington.
