Herald Rules Scraper

A web scraper to extract Herald rules from Phabricator in a machine-parseable format.

Overview

This tool extracts Herald rules from a Phabricator instance (specifically https://phabricator.services.mozilla.com/) and outputs them as structured JSON data with all PHIDs resolved to human-readable names.

Features

Extracts all Herald rules with conditions and actions
Resolves PHIDs to usernames, emails, and group names
Extracts group membership for reviewer groups
Outputs structured JSON with complete metadata
Uses Pydantic for data validation and type safety
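
As a rough illustration of the PHID resolution listed above, the following sketch substitutes readable names for PHIDs found in a piece of text. It is not the scraper's actual code: the function name resolve_phids, the lookup table, and the assumed PHID shape (a four-letter type code followed by 20 lowercase alphanumerics) are all hypothetical.

```python
import re

# Assumed PHID shape: "PHID-", a four-letter type code, then 20 lowercase
# alphanumerics. Adjust the pattern if the instance uses other forms.
PHID_PATTERN = re.compile(r"PHID-[A-Z]{4}-[a-z0-9]{20}")


def resolve_phids(text: str, names: dict[str, str]) -> str:
    """Replace each known PHID in text with its readable name.

    PHIDs missing from the lookup table are left untouched so that
    unresolved references stay visible in the output.
    """
    return PHID_PATTERN.sub(lambda m: names.get(m.group(0), m.group(0)), text)


# Hypothetical lookup table; a real run would obtain these names from
# Phabricator rather than hard-coding them.
USER_PHID = "PHID-USER-" + "a" * 20
GROUP_PHID = "PHID-PROJ-" + "b" * 20
names = {USER_PHID: "user-a", GROUP_PHID: "reviewer-group-name"}

resolved = resolve_phids(f"add {USER_PHID} and {GROUP_PHID}", names)
print(resolved)
```

Leaving unknown PHIDs in place, rather than dropping them, is one possible design choice; it keeps gaps in the lookup table easy to spot.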

Installation

# Install dependencies
pip install -r requirements.txt

# For development
pip install -r requirements-dev.txt

# Or install as a package
pip install -e .

Usage

herald-scraper \
  --url https://phabricator.services.mozilla.com \
  --conduit-token $CONDUIT_API_TOKEN \
  --pmo-cookie $PMO_COOKIE \
  [--max-pages P] \
  [--max-groups G] \
  [--max-rules R] \
  --output herald_rules.$(date -Iseconds).json

Get $CONDUIT_API_TOKEN from https://phabricator.services.mozilla.com/settings/user/YOUR_USERNAME/page/apitokens/

Get $PMO_COOKIE by logging in to https://people.mozilla.org/ and copying the value of the pmo-access cookie.
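
For callers that drive the tool from Python rather than a shell, the invocation above could be assembled roughly as follows. This is a sketch only: build_command is a hypothetical helper, the token and cookie values are placeholders, and the optional --max-* flags are left out. The flags and the output file naming are taken from the usage shown above.

```python
from collections.abc import Mapping
from datetime import datetime, timezone


def build_command(env: Mapping[str, str], now: datetime) -> list[str]:
    """Assemble the herald-scraper argument list from the usage above.

    Raises KeyError if either credential is missing from env.
    """
    # Same shape as `date -Iseconds`, e.g. 2026-01-21T12:00:00+00:00
    stamp = now.isoformat(timespec="seconds")
    return [
        "herald-scraper",
        "--url", "https://phabricator.services.mozilla.com",
        "--conduit-token", env["CONDUIT_API_TOKEN"],
        "--pmo-cookie", env["PMO_COOKIE"],
        "--output", f"herald_rules.{stamp}.json",
    ]


# Placeholder credentials for illustration; real values come from the
# environment as described above.
demo = build_command(
    {"CONDUIT_API_TOKEN": "placeholder-token", "PMO_COOKIE": "placeholder-cookie"},
    datetime(2026, 1, 21, 12, 0, 0, tzinfo=timezone.utc),
)
print(" ".join(demo))
```

The resulting list can be passed to subprocess.run(..., check=True) with os.environ as the env argument; passing a list avoids shell quoting issues with the cookie value.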

Development

Setup

# Install package with dev dependencies
pip install -e ".[dev]"

Running Tests

pytest

Code Formatting

black herald_scraper tests
ruff check herald_scraper tests

Type Checking

mypy herald_scraper

Data Model

The output JSON structure includes:

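rules: List of Herald rules with conditions and actions
groups: Dictionary of reviewer groups with their members
github_users: Mapping from Phabricator username to GitHub username and user ID
unresolved_users: List of Phabricator users without a resolved GitHub account, with the reason and where each is referenced
metadata: Information about the extraction (timestamp, counts, instance, scrape status)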

Example JSON output

{
  "rules": [
    {
      "id": "H123",
      "name": "Rule Name",
      "author": "username",
      "status": "active",
      "type": "differential-revision",
      "conditions": [
        {
          "type": "repository",
          "operator": "is-any-of",
          "value": ["mozilla-central", "firefox-autoland"]
        },
        {
          "type": "differential-diff-content",
          "operator": "matches-regexp",
          "value": "^path/to/.*"
        }
      ],
      "actions": [
        {
          "type": "add-reviewers",
          "reviewers": [
            {
              "target": "reviewer-group-name",
              "blocking": true
            },
            {
              "target": "individual-user",
              "blocking": false
            }
          ]
        }
      ]
    }
  ],
  "groups": {
    "reviewer-group-name": {
      "id": "reviewer-group-name",
      "display_name": "Reviewer Group Name",
      "members": ["user-a", "user-b", "user-c"]
    }
  },
  "github_users": {
    "user-a": {
      "username": "github-user-a",
      "user_id": 11111111
    },
    "user-b": {
      "username": "github-user-b",
      "user_id": 22222222
    }
  },
  "unresolved_users": [
    {
      "phabricator_username": "user-c",
      "reason": "no_github_linked_or_not_found",
      "referenced_in": ["group:reviewer-group-name"]
    }
  ],
  "metadata": {
    "extracted_at": "2026-01-21T12:00:00Z",
    "total_rules": 123,
    "total_groups": 2,
    "total_users_resolved": 2,
    "total_users_unresolved": 1,
    "phabricator_instance": "phabricator.services.mozilla.com",
    "scrape_status": {
      "rules_complete": true,
      "groups_complete": true,
      "github_complete": true
    }
  }
}

Notes:

github_users is a single mapping from Phabricator username to a {username, user_id} object holding the GitHub username and user ID
GitHub info for rule authors, reviewers, and group members is looked up via github_users (avoids duplication)
groups.members is a simple list of usernames; GitHub info is in github_users
scrape_status in metadata enables resumable scraping
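
The notes above can be illustrated with a small consumer of the output file. The sketch below inlines a trimmed copy of the example JSON (with github_complete flipped to false purely for demonstration), joins group members against github_users, and lists the scrape phases still pending. The helper names github_members and pending_phases are hypothetical and not part of the package.

```python
import json

# Trimmed copy of the example output, inlined so the sketch runs without
# a real scrape file. github_complete is set to false for illustration.
SAMPLE = json.loads("""
{
  "groups": {
    "reviewer-group-name": {
      "id": "reviewer-group-name",
      "display_name": "Reviewer Group Name",
      "members": ["user-a", "user-b", "user-c"]
    }
  },
  "github_users": {
    "user-a": {"username": "github-user-a", "user_id": 11111111},
    "user-b": {"username": "github-user-b", "user_id": 22222222}
  },
  "metadata": {
    "scrape_status": {
      "rules_complete": true,
      "groups_complete": true,
      "github_complete": false
    }
  }
}
""")


def github_members(data: dict, group: str) -> dict:
    """Map each member of a group to a GitHub username, or None if unresolved."""
    users = data["github_users"]
    return {
        member: users[member]["username"] if member in users else None
        for member in data["groups"][group]["members"]
    }


def pending_phases(data: dict) -> list:
    """List the scrape_status flags that are not yet true."""
    status = data["metadata"]["scrape_status"]
    return [phase for phase, done in status.items() if not done]


members = github_members(SAMPLE, "reviewer-group-name")
pending = pending_phases(SAMPLE)
print(members)
print(pending)
```

With a real output file, json.load on the opened file would replace the inlined sample.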

Scripts

The scripts/ directory contains utility scripts for development and testing:

Fixture Collection

fetch_fixtures.py: Fetch Herald rule pages from Phabricator for test fixtures
- Requires authentication via PHABRICATOR_SESSION_COOKIE environment variable
- Can fetch specific rules, all rules, or a recommended diverse set
- Usage: python scripts/fetch_fixtures.py --rules H420 H422 H425

Analysis Scripts

analyze_listing.py: Analyze the Herald rules listing page (BeautifulSoup-based)
- Extracts rule IDs and metadata from listing HTML
- Identifies PHIDs and project references
- Suggests diverse rules to fetch for testing
- Usage: python scripts/analyze_listing.py

analyze_listing_simple.py: Simplified listing analysis (regex-based)
- Faster analysis using regular expressions
- Provides recommendations for diverse test fixtures
- Shows PHID types and project references
- Usage: python scripts/analyze_listing_simple.py

inspect_fixtures.py: Inspect structure of saved rule fixtures
- Shows rule IDs, titles, breadcrumbs
- Identifies rule types (Global, Personal, Object)
- Helps understand HTML structure for parser development
- Usage: python scripts/inspect_fixtures.py

extract_conditions_actions.py: Extract conditions and actions text
- Parses rule fixtures and extracts raw text sections
- Shows natural language structure of conditions and actions
- Useful for understanding parser requirements
- Usage: python scripts/extract_conditions_actions.py

analyze_html_structure.py: Detailed HTML structure analysis
- Examines HTML elements and their relationships
- Extracts regexp patterns and reviewer names
- Helps with parser implementation
- Usage: python scripts/analyze_html_structure.py
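
To give a feel for the regex-based style of listing analysis described above, here is a minimal sketch that pulls rule IDs out of listing HTML. It is not taken from the scripts themselves: the function name extract_rule_ids and the assumption that each rule is linked as href="/H<number>" are illustrative and may not match the real markup.

```python
import re

# Assumption: each rule in the listing is linked as href="/H<number>".
RULE_LINK = re.compile(r'href="/(H\d+)"')


def extract_rule_ids(html: str) -> list[str]:
    """Return the rule IDs found in html, de-duplicated in first-seen order."""
    return list(dict.fromkeys(RULE_LINK.findall(html)))


# Tiny hand-written sample standing in for a saved listing page.
sample = (
    '<a href="/H420">H420</a> '
    '<a href="/H422">H422</a> '
    '<a href="/H420">H420 again</a>'
)
rule_ids = extract_rule_ids(sample)
print(rule_ids)
```

A regex pass like this is quick but brittle against markup changes, which is presumably why a BeautifulSoup-based variant exists alongside it.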

License

Mozilla Public License 2.0
