# LinkPatrol 🔗

A lightning-fast, concurrent web crawler and comprehensive link checker.
LinkPatrol is a high-performance, Go-based tool that crawls websites and validates every kind of link it finds. Concurrent processing, intelligent caching, per-domain rate limiting, and bot-detection handling let it check thousands of links efficiently, making it well suited to website health monitoring, SEO analysis, broken-link detection, and web accessibility auditing.
## Features

- 🌐 Comprehensive Web Crawling: Crawls websites and extracts links from HTML, CSS, JavaScript, and JSON content
- 🧪 Advanced Link Testing: Tests HTTP/HTTPS URLs, fragments, relative links, and handles bot detection
- ⚡ High Performance: Concurrent processing with configurable worker pools and atomic URL claiming
- 🎯 Smart Caching: Avoids re-checking previously validated links with thread-safe cache management
- 🛡️ Intelligent Rate Limiting: Per-domain rate limiting that respects server resources
- 🤖 Bot Detection: Identifies and handles bot-detection mechanisms (HTTP 429, 999, 403)
- 🔄 HTTPS/HTTP Fallback: Automatically tries HTTP when HTTPS fails
- 📊 Real-time Stats: Live monitoring of active workers, goroutines, and processing statistics
- 🔧 Flexible Configuration: Command-line flags, environment variables, and config file support
- 🎨 Beautiful Output: Color-coded results with dynamic terminal width detection and progress indicators
- 🔗 Fragment Validation: Validates anchor links by checking for target elements in HTML
- 🚫 Domain Filtering: Built-in banned domain and path filtering for security
- 🎯 Comprehensive Link Detection: Supports 15+ different link pattern types including HTML, CSS, JavaScript, and JSON
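The atomic URL claiming mentioned above can be sketched with Go's `sync.Map`. This is an illustration of the technique, not LinkPatrol's actual `internal/cache` code; the `claim` helper name is invented:

```go
package main

import (
	"fmt"
	"sync"
)

// claimed tracks URLs that some worker has already taken ownership of.
// sync.Map's LoadOrStore is an atomic check-and-set, so two goroutines
// can never both "win" the same URL.
var claimed sync.Map

// claim returns true only for the first goroutine to see url.
func claim(url string) bool {
	_, alreadyClaimed := claimed.LoadOrStore(url, struct{}{})
	return !alreadyClaimed
}

func main() {
	fmt.Println(claim("https://example.com/a")) // first claim wins: true
	fmt.Println(claim("https://example.com/a")) // duplicate: false
}
```

Because the check and the insert happen in one atomic operation, no mutex around a plain map is needed and no URL is ever crawled twice.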
## Installation

```bash
# Clone the repository
git clone https://github.com/sirprodigle/linkpatrol.git
cd linkpatrol

# Build the binary
go build -o linkpatrol

# Or install directly
go install github.com/sirprodigle/linkpatrol@latest
```

## Quick Start

```bash
# Check links on a website
./linkpatrol https://example.com

# Enable verbose output with detailed logging
./linkpatrol https://example.com -v

# Customize concurrency and rate limiting
./linkpatrol https://example.com -n 16 -r 20

# High-performance mode with custom timeout
./linkpatrol https://example.com -n 50 -r 50 --timeout 10s
```

## Usage Examples

```bash
# Check all links on a website
./linkpatrol https://example.com

# Get detailed information about each link, worker activity, and processing steps
./linkpatrol https://example.com -v

# Use high concurrency for faster crawling of large websites
./linkpatrol https://example.com -n 100 -r 50 --timeout 30s

# Use custom timeout and conservative rate limiting
./linkpatrol https://example.com --timeout 10s -r 5 --no-truncate

# Monitor processing with live statistics (non-verbose mode shows real-time stats)
./linkpatrol https://example.com -n 25 -r 25
```

## Command-Line Flags

| Flag | Description | Default |
|---|---|---|
| `target` | Target URL to scan (positional argument) | (none) |
| `-v, --verbose` | Enable verbose logging with detailed output | `false` |
| `-n, --concurrency` | Max concurrent web crawlers and testers | `50` |
| `--timeout` | Per-request timeout | `30s` |
| `-r, --rate` | Max requests per second per domain | `20` |
| `--width` | Terminal width override | auto-detect |
| `--no-truncate` | Don't truncate URLs or error messages | `false` |
| `-c, --config` | Path to configuration file | (none) |
| `--cpuprofile` | Write CPU profile to file | (none) |
| `--memprofile` | Write memory profile to file | (none) |
## Environment Variables

All flags can be set via environment variables with the `LINKPATROL_` prefix:

```bash
export LINKPATROL_TARGET="https://example.com"
export LINKPATROL_VERBOSE="true"
export LINKPATROL_TIMEOUT="10s"
export LINKPATROL_CONCURRENCY="100"
export LINKPATROL_RATE="25"
```

## Configuration File

Create a `linkpatrol.yaml` file in your project root:

```yaml
target: "https://example.com"
verbose: true
concurrency: 50
timeout: 30s
rate: 20
width: 120
no-truncate: false
```

## Output

LinkPatrol provides clear, color-coded output:
```
🚀 LinkPatrol Starting ==============================================
🚶 Active Walkers: 0
🧪 Active Testers: 0
🌐 Domain Count: 0
⚡ Total Goroutines: 115
✅ Results Obtained: 27
🔗 Results To Test: 0
🚶 Paths To Walk: 0

📊 Results ==========================================================
URL                            Status    Emoji   Error
─────────────────────────────────────────────────────────────────────
https://example.com            Live      ✅      -
https://broken-link.com        Dead      ❌      HTTP 404
https://slow-site.com          Timeout   ⏰      context deadline exceeded
https://linkedin.com/in/user   Bot       🤖      HTTP 999
─────────────────────────────────────────────────────────────────────
📊 Total entries: 150
✨ All links are working!
```
- ✅ Live: Link is accessible and working
- ❌ Dead: Link is broken or inaccessible (HTTP 4xx/5xx)
- ⏰ Timeout: Request timed out
- 🤖 Bot: Bot detection triggered (HTTP 429, 999, 403)
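The legend above amounts to a simple classification by HTTP status code. As a hedged sketch (the `classify` helper and its return strings are illustrative, not LinkPatrol's actual tester code):

```go
package main

import "fmt"

// classify maps an HTTP status code to one of the result categories
// from the legend above. Bot-detection codes (429, 999, 403) are
// checked first so they are not lumped in with ordinary 4xx errors.
func classify(status int) string {
	switch {
	case status == 429 || status == 999 || status == 403:
		return "Bot"
	case status >= 400:
		return "Dead"
	case status >= 200 && status < 400:
		return "Live"
	default:
		return "Unknown"
	}
}

func main() {
	for _, s := range []int{200, 404, 999} {
		fmt.Println(s, classify(s))
	}
	// 200 Live
	// 404 Dead
	// 999 Bot
}
```

Timeouts are not status-based; they surface as transport errors (e.g. `context deadline exceeded`) rather than a response code.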
## Link Detection

LinkPatrol uses advanced regex patterns to detect and validate various types of links:

- Anchor tags: `<a href="...">`
- Images: `<img src="...">` and `<img srcset="...">`
- Scripts: `<script src="...">`
- Stylesheets: `<link href="...">`
- Data sources: `data-src`, `data-lazy-src`
- Imports: `@import "..."`
- URLs: `url(...)` in CSS properties
- JSON-LD: URLs in structured data
- Raw HTTP/HTTPS: Direct URL references
- Fragment links: `#section` (validated against page content)
- Relative links: Resolved against base URL
- Email links: `mailto:` addresses
- Telephone links: `tel:` numbers

Built-in filtering rules:

- Banned domains: `static.cloudflareinsights.com`
- Banned paths: `/wp-admin/`, `/wp-login.php`, `/cdn-cgi/`
- File filtering: Only follows HTML-like files for crawling
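To make the regex approach concrete, here is a simplified stand-in for the attribute patterns above. It handles only `href`/`src`-style attributes; the real `walker` package uses many more patterns, and the `extractLinks` helper is invented for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefPattern matches href/src/data-src style attributes in HTML.
// A deliberately simplified version of the walker's pattern set.
var hrefPattern = regexp.MustCompile(
	`(?i)(?:href|src|data-src|data-lazy-src)\s*=\s*["']([^"']+)["']`)

// extractLinks returns every attribute URL found in an HTML snippet.
func extractLinks(html string) []string {
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(html, -1) {
		links = append(links, m[1]) // capture group 1 is the URL itself
	}
	return links
}

func main() {
	html := `<a href="https://example.com">x</a><img data-src="/lazy.png">`
	fmt.Println(extractLinks(html)) // [https://example.com /lazy.png]
}
```

Relative results like `/lazy.png` would then be resolved against the page's base URL (e.g. with `net/url`'s `ResolveReference`) before being queued for testing.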
## Architecture

LinkPatrol uses a sophisticated multi-layered architecture for optimal performance:

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Walkers   │◄──►│   Worker Pool   │◄──►│  Link Testers   │
│                 │    │                 │    │                 │
│ • HTML Parser   │    │ • Concurrency   │    │ • HTTP Clients  │
│ • Regex Engine  │    │ • Goroutines    │    │ • Bot Detection │
│ • URL Extraction│    │ • Channels      │    │ • HTTPS Fallback│
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │  Cache System   │
                       │                 │
                       │ • Atomic Claims │
                       │ • Thread Safety │
                       │ • Deduplication │
                       └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │  Rate Limiters  │
                       │                 │
                       │ • Per-Domain    │
                       │ • Token Bucket  │
                       │ • Respect Robots│
                       └─────────────────┘
```
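The per-domain token-bucket layer in the diagram can be approximated as follows. This is a minimal sketch, not LinkPatrol's implementation (which may well use `golang.org/x/time/rate`); the `bucket` and `limiterFor` names are invented:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is a minimal token bucket: up to cap tokens, refilled at rate/sec.
type bucket struct {
	mu     sync.Mutex
	tokens float64
	cap    float64
	rate   float64
	last   time.Time
}

func newBucket(rate, capacity float64) *bucket {
	return &bucket{tokens: capacity, cap: capacity, rate: rate, last: time.Now()}
}

// Allow spends one token if available, refilling based on elapsed time.
func (b *bucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.cap {
		b.tokens = b.cap
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// limiters gives each domain its own bucket, so one slow or strict
// host never throttles requests to the others.
var (
	limitersMu sync.Mutex
	limiters   = map[string]*bucket{}
)

func limiterFor(domain string) *bucket {
	limitersMu.Lock()
	defer limitersMu.Unlock()
	b, ok := limiters[domain]
	if !ok {
		b = newBucket(20, 5) // 20 req/s, burst of 5 (illustrative values)
		limiters[domain] = b
	}
	return b
}

func main() {
	b := limiterFor("example.com")
	allowed := 0
	for i := 0; i < 10; i++ {
		if b.Allow() {
			allowed++
		}
	}
	fmt.Println(allowed) // burst is capped at roughly the bucket capacity
}
```

A worker that gets `false` from `Allow` would sleep briefly or requeue the URL, which is how per-domain politeness coexists with high overall concurrency.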
## Development

### Requirements

- Go 1.24.5 or higher
- Git

### Building from Source

```bash
# Clone the repository
git clone https://github.com/sirprodigle/linkpatrol.git
cd linkpatrol

# Build the binary
go build -o linkpatrol
```

### Project Structure

```
linkpatrol/
├── internal/            # Internal packages
│   ├── app/             # Main application logic and orchestration
│   ├── cache/           # Thread-safe result caching with atomic operations
│   ├── config/          # Configuration management (flags, env vars, files)
│   ├── logger/          # Advanced logging with dynamic terminal formatting
│   ├── tester/          # Link testing with bot detection and fallback
│   ├── walker/          # Web crawling with comprehensive regex patterns
│   └── workers/         # Worker pool management and statistics
├── test_data/           # Test data for development and validation
└── main.go              # Application entry point with profiling support
```
## Contributing

We welcome contributions! Please see our Contributing Guide for details.

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Run the test suite (`go test ./...`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Performance

LinkPatrol is engineered for maximum speed and efficiency:
- Concurrent Processing: Configurable worker pools for walkers and testers run simultaneously
- Atomic URL Claiming: Thread-safe deduplication prevents redundant processing
- Smart Caching: Avoids re-checking previously validated links with intelligent cache management
- Per-Domain Rate Limiting: Respects server resources while maintaining optimal throughput
- Memory Efficient: Streams processing with minimal memory footprint
- Bot Detection: Handles anti-bot measures without disrupting legitimate crawling
- Connection Pooling: Reuses HTTP connections for improved performance
### Benchmarks

On a typical website with 1000+ links:
- LinkPatrol: ~15-45 seconds (depending on concurrency settings)
- Memory usage: <15MB for most websites
- Concurrent connections: Up to 2000 idle connections with intelligent reuse
## Troubleshooting

### Slow Performance

- Increase concurrency: `-n 100 -r 50`
- Raise the per-domain rate limit for faster scanning: `-r 100`
- Use profiling to identify bottlenecks: `--cpuprofile cpu.prof`

### Timeout Errors

- Increase the timeout: `--timeout 60s`
- Check network connectivity and DNS resolution
- Verify target servers are responsive
- Consider bot detection issues

### Bot Detection Issues

- Look for 🤖 indicators in the output
- These are expected for some sites (LinkedIn, etc.)
- Use the `-v` flag to see detailed bot-detection logs

### Memory Issues

- Reduce concurrency settings: `-n 25`
- Monitor with memory profiling: `--memprofile mem.prof`
- Check for memory leaks in long-running processes
## Support

- 📚 Documentation
- 🐛 Issue Tracker
- 💬 Discussions