SirProdigle/linkpatrol

🔗 LinkPatrol

A lightning-fast, concurrent web crawler and comprehensive link checker 🚀

LinkPatrol is a Go command-line tool that crawls a website and validates the links it finds. It combines concurrent workers with result caching, per-domain rate limiting, and bot-detection handling to check thousands of links efficiently, which makes it well suited to website health monitoring, SEO analysis, broken-link detection, and web accessibility auditing.

✨ Features

  • 🔍 Comprehensive Web Crawling: Crawls websites and extracts links from HTML, CSS, JavaScript, and JSON content
  • 🧪 Advanced Link Testing: Tests HTTP/HTTPS URLs, fragments, relative links, and handles bot detection
  • ⚡ High Performance: Concurrent processing with configurable worker pools and atomic URL claiming
  • 🎯 Smart Caching: Avoids re-checking previously validated links with thread-safe cache management
  • 🛡️ Intelligent Rate Limiting: Per-domain rate limiting that respects server resources
  • 🤖 Bot Detection: Identifies and handles bot-detection mechanisms (HTTP 429, 999, 403)
  • 🔄 HTTPS/HTTP Fallback: Automatically tries HTTP when HTTPS fails
  • 📊 Real-time Stats: Live monitoring of active workers, goroutines, and processing statistics
  • 🔧 Flexible Configuration: Command-line flags, environment variables, and config file support
  • 🎨 Beautiful Output: Color-coded results with dynamic terminal width detection and progress indicators
  • 🔗 Fragment Validation: Validates anchor links by checking for target elements in HTML
  • 🚫 Domain Filtering: Built-in banned domain and path filtering for security
  • 🎯 Comprehensive Link Detection: Supports 15+ different link pattern types including HTML, CSS, JavaScript, and JSON

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/sirprodigle/linkpatrol.git
cd linkpatrol

# Build the binary
go build -o linkpatrol

# Or install directly
go install github.com/sirprodigle/linkpatrol@latest

Basic Usage

# Check links on a website
./linkpatrol https://example.com

# Enable verbose output with detailed logging
./linkpatrol https://example.com -v

# Customize concurrency and rate limiting
./linkpatrol https://example.com -n 16 -r 20

# High-performance mode with custom timeout
./linkpatrol https://example.com -n 50 -r 50 --timeout 10s

📖 Usage Examples

Simple Link Check

# Check all links on a website
./linkpatrol https://example.com

Verbose Output with Detailed Logging

# Get detailed information about each link, worker activity, and processing steps
./linkpatrol https://example.com -v

High Performance Mode

# Use high concurrency for faster crawling of large websites
./linkpatrol https://example.com -n 100 -r 50 --timeout 30s

Custom Configuration

# Use custom timeout and conservative rate limiting
./linkpatrol https://example.com --timeout 10s -r 5 --no-truncate

Real-time Monitoring

# Monitor processing with live statistics (non-verbose mode shows real-time stats)
./linkpatrol https://example.com -n 25 -r 25

⚙️ Configuration

Command Line Options

Flag                Description                                      Default
target              Target URL to scan (positional argument)         (none)
-v, --verbose       Enable verbose logging with detailed output      false
-n, --concurrency   Max concurrent web crawlers and testers          50
--timeout           Per-request timeout                              30s
-r, --rate          Max requests per second per domain               20
--width             Terminal width override                          auto-detect
--no-truncate       Don't truncate URLs or error messages            false
-c, --config        Path to configuration file                       (none)
--cpuprofile        Write CPU profile to file                        (none)
--memprofile        Write memory profile to file                     (none)

Environment Variables

All flags can be set via environment variables with the LINKPATROL_ prefix:

export LINKPATROL_TARGET="https://example.com"
export LINKPATROL_VERBOSE="true"
export LINKPATROL_TIMEOUT="10s"
export LINKPATROL_CONCURRENCY="100"
export LINKPATROL_RATE="25"
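The naming convention behind these variables can be sketched with the standard library alone. The project credits Viper for configuration, so this is not how LinkPatrol itself reads its settings; `envName` and `valueOr` are hypothetical helpers that only illustrate the `LINKPATROL_` prefix mapping for the single-word flags shown above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const envPrefix = "LINKPATROL_"

// envName maps a flag name such as "rate" to "LINKPATROL_RATE".
func envName(flag string) string {
	return envPrefix + strings.ToUpper(flag)
}

// valueOr returns the environment value for flag if it is set,
// otherwise the supplied default.
func valueOr(flag, def string) string {
	if v, ok := os.LookupEnv(envName(flag)); ok {
		return v
	}
	return def
}

func main() {
	os.Setenv("LINKPATROL_RATE", "25")
	fmt.Println(envName("rate"), "=", valueOr("rate", "20"))
	fmt.Println(envName("timeout"), "=", valueOr("timeout", "30s"))
}
```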

Configuration File

Create a linkpatrol.yaml file in your project root:

target: "https://example.com"
verbose: true
concurrency: 50
timeout: 30s
rate: 20
width: 120
no-truncate: false

📊 Output Format

LinkPatrol provides clear, color-coded output:

🚀 LinkPatrol Starting ================================================================================================================================================================

🚶 Active Walkers: 0
🧪 Active Testers: 0
🌐 Domain Count: 0
⚡ Total Goroutines: 115
✅ Results Obtained: 27
📋 Results To Test: 0
📁 Paths To Walk: 0

🚀 Results ==================================================================================================================================================================
URL                                                                            Status   Emoji  Error
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
https://example.com                                                            Live     ✅      -
https://broken-link.com                                                        Dead     ❌      HTTP 404
https://slow-site.com                                                          Timeout  ⏰      context deadline exceeded
https://linkedin.com/in/user                                                   Bot      🤖      HTTP 999
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
📊 Total entries: 150
✨ All links are working!

Status Indicators

  • ✅ Live: Link is accessible and working
  • ❌ Dead: Link is broken or inaccessible (HTTP 4xx/5xx, other than the bot-detection codes below)
  • ⏰ Timeout: Request timed out
  • 🤖 Bot: Bot detection triggered (HTTP 429, 999, 403)
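One way to read the four indicators is as a small classification function. The sketch below is an assumption about the ordering of the checks (timeouts first, then bot-detection codes, then other error statuses); `classify` is a hypothetical name and the real tester may decide differently, for example for redirects.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// classify maps an HTTP status code and a request error to one of the
// four statuses shown in the results table.
func classify(status int, err error) string {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "Timeout"
	case err != nil:
		return "Dead" // any other request failure
	case status == 429 || status == 999 || status == 403:
		return "Bot" // checked before the generic 4xx/5xx case
	case status >= 400:
		return "Dead"
	default:
		return "Live" // assumption: everything below 400 counts as live
	}
}

func main() {
	fmt.Println(classify(200, nil))
	fmt.Println(classify(404, nil))
	fmt.Println(classify(999, nil))
	fmt.Println(classify(0, context.DeadlineExceeded))
}
```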

🔍 Supported Link Types

LinkPatrol uses advanced regex patterns to detect and validate various types of links:

HTML Links

  • Anchor tags: <a href="...">
  • Images: <img src="..."> and <img srcset="...">
  • Scripts: <script src="...">
  • Stylesheets: <link href="...">
  • Data sources: data-src, data-lazy-src
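As a rough illustration of regex-based extraction for the attributes above, the following sketch pulls quoted `href`, `src`, `data-src`, and `data-lazy-src` values out of an HTML string. The pattern and the name `extractLinks` are assumptions for illustration, not LinkPatrol's actual patterns; `srcset` is deliberately left out because its value is a comma-separated candidate list that needs separate parsing.

```go
package main

import (
	"fmt"
	"regexp"
)

// attrRe matches href, src, data-src and data-lazy-src attributes with
// single- or double-quoted values. Illustrative pattern only.
var attrRe = regexp.MustCompile(`(?i)\b(?:href|src|data-src|data-lazy-src)\s*=\s*["']([^"']+)["']`)

// extractLinks returns the attribute values in document order.
func extractLinks(html string) []string {
	var links []string
	for _, m := range attrRe.FindAllStringSubmatch(html, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	page := `<a href="/about">About</a><img src='logo.png'><script src="app.js"></script>`
	for _, l := range extractLinks(page) {
		fmt.Println(l)
	}
}
```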

CSS Links

  • Imports: @import "..."
  • URLs: url(...) in CSS properties
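The two CSS forms can be matched in a similar way. This is a minimal sketch under the assumption that quoted `@import` strings and `url(...)` values are handled by separate patterns; `extractCSSLinks` and both expressions are hypothetical and do not cover escapes or comments.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// importRe matches @import "file.css" and @import 'file.css'.
	importRe = regexp.MustCompile(`@import\s+["']([^"']+)["']`)
	// urlRe matches url(x), url("x") and url('x'), including @import url(...).
	urlRe = regexp.MustCompile(`url\(\s*["']?([^"')\s]+)["']?\s*\)`)
)

// extractCSSLinks returns quoted @import targets first, then url(...) values.
func extractCSSLinks(css string) []string {
	var links []string
	for _, re := range []*regexp.Regexp{importRe, urlRe} {
		for _, m := range re.FindAllStringSubmatch(css, -1) {
			links = append(links, m[1])
		}
	}
	return links
}

func main() {
	css := `@import "base.css"; .hero { background: url('img/bg.png'); }`
	fmt.Println(extractCSSLinks(css))
}
```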

JavaScript & JSON

  • JSON-LD: URLs in structured data
  • Raw HTTP/HTTPS: Direct URL references

Special Cases

  • Fragment links: #section (validated against page content)
  • Relative links: Resolved against base URL
  • Email links: mailto: addresses
  • Telephone links: tel: numbers
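Relative and fragment links can be handled with the standard `net/url` package, as in the sketch below. The helper names `resolve` and `hasFragmentTarget` are hypothetical, and the fragment check is a deliberately simple substring search for `id` or `name` attributes; a real checker would parse the HTML to avoid false positives such as `data-id="..."`.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolve turns a possibly relative link into an absolute URL, using the
// page it was found on as the base.
func resolve(base, link string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(ref).String(), nil
}

// hasFragmentTarget reports whether the HTML appears to contain an element
// whose id (or legacy name) attribute equals the fragment.
func hasFragmentTarget(html, fragment string) bool {
	fragment = strings.TrimPrefix(fragment, "#")
	if fragment == "" {
		return true // a bare "#" refers to the top of the page
	}
	for _, attr := range []string{"id", "name"} {
		for _, q := range []string{`"`, `'`} {
			if strings.Contains(html, attr+"="+q+fragment+q) {
				return true
			}
		}
	}
	return false
}

func main() {
	abs, _ := resolve("https://example.com/docs/guide.html", "../img/logo.png")
	fmt.Println(abs)
	fmt.Println(hasFragmentTarget(`<h2 id="install">Install</h2>`, "#install"))
}
```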

Security Features

  • Banned domains: static.cloudflareinsights.com
  • Banned paths: /wp-admin/, /wp-login.php, /cdn-cgi/
  • File filtering: Only follows HTML-like files for crawling
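The banned-domain and banned-path lists above amount to a filter applied before a URL is queued. The sketch below assumes an exact hostname match and a path-prefix match; whether LinkPatrol matches paths by prefix or by substring is not stated, and `isBanned` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

var (
	bannedDomains = map[string]bool{"static.cloudflareinsights.com": true}
	bannedPaths   = []string{"/wp-admin/", "/wp-login.php", "/cdn-cgi/"}
)

// isBanned reports whether a URL should be skipped by the crawler.
func isBanned(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return true // unparseable URLs are skipped in this sketch
	}
	if bannedDomains[strings.ToLower(u.Hostname())] {
		return true
	}
	for _, p := range bannedPaths {
		if strings.HasPrefix(u.Path, p) { // assumption: prefix match
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isBanned("https://static.cloudflareinsights.com/beacon.min.js"))
	fmt.Println(isBanned("https://example.com/wp-admin/index.php"))
	fmt.Println(isBanned("https://example.com/blog/"))
}
```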

🏗️ Architecture

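LinkPatrol is organized in layers: web walkers extract URLs, a worker pool schedules the work, and link testers check each URL, with a shared cache and per-domain rate limiters underneath: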

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Walkers   │────│  Worker Pool    │────│  Link Testers   │
│                 │    │                 │    │                 │
│ • HTML Parser   │    │ • Concurrency   │    │ • HTTP Clients  │
│ • Regex Engine  │    │ • Goroutines    │    │ • Bot Detection │
│ • URL Extraction│    │ • Channels      │    │ • HTTPS Fallback│
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Cache System  │
                       │                 │
                       │ • Atomic Claims │
                       │ • Thread Safety │
                       │ • Deduplication │
                       └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │  Rate Limiters  │
                       │                 │
                       │ • Per-Domain    │
                       │ • Token Bucket  │
                       │ • Respect Robots│
                       └─────────────────┘
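The "Per-Domain" and "Token Bucket" entries in the rate limiter box describe a well-known technique, sketched here with the standard library only. `DomainLimiter`, `NewDomainLimiter`, and the burst parameter are hypothetical names and settings chosen for illustration; the project's own limiter may be built differently.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket holds the current token count and the time of the last refill.
type bucket struct {
	tokens float64
	last   time.Time
}

// DomainLimiter keeps one token bucket per domain.
type DomainLimiter struct {
	mu      sync.Mutex
	rate    float64 // tokens added per second
	burst   float64 // maximum number of stored tokens
	buckets map[string]*bucket
}

func NewDomainLimiter(rate, burst float64) *DomainLimiter {
	return &DomainLimiter{rate: rate, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow reports whether a request to domain may proceed now and, if so,
// consumes one token from that domain's bucket.
func (l *DomainLimiter) Allow(domain string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	b, ok := l.buckets[domain]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[domain] = b
	}
	// Refill in proportion to the elapsed time, capped at the burst size.
	b.tokens += now.Sub(b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	lim := NewDomainLimiter(20, 2) // 20 requests/second, burst of 2 (illustrative)
	for i := 0; i < 3; i++ {
		fmt.Println("example.com allowed:", lim.Allow("example.com"))
	}
	fmt.Println("other.example allowed:", lim.Allow("other.example"))
}
```

Because each domain has its own bucket, a slow or heavily linked host does not consume the budget of any other host.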

🔧 Development

Prerequisites

  • Go 1.24.5 or higher
  • Git

Building from Source

# Clone the repository
git clone https://github.com/sirprodigle/linkpatrol.git
cd linkpatrol
# Build the binary
go build -o linkpatrol

Project Structure

linkpatrol/
├── internal/              # Internal packages
│   ├── app/              # Main application logic and orchestration
│   ├── cache/            # Thread-safe result caching with atomic operations
│   ├── config/           # Configuration management (flags, env vars, files)
│   ├── logger/           # Advanced logging with dynamic terminal formatting
│   ├── tester/           # Link testing with bot detection and fallback
│   ├── walker/           # Web crawling with comprehensive regex patterns
│   └── workers/          # Worker pool management and statistics
├── test_data/            # Test data for development and validation
└── main.go              # Application entry point with profiling support

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (go test ./...)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Cobra for CLI
  • Configuration management with Viper
  • File watching with fsnotify

📈 Performance

LinkPatrol is engineered for maximum speed and efficiency:

  • Concurrent Processing: Configurable worker pools for walkers and testers run simultaneously
  • Atomic URL Claiming: Thread-safe deduplication prevents redundant processing
  • Smart Caching: Avoids re-checking previously validated links with intelligent cache management
  • Per-Domain Rate Limiting: Respects server resources while maintaining optimal throughput
  • Memory Efficient: Stream processing with minimal memory footprint
  • Bot Detection: Handles anti-bot measures without disrupting legitimate crawling
  • Connection Pooling: Reuses HTTP connections for improved performance
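Concurrent processing and atomic URL claiming fit together naturally: many workers read from one channel, and a claim step ensures each URL is handled once. The sketch below uses `sync.Map.LoadOrStore` for the claim; the type `Claims` and the function `process` are hypothetical and stand in for the project's cache and worker packages.

```go
package main

import (
	"fmt"
	"sync"
)

// Claims records which URLs have already been taken by a worker.
type Claims struct{ seen sync.Map }

// Claim returns true only for the first caller that presents a given URL.
// LoadOrStore performs the check and the store as one atomic step.
func (c *Claims) Claim(u string) bool {
	_, loaded := c.seen.LoadOrStore(u, struct{}{})
	return !loaded
}

// process fans urls out to n workers and returns how many unique URLs
// were handled.
func process(urls []string, n int) int {
	var (
		claims  Claims
		wg      sync.WaitGroup
		mu      sync.Mutex
		handled int
	)
	jobs := make(chan string)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				if !claims.Claim(u) {
					continue // another worker already owns this URL
				}
				mu.Lock()
				handled++ // stand-in for the real link check
				mu.Unlock()
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return handled
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/a"}
	fmt.Println(process(urls, 4)) // prints 2: the duplicate is claimed only once
}
```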

Benchmarks

On a typical website with 1000+ links:

  • LinkPatrol: ~15-45 seconds (depending on concurrency settings)
  • Memory usage: <15MB for most websites
  • Concurrent connections: Up to 2000 idle connections with intelligent reuse
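Connection reuse in Go is configured on `http.Transport`. The sketch below shows how an idle-connection limit of 2000 could be expressed; the per-host limit and idle timeout are assumed values, and `newTransport` and `newClient` are hypothetical helpers rather than LinkPatrol's code.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newTransport returns a transport that keeps idle connections open so
// later requests to the same host can reuse them.
func newTransport() *http.Transport {
	return &http.Transport{
		MaxIdleConns:        2000,             // total idle connections kept across all hosts
		MaxIdleConnsPerHost: 50,               // assumption: per-host cap
		IdleConnTimeout:     90 * time.Second, // assumption: drop idle connections after 90s
	}
}

// newClient builds a client with a per-request timeout on top of the
// pooled transport.
func newClient(timeout time.Duration) *http.Client {
	return &http.Client{Transport: newTransport(), Timeout: timeout}
}

func main() {
	c := newClient(30 * time.Second)
	fmt.Println(c.Timeout, newTransport().MaxIdleConns)
}
```

Sharing one client across all workers is what allows the pool to be reused; creating a client per request would defeat it.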

🐛 Troubleshooting

Common Issues

Slow Performance

  • Increase concurrency and the per-domain rate: -n 100 -r 50
  • Raise the rate limit further for faster scanning: -r 100
  • Use profiling to identify bottlenecks: --cpuprofile cpu.prof

Timeout Errors

  • Increase timeout: --timeout 60s
  • Check network connectivity and DNS resolution
  • Verify target servers are responsive
  • Consider whether bot detection is the cause (see Bot Detection Issues)

Bot Detection Issues

  • Look for 🤖 indicators in output
  • These are expected for some sites (LinkedIn, etc.)
  • Use -v flag to see detailed bot detection logs

Memory Issues

  • Reduce concurrency settings: -n 25
  • Monitor with memory profiling: --memprofile mem.prof
  • Check for memory leaks in long-running processes

Getting Help


Made with ❤️ by the LinkPatrol team
