| 1 | +# AI-Powered Link Checker Action |
| 2 | + |
| 3 | +This GitHub Action scans HTML files for web links and validates them, providing AI-powered suggestions for improvements. It's designed to replace traditional link checkers like `lychee` with enhanced functionality that not only detects broken links but also suggests better alternatives using AI-driven analysis. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- **Smart Link Validation**: Checks external web links in HTML files with configurable timeout and redirect handling |
| 8 | +- **Enhanced Robustness**: Intelligent detection of bot-blocked sites to reduce false positives |
| 9 | +- **AI-Powered Suggestions**: Provides intelligent recommendations for broken or redirected links |
| 10 | +- **Two Scanning Modes**: Full project scan or PR-specific changed files only |
| 11 | +- **Configurable Status Codes**: Define which HTTP status codes to silently report (e.g., 403, 503) |
| 12 | +- **Redirect Detection**: Identifies and suggests updates for redirected links |
| 13 | +- **GitHub Integration**: Creates issues, PR comments, and workflow artifacts |
| 14 | +- **MyST Markdown Support**: Works with Jupyter Book projects by scanning HTML output |
| 15 | +- **Performance Optimized**: Respectful rate limiting, improved timeouts, and efficient scanning |
| 16 | + |
| 17 | +## Usage |
| 18 | + |
| 19 | +### Basic Usage |
| 20 | + |
| 21 | +```yaml |
| 22 | +- name: Check links in documentation |
| 23 | +  uses: QuantEcon/meta/.github/actions/link-checker@main |
| 24 | +``` |
| 25 | + |
| 26 | +### Weekly Full Project Scan |
| 27 | + |
| 28 | +```yaml |
| 29 | +name: Weekly Link Check |
| 30 | +on: |
| 31 | +  schedule: |
| 32 | +    - cron: '0 9 * * 1' # Monday at 9 AM UTC |
| 33 | +  workflow_dispatch: |
| 34 | + |
| 35 | +jobs: |
| 36 | +  link-check: |
| 37 | +    runs-on: ubuntu-latest |
| 38 | +    permissions: |
| 39 | +      contents: read |
| 40 | +      issues: write |
| 41 | +    steps: |
| 42 | +      - uses: actions/checkout@v4 |
| 43 | +        with: |
| 44 | +          ref: gh-pages # Check the published site |
| 45 | + |
| 46 | +      - name: AI-powered link check |
| 47 | +        uses: QuantEcon/meta/.github/actions/link-checker@main |
| 48 | +        with: |
| 49 | +          html-path: '.' |
| 50 | +          mode: 'full' |
| 51 | +          fail-on-broken: 'false' |
| 52 | +          create-issue: 'true' |
| 53 | +          ai-suggestions: 'true' |
| 54 | +          silent-codes: '403,503' |
| 55 | +          issue-title: 'Weekly Link Check Report' |
| 56 | +          notify: 'maintainer1,maintainer2' |
| 57 | +``` |
| 58 | + |
| 59 | +### PR-Triggered Changed Files Only |
| 60 | + |
| 61 | +```yaml |
| 62 | +name: PR Link Check |
| 63 | +on: |
| 64 | +  pull_request: |
| 65 | +    branches: [ main ] |
| 66 | + |
| 67 | +jobs: |
| 68 | +  link-check: |
| 69 | +    runs-on: ubuntu-latest |
| 70 | +    permissions: |
| 71 | +      contents: read |
| 72 | +      pull-requests: write |
| 73 | +    steps: |
| 74 | +      - uses: actions/checkout@v4 |
| 75 | + |
| 76 | +      - name: Build documentation |
| 77 | +        run: jupyter-book build . |
| 78 | + |
| 79 | +      - name: Check links in changed files |
| 80 | +        uses: QuantEcon/meta/.github/actions/link-checker@main |
| 81 | +        with: |
| 82 | +          html-path: './_build/html' |
| 83 | +          mode: 'changed' |
| 84 | +          fail-on-broken: 'true' |
| 85 | +          ai-suggestions: 'true' |
| 86 | +          silent-codes: '403,503' |
| 87 | +``` |
| 88 | + |
| 89 | +### Complete Advanced Usage |
| 90 | + |
| 91 | +```yaml |
| 92 | +- name: Comprehensive link checking |
| 93 | +  uses: QuantEcon/meta/.github/actions/link-checker@main |
| 94 | +  with: |
| 95 | +    html-path: './_build/html' |
| 96 | +    mode: 'full' |
| 97 | +    silent-codes: '403,503,429' |
| 98 | +    fail-on-broken: 'false' |
| 99 | +    ai-suggestions: 'true' |
| 100 | +    create-issue: 'true' |
| 101 | +    issue-title: 'Link Check Report - Broken Links Found' |
| 102 | +    create-artifact: 'true' |
| 103 | +    artifact-name: 'detailed-link-report' |
| 104 | +    notify: 'team-lead,docs-maintainer' |
| 105 | +    timeout: '30' |
| 106 | +    max-redirects: '5' |
| 107 | +``` |
| 108 | + |
| 109 | +## False Positive Reduction |
| 110 | + |
| 111 | +The action includes intelligent logic to reduce false positives for legitimate sites: |
| 112 | + |
| 113 | +### Bot Blocking Detection |
| 114 | +- **Major Sites**: Automatically detects common sites that block automated requests (Netflix, Amazon, Facebook, etc.) |
| 115 | +- **Encoding Issues**: Identifies encoding errors that often indicate bot protection |
| 116 | +- **Status Code Analysis**: Recognizes rate limiting (429) and bot blocking patterns |
| 117 | +- **Silent Reporting**: Marks likely bot-blocked sites as silent instead of broken |
| 118 | + |
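As a rough illustration of the silent-versus-broken decision described above, the Python sketch below classifies a single check result. The function name `classify_result`, the domain set, and the three labels are assumptions made for this example; the action's actual rules and domain list are not shown in this README.

```python
from urllib.parse import urlparse

# Hypothetical domain list for illustration only.
LIKELY_BOT_BLOCKING_DOMAINS = {"netflix.com", "amazon.com", "facebook.com"}


def classify_result(url, status_code, silent_codes):
    """Return 'ok', 'silent', or 'broken' for one checked link.

    status_code is None when the request failed before any response arrived.
    """
    if status_code is not None and 200 <= status_code < 400:
        return "ok"
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself and any subdomain (www.netflix.com, etc.).
    bot_blocked = any(
        host == domain or host.endswith("." + domain)
        for domain in LIKELY_BOT_BLOCKING_DOMAINS
    )
    if status_code in silent_codes or bot_blocked:
        return "silent"
    return "broken"


print(classify_result("https://www.netflix.com/title/1", 403, set()))  # silent
print(classify_result("https://example.com/missing", 404, {403, 503}))  # broken
```

Treating a failure on a known bot-blocking domain as silent trades a small risk of missing a genuinely dead link for far fewer false alarms.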
| 119 | +### Improved Robustness |
| 120 | +- **Browser-like Headers**: Uses realistic browser headers to reduce blocking |
| 121 | +- **Increased Timeout**: Default 45-second timeout for slow-loading legitimate sites |
| 122 | +- **Smart Error Handling**: Distinguishes between genuine broken links and temporary blocks |
| 123 | + |
| 124 | +### AI Suggestion Filtering |
| 125 | +- **Constructive Suggestions**: Only suggests fixes, not removals, for legitimate domains |
| 126 | +- **Manual Review**: Suggests manual verification for unknown domains instead of automatic removal |
| 127 | +- **Domain Whitelist**: Recognizes trusted domains (GitHub, Python.org, etc.) and handles them appropriately |
| 128 | + |
| 129 | +## AI-Powered Suggestions |
| 130 | + |
| 131 | +The action includes intelligent analysis that can suggest: |
| 132 | + |
| 133 | +### Automatic Fixes |
| 134 | +- **HTTPS Upgrades**: Detects `http://` links that should be `https://` |
| 135 | +- **GitHub Branch Updates**: Finds `/master/` links that should be `/main/` |
| 136 | +- **Documentation Migrations**: Suggests updated URLs for moved documentation sites |
| 137 | +- **Version Updates**: Recommends newer versions of deprecated documentation |
| 138 | + |
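The first two fixes in the list above are simple string rewrites. The following Python sketch shows one way such rules could look; `suggest_fixes` and the labels `https_upgrade` and `branch_update` are hypothetical names chosen for this example, not identifiers taken from the action.

```python
def suggest_fixes(url):
    """Return (kind, suggested_url) pairs for simple rule-based rewrites."""
    suggestions = []
    # Rule 1: propose the https:// form of a plain http:// link.
    if url.startswith("http://"):
        suggestions.append(("https_upgrade", "https://" + url[len("http://"):]))
    # Rule 2: propose /main/ for GitHub links that still point at /master/.
    if "github.com/" in url and "/master/" in url:
        suggestions.append(("branch_update", url.replace("/master/", "/main/", 1)))
    return suggestions


print(suggest_fixes("http://example.com/page"))
# [('https_upgrade', 'https://example.com/page')]
```

A suggestion produced this way is only a candidate: not every site serves HTTPS and not every repository has renamed its default branch, so the suggested URL should itself be checked before it is applied.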
| 139 | +### Redirect Optimization |
| 140 | +- **Final Destination**: Suggests updating redirected links to their final destination |
| 141 | +- **Performance**: Eliminates unnecessary redirect chains |
| 142 | +- **Reliability**: Reduces dependency on redirect services |
| 143 | + |
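To make the "final destination" idea concrete, here is a minimal Python sketch that walks a redirect chain with a hop limit. It assumes the redirects are supplied as a plain dictionary so the example runs offline; a real checker would read each hop from the HTTP `Location` header instead. The name `resolve_final_url` is hypothetical.

```python
def resolve_final_url(url, redirects, max_redirects=5):
    """Follow `redirects` (a dict of source -> target) for up to max_redirects hops.

    Returns (final_url, hop_count). Returns (None, hop_count) when the limit
    is reached, which also covers redirect loops.
    """
    hops = 0
    current = url
    while current in redirects:
        if hops == max_redirects:
            return None, hops
        current = redirects[current]
        hops += 1
    return current, hops


chain = {
    "http://example.com/doc": "https://example.com/doc",
    "https://example.com/doc": "https://example.com/docs/latest",
}
print(resolve_final_url("http://example.com/doc", chain))
# ('https://example.com/docs/latest', 2)
```

Replacing the original link with the returned final URL removes both hops from every future visit.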
| 144 | +### Example AI Suggestions Output |
| 145 | +``` |
| 146 | +🤖 http://docs.python.org/2.7/library/urllib.html |
| 147 | + Issue: Broken link (Status: 404) |
| 148 | + 💡 version_update: https://docs.python.org/3/library/urllib.html |
| 149 | + Reason: Python 2.7 is deprecated, consider Python 3 documentation |
| 150 | + |
| 151 | +🤖 http://github.com/user/repo/blob/master/README.md |
| 152 | + Issue: Redirected 1 times |
| 153 | + 💡 redirect_update: https://github.com/user/repo/blob/main/README.md |
| 154 | + Reason: GitHub default branch changed from master to main |
| 155 | +``` |
| 156 | + |
| 157 | +## How It Works |
| 158 | + |
| 159 | +1. **File Discovery**: Scans HTML files in the specified directory |
| 160 | +2. **Link Extraction**: Uses BeautifulSoup to extract all external links |
| 161 | +3. **Link Validation**: Checks each link with configurable timeout and redirect handling |
| 162 | +4. **AI Analysis**: Applies rule-based AI to suggest improvements |
| 163 | +5. **Reporting**: Creates detailed reports with actionable suggestions |
| 164 | + |
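The link extraction step can be sketched in a few lines of Python. To keep the example dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup, so it illustrates the idea and is not the action's own code; `ExternalLinkParser` and `extract_external_links` are names invented here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalLinkParser(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only external web links; skip anchors, mailto:, and relative paths.
        if href and urlparse(href).scheme in ("http", "https"):
            self.links.append(href)


def extract_external_links(html_text):
    parser = ExternalLinkParser()
    parser.feed(html_text)
    return parser.links


sample = (
    '<p><a href="https://example.com/a">A</a> '
    '<a href="#top">Top</a> <a href="page.html">P</a></p>'
)
print(extract_external_links(sample))  # ['https://example.com/a']
```

Filtering on the URL scheme is what restricts the scan to external links: in-page anchors and relative paths have no scheme and are ignored.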
| 165 | +### Scanning Modes |
| 166 | + |
| 167 | +#### Full Mode (`mode: 'full'`) |
| 168 | +- Scans all HTML files in the target directory |
| 169 | +- Ideal for scheduled weekly scans |
| 170 | +- Comprehensive coverage of entire project |
| 171 | + |
| 172 | +#### Changed Mode (`mode: 'changed'`) |
| 173 | +- Only scans HTML files that changed in the current PR |
| 174 | +- Efficient for PR-triggered workflows |
| 175 | +- Falls back to full scan if no changes detected |
| 176 | + |
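The mode selection, including the fallback to a full scan, could be expressed as the small Python function below. It assumes the caller already has the list of HTML files and the list of files changed in the PR; how the action actually obtains the changed files is not described here, and `select_files` is a hypothetical name.

```python
def select_files(all_html, changed_files, mode="full"):
    """Pick the HTML files to scan for the given mode."""
    if mode != "changed":
        return list(all_html)
    changed = set(changed_files)
    selected = [path for path in all_html if path in changed]
    # Fall back to a full scan when no HTML file is among the changes.
    return selected or list(all_html)


pages = ["index.html", "intro.html", "api/ref.html"]
print(select_files(pages, ["intro.html", "README.md"], mode="changed"))
# ['intro.html']
print(select_files(pages, ["README.md"], mode="changed"))
# ['index.html', 'intro.html', 'api/ref.html']
```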
| 177 | +## Configuration |
| 178 | + |
| 179 | +### Silent Status Codes |
| 180 | + |
| 181 | +Configure which HTTP status codes should be reported without failing: |
| 182 | + |
| 183 | +```yaml |
| 184 | +silent-codes: '403,503,429,502' |
| 185 | +``` |
| 186 | + |
| 187 | +Common codes to consider: |
| 188 | +- `403`: Forbidden (often due to bot detection) |
| 189 | +- `503`: Service Unavailable (temporary outages) |
| 190 | +- `429`: Too Many Requests (rate limiting) |
| 191 | +- `502`: Bad Gateway (temporary server issues) |
| 192 | + |
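Because `silent-codes` is passed as a single comma-separated string, it has to be turned into a set of integers somewhere before it can be compared with response codes. A minimal Python sketch of that parsing step, with the hypothetical helper name `parse_silent_codes`:

```python
def parse_silent_codes(raw):
    """Turn a comma-separated input such as '403,503' into a set of ints."""
    # int() tolerates surrounding whitespace, so '403, 503' also works.
    return {int(part) for part in raw.split(",") if part.strip()}


print(sorted(parse_silent_codes("403,503,429,502")))  # [403, 429, 502, 503]
```

An empty input yields an empty set, in which case no status code is treated as silent.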
| 193 | +### Performance Tuning |
| 194 | + |
| 195 | +```yaml |
| 196 | +timeout: '30' # Timeout per link in seconds |
| 197 | +max-redirects: '5' # Maximum redirects to follow |
| 198 | +``` |
| 199 | + |
| 200 | +## Integration Examples |
| 201 | + |
| 202 | +### Replacing Lychee |
| 203 | + |
| 204 | +**Before (using lychee):** |
| 205 | +```yaml |
| 206 | +- name: Link Checker |
| 207 | +  uses: lycheeverse/lychee-action@v2 |
| 208 | +  with: |
| 209 | +    fail: false |
| 210 | +    args: --accept 403,503 *.html |
| 211 | +``` |
| 212 | + |
| 213 | +**After (using AI-powered link checker):** |
| 214 | +```yaml |
| 215 | +- name: AI-Powered Link Checker |
| 216 | +  uses: QuantEcon/meta/.github/actions/link-checker@main |
| 217 | +  with: |
| 218 | +    html-path: '.' |
| 219 | +    fail-on-broken: 'false' |
| 220 | +    silent-codes: '403,503' |
| 221 | +    ai-suggestions: 'true' |
| 222 | +    create-issue: 'true' |
| 223 | +``` |
| 224 | + |
| 225 | +### MyST Markdown Projects |
| 226 | + |
| 227 | +For Jupyter Book projects: |
| 228 | + |
| 229 | +```yaml |
| 230 | +- name: Build Jupyter Book |
| 231 | +  run: jupyter-book build lectures/ |
| 232 | + |
| 233 | +- name: Check links in built documentation |
| 234 | +  uses: QuantEcon/meta/.github/actions/link-checker@main |
| 235 | +  with: |
| 236 | +    html-path: './lectures/_build/html' |
| 237 | +    mode: 'full' |
| 238 | +    ai-suggestions: 'true' |
| 239 | +``` |
| 240 | + |
| 241 | +## Using Outputs |
| 242 | + |
| 243 | +Use action outputs in subsequent workflow steps: |
| 244 | + |
| 245 | +```yaml |
| 246 | +- name: Check links |
| 247 | +  id: link-check |
| 248 | +  uses: QuantEcon/meta/.github/actions/link-checker@main |
| 249 | +  with: |
| 250 | +    fail-on-broken: 'false' |
| 251 | + |
| 252 | +- name: Report results |
| 253 | +  run: | |
| 254 | +    echo "Broken links: ${{ steps.link-check.outputs.broken-link-count }}" |
| 255 | +    echo "Redirects: ${{ steps.link-check.outputs.redirect-count }}" |
| 256 | +    echo "AI suggestions available: ${{ steps.link-check.outputs.ai-suggestions != '' }}" |
| 257 | +``` |
| 258 | + |
| 259 | +## Permissions |
| 260 | + |
| 261 | +Required workflow permissions depend on features used: |
| 262 | + |
| 263 | +```yaml |
| 264 | +permissions: |
| 265 | +  contents: read # Always required |
| 266 | +  issues: write # For create-issue: 'true' |
| 267 | +  pull-requests: write # For PR comments |
| 268 | +  actions: read # For create-artifact: 'true' |
| 269 | +``` |
| 270 | + |
| 271 | +## Inputs |
| 272 | + |
| 273 | +| Input | Description | Required | Default | |
| 274 | +|-------|-------------|----------|---------| |
| 275 | +| `html-path` | Path to HTML files directory | No | `./_build/html` | |
| 276 | +| `mode` | Scan mode: `full` or `changed` | No | `full` | |
| 277 | +| `silent-codes` | HTTP codes to silently report | No | `403,503` | |
| 278 | +| `fail-on-broken` | Fail workflow on broken links | No | `true` | |
| 279 | +| `ai-suggestions` | Enable AI-powered suggestions | No | `true` | |
| 280 | +| `create-issue` | Create GitHub issue for broken links | No | `false` | |
| 281 | +| `issue-title` | Title for created issues | No | `Broken Links Found in Documentation` | |
| 282 | +| `create-artifact` | Create workflow artifact | No | `false` | |
| 283 | +| `artifact-name` | Name for workflow artifact | No | `link-check-report` | |
| 284 | +| `notify` | Comma-separated users to assign to the created issue | No | `''` | |
| 285 | +| `timeout` | Timeout per link (seconds) | No | `45` | |
| 286 | +| `max-redirects` | Maximum redirects to follow | No | `5` | |
| 287 | + |
| 288 | +## Outputs |
| 289 | + |
| 290 | +| Output | Description | |
| 291 | +|--------|-------------| |
| 292 | +| `broken-links-found` | Whether broken links were found | |
| 293 | +| `broken-link-count` | Number of broken links | |
| 294 | +| `redirect-count` | Number of redirects found | |
| 295 | +| `link-details` | Detailed broken link information | |
| 296 | +| `ai-suggestions` | AI-powered improvement suggestions | |
| 297 | +| `issue-url` | URL of created GitHub issue | |
| 298 | +| `artifact-path` | Path to created artifact file | |
| 299 | + |
| 300 | +## Best Practices |
| 301 | + |
| 302 | +1. **Weekly Scans**: Use scheduled workflows for comprehensive link checking |
| 303 | +2. **PR Validation**: Use changed-file mode for efficient PR validation |
| 304 | +3. **Status Code Configuration**: Adjust silent codes based on your links' typical behavior |
| 305 | +4. **AI Suggestions**: Review and apply AI suggestions to improve link quality |
| 306 | +5. **Issue Management**: Use automatic issue creation for tracking broken links |
| 307 | +6. **Performance**: Set appropriate timeouts based on your link destinations |
| 308 | + |
| 309 | +## Troubleshooting |
| 310 | + |
| 311 | +### Common Issues |
| 312 | + |
| 313 | +1. **Timeout Errors**: Increase `timeout` value for slow-responding sites (default is now 45s) |
| 314 | +2. **False Positives**: The action automatically detects major sites that block bots (Netflix, Amazon, etc.) |
| 315 | +3. **Rate Limiting**: Add `429` to `silent-codes` for rate-limited sites |
| 316 | +4. **Bot Blocking**: Legitimate sites blocking automated requests are automatically handled gracefully |
| 317 | +5. **Large Repositories**: Use `changed` mode for PR workflows |
| 318 | + |
| 319 | +### False Positive Mitigation |
| 320 | + |
| 321 | +If legitimate links are being flagged as broken: |
| 322 | + |
| 323 | +1. **Check if it's a major site**: Netflix, Amazon, Facebook, etc. are automatically detected as likely bot-blocked |
| 324 | +2. **Increase timeout**: Use `timeout: '60'` for slower sites like tutorials or educational content |
| 325 | +3. **Add to silent codes**: If a site consistently returns specific error codes, add them to `silent-codes` |
| 326 | +4. **Review AI suggestions**: The action provides constructive fix suggestions rather than suggesting removal |
| 327 | + |
| 328 | +### Debug Output |
| 329 | + |
| 330 | +The action provides detailed logging including: |
| 331 | +- Number of files scanned |
| 332 | +- Links found per file |
| 333 | +- Status codes and errors |
| 334 | +- AI suggestion reasoning |
| 335 | + |
| 336 | +## Migration from Lychee |
| 337 | + |
| 338 | +This action can directly replace `lychee` workflows with enhanced functionality: |
| 339 | + |
| 340 | +1. Replace `lycheeverse/lychee-action` with this action |
| 341 | +2. Update input parameters (see the before/after comparison under "Replacing Lychee") |
| 342 | +3. Add AI suggestions and issue creation features |
| 343 | +4. Configure silent status codes as needed |
| 344 | + |
| 345 | +The enhanced AI capabilities provide value beyond basic link checking by suggesting improvements and maintaining link quality over time. |