Commit 740efcd

Copilot and mmcky authored
Add enhanced AI-powered link checker GitHub action with robust false positive reduction and merge conflict resolution (#196)
* Initial plan
* Implement AI-powered link checker action with comprehensive functionality
  Co-authored-by: mmcky <[email protected]>
* Fix YAML syntax and simplify action architecture with separate Python scripts
  Co-authored-by: mmcky <[email protected]>
* Enhance link checker robustness to reduce false positives for legitimate sites
  Co-authored-by: mmcky <[email protected]>
* Resolve merge conflicts with main branch - include both link-checker and weekly-report actions
  Co-authored-by: mmcky <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: mmcky <[email protected]>
1 parent 29678de commit 740efcd

14 files changed: +1883 -1 lines changed

Lines changed: 345 additions & 0 deletions
@@ -0,0 +1,345 @@
# AI-Powered Link Checker Action

This GitHub Action scans HTML files for web links, validates them, and provides AI-powered suggestions for improvements. It is designed to replace traditional link checkers such as `lychee`: in addition to detecting broken links, it suggests better alternatives using AI-driven analysis.

## Features

- **Smart Link Validation**: Checks external web links in HTML files with configurable timeout and redirect handling
- **Enhanced Robustness**: Intelligent detection of bot-blocked sites to reduce false positives
- **AI-Powered Suggestions**: Provides intelligent recommendations for broken or redirected links
- **Two Scanning Modes**: Full project scan or PR-specific changed files only
- **Configurable Status Codes**: Define which HTTP status codes are reported silently, without failing the workflow (e.g., 403, 503)
- **Redirect Detection**: Identifies and suggests updates for redirected links
- **GitHub Integration**: Creates issues, PR comments, and workflow artifacts
- **MyST Markdown Support**: Works with Jupyter Book projects by scanning HTML output
- **Performance Optimized**: Respectful rate limiting, improved timeouts, and efficient scanning

## Usage

### Basic Usage

```yaml
- name: Check links in documentation
  uses: QuantEcon/meta/.github/actions/link-checker@main
```

### Weekly Full Project Scan

```yaml
name: Weekly Link Check
on:
  schedule:
    - cron: '0 9 * * 1'  # Monday at 9 AM UTC
  workflow_dispatch:

jobs:
  link-check:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      issues: write
    steps:
      - uses: actions/checkout@v4
        with:
          ref: gh-pages  # Check the published site

      - name: AI-powered link check
        uses: QuantEcon/meta/.github/actions/link-checker@main
        with:
          html-path: '.'
          mode: 'full'
          fail-on-broken: 'false'
          create-issue: 'true'
          ai-suggestions: 'true'
          silent-codes: '403,503'
          issue-title: 'Weekly Link Check Report'
          notify: 'maintainer1,maintainer2'
```

### PR-Triggered Changed Files Only

```yaml
name: PR Link Check
on:
  pull_request:
    branches: [ main ]

jobs:
  link-check:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Build documentation
        run: jupyter-book build .

      - name: Check links in changed files
        uses: QuantEcon/meta/.github/actions/link-checker@main
        with:
          html-path: './_build/html'
          mode: 'changed'
          fail-on-broken: 'true'
          ai-suggestions: 'true'
          silent-codes: '403,503'
```

### Complete Advanced Usage

```yaml
- name: Comprehensive link checking
  uses: QuantEcon/meta/.github/actions/link-checker@main
  with:
    html-path: './_build/html'
    mode: 'full'
    silent-codes: '403,503,429'
    fail-on-broken: 'false'
    ai-suggestions: 'true'
    create-issue: 'true'
    issue-title: 'Link Check Report - Broken Links Found'
    create-artifact: 'true'
    artifact-name: 'detailed-link-report'
    notify: 'team-lead,docs-maintainer'
    timeout: '30'
    max-redirects: '5'
```

## False Positive Reduction

The action includes intelligent logic to reduce false positives for legitimate sites:

### Bot Blocking Detection
- **Major Sites**: Automatically detects common sites that block automated requests (Netflix, Amazon, Facebook, etc.)
- **Encoding Issues**: Identifies encoding errors that often indicate bot protection
- **Status Code Analysis**: Recognizes rate limiting (429) and bot blocking patterns
- **Silent Reporting**: Marks likely bot-blocked sites as silent instead of broken
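
The detection rules above could be sketched roughly as follows. This is only an illustration: the function name `classify_failure`, the domain list, and the choice to keep a 404 from a known site as broken are assumptions, since this README does not show the action's actual list or logic.

```python
from urllib.parse import urlparse

# Hypothetical list; the action's real list of bot-blocking sites is not shown here.
BOT_BLOCKING_DOMAINS = {"netflix.com", "amazon.com", "facebook.com"}


def classify_failure(url, status_code=None, error=None):
    """Classify a failed link check as 'silent' (likely bot-blocked) or 'broken'."""
    host = (urlparse(url).hostname or "").lower()
    # Match the listed domain itself or any of its subdomains.
    known_blocker = any(
        host == domain or host.endswith("." + domain)
        for domain in BOT_BLOCKING_DOMAINS
    )
    rate_limited = status_code == 429
    encoding_issue = bool(error) and "encoding" in error.lower()

    if rate_limited or encoding_issue:
        return "silent"
    # Assumption: a 404 from a known site is still treated as a real broken link.
    if known_blocker and status_code != 404:
        return "silent"
    return "broken"
```

With these assumptions, `classify_failure("https://www.amazon.com/dp/123", status_code=503)` is reported as silent, while a 404 from an unlisted domain stays broken.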

### Improved Robustness
- **Browser-like Headers**: Uses realistic browser headers to reduce blocking
- **Increased Timeout**: Default 45-second timeout for slow-loading legitimate sites
- **Smart Error Handling**: Distinguishes between genuine broken links and temporary blocks
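
As a rough illustration of browser-like headers combined with a generous timeout, the sketch below uses only the Python standard library. The header values and the names `build_request` and `fetch_status` are assumptions made for this example; the README does not list the headers or HTTP client the action actually uses.

```python
import urllib.error
import urllib.request

# Assumed header values, chosen to resemble a desktop browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
DEFAULT_TIMEOUT = 45  # seconds, matching the documented default


def build_request(url):
    """Build a GET request that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS, method="GET")


def fetch_status(url, timeout=DEFAULT_TIMEOUT):
    """Return the HTTP status code, or None on a network error or timeout."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 403 or 404.
        return exc.code
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return None
```

Returning `None` for network-level failures keeps them distinguishable from HTTP error codes, which is one way to separate temporary blocks from genuinely broken links.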

### AI Suggestion Filtering
- **Constructive Suggestions**: Only suggests fixes, not removals, for legitimate domains
- **Manual Review**: Suggests manual verification for unknown domains instead of automatic removal
- **Domain Whitelist**: Recognizes trusted domains (GitHub, Python.org, etc.) and handles them appropriately

## AI-Powered Suggestions

The action includes intelligent analysis that can suggest:

### Automatic Fixes
- **HTTPS Upgrades**: Detects `http://` links that should be `https://`
- **GitHub Branch Updates**: Finds `/master/` links that should be `/main/`
- **Documentation Migrations**: Suggests updated URLs for moved documentation sites
- **Version Updates**: Recommends newer versions of deprecated documentation
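
The first two rules in this list are simple string rewrites. A minimal sketch, assuming a hypothetical helper named `suggest_fix` (the action's real rule set and naming are not shown in this README), might look like this:

```python
import re


def suggest_fix(url):
    """Return (suggested_url, reasons); suggested_url is None when no rule applies."""
    candidate = url
    reasons = []

    # Rule 1: upgrade plain HTTP links to HTTPS.
    if candidate.startswith("http://"):
        candidate = "https://" + candidate[len("http://"):]
        reasons.append("upgrade http:// to https://")

    # Rule 2: GitHub links that still point at the old default branch name.
    if re.match(r"https://github\.com/", candidate) and "/master/" in candidate:
        candidate = candidate.replace("/master/", "/main/", 1)
        reasons.append("GitHub default branch changed from master to main")

    return (candidate if reasons else None), reasons
```

A suggestion produced this way is only a candidate: not every site serves HTTPS and not every repository renamed its default branch, so the rewritten URL should itself be checked before it is proposed.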

### Redirect Optimization
- **Final Destination**: Suggests updating redirected links to their final destination
- **Performance**: Eliminates unnecessary redirect chains
- **Reliability**: Reduces dependency on redirect services

### Example AI Suggestions Output
```
🤖 http://docs.python.org/2.7/library/urllib.html
   Issue: Broken link (Status: 404)
   💡 version_update: https://docs.python.org/3/library/urllib.html
      Reason: Python 2.7 is deprecated, consider Python 3 documentation

🤖 http://github.com/user/repo/blob/master/README.md
   Issue: Redirected 1 times
   💡 redirect_update: https://github.com/user/repo/blob/main/README.md
      Reason: GitHub default branch changed from master to main
```

## How It Works

1. **File Discovery**: Scans HTML files in the specified directory
2. **Link Extraction**: Uses BeautifulSoup to extract all external links
3. **Link Validation**: Checks each link with configurable timeout and redirect handling
4. **AI Analysis**: Applies rule-based AI to suggest improvements
5. **Reporting**: Creates detailed reports with actionable suggestions
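
The link extraction step can be illustrated with a short sketch. The action itself uses BeautifulSoup; to stay dependency-free, this example swaps in the standard library's `html.parser`, and the names `ExternalLinkParser` and `extract_external_links` are hypothetical.

```python
from html.parser import HTMLParser


class ExternalLinkParser(HTMLParser):
    """Collect href values of <a> tags that point to external http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors, relative paths, mailto: links, and empty hrefs.
        if href and href.startswith(("http://", "https://")):
            self.links.append(href)


def extract_external_links(html_text):
    """Return external links in document order."""
    parser = ExternalLinkParser()
    parser.feed(html_text)
    return parser.links
```

Filtering on the `http://` and `https://` prefixes is what restricts validation to external links; internal anchors and relative paths never reach the network check.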
164+
165+
### Scanning Modes
166+
167+
#### Full Mode (`mode: 'full'`)
168+
- Scans all HTML files in the target directory
169+
- Ideal for scheduled weekly scans
170+
- Comprehensive coverage of entire project
171+
172+
#### Changed Mode (`mode: 'changed'`)
173+
- Only scans HTML files that changed in the current PR
174+
- Efficient for PR-triggered workflows
175+
- Falls back to full scan if no changes detected
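
The two modes and the fallback can be sketched as below. How the action determines which files changed in a PR is not described in this README, so the sketch assumes the list is passed in as `changed_files`; the function name `select_html_files` is also hypothetical.

```python
from pathlib import Path


def select_html_files(html_path, mode="full", changed_files=None):
    """Pick the HTML files to scan for the given mode."""
    all_files = sorted(Path(html_path).rglob("*.html"))
    if mode == "full":
        return all_files

    # Changed mode: keep only the files that appear in the changed list.
    changed = {Path(p).resolve() for p in (changed_files or [])}
    selected = [f for f in all_files if f.resolve() in changed]

    # Fall back to a full scan when no changed HTML files are found.
    return selected or all_files
```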

## Configuration

### Silent Status Codes

Configure which HTTP status codes should be reported without failing:

```yaml
silent-codes: '403,503,429,502'
```

Common codes to consider:
- `403`: Forbidden (often due to bot detection)
- `503`: Service Unavailable (temporary outages)
- `429`: Too Many Requests (rate limiting)
- `502`: Bad Gateway (temporary server issues)
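
To make the effect of `silent-codes` concrete, here is a small sketch of how a comma-separated value could be parsed and applied. The function names and the three result labels are assumptions for illustration, not the action's actual code.

```python
def parse_silent_codes(raw):
    """Turn a string such as '403,503,429' into a set of integer status codes."""
    return {int(code) for code in raw.split(",") if code.strip()}


def classify_status(status_code, silent_codes):
    """Return 'ok', 'silent', or 'broken' for one checked link."""
    if status_code is None:
        # No response at all (for example, DNS failure or timeout).
        return "broken"
    if status_code in silent_codes:
        # Reported in the results, but does not fail the workflow.
        return "silent"
    if 200 <= status_code < 400:
        return "ok"
    return "broken"
```

With `silent-codes: '403,503'`, a 403 response is reported as silent, while a 404 is still counted as broken.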

### Performance Tuning

```yaml
timeout: '30'       # Timeout per link in seconds
max-redirects: '5'  # Maximum redirects to follow
```

## Integration Examples

### Replacing Lychee

**Before (using lychee):**
```yaml
- name: Link Checker
  uses: lycheeverse/lychee-action@v2
  with:
    fail: false
    args: --accept 403,503 *.html
```

**After (using AI-powered link checker):**
```yaml
- name: AI-Powered Link Checker
  uses: QuantEcon/meta/.github/actions/link-checker@main
  with:
    html-path: '.'
    fail-on-broken: 'false'
    silent-codes: '403,503'
    ai-suggestions: 'true'
    create-issue: 'true'
```

### MyST Markdown Projects

For Jupyter Book projects:

```yaml
- name: Build Jupyter Book
  run: jupyter-book build lectures/

- name: Check links in built documentation
  uses: QuantEcon/meta/.github/actions/link-checker@main
  with:
    html-path: './lectures/_build/html'
    mode: 'full'
    ai-suggestions: 'true'
```

## Using Outputs

Use action outputs in subsequent workflow steps:

```yaml
- name: Check links
  id: link-check
  uses: QuantEcon/meta/.github/actions/link-checker@main
  with:
    fail-on-broken: 'false'

- name: Report results
  run: |
    echo "Broken links: ${{ steps.link-check.outputs.broken-link-count }}"
    echo "Redirects: ${{ steps.link-check.outputs.redirect-count }}"
    echo "AI suggestions available: ${{ steps.link-check.outputs.ai-suggestions != '' }}"
```

## Permissions

Required workflow permissions depend on features used:

```yaml
permissions:
  contents: read        # Always required
  issues: write         # For create-issue: 'true'
  pull-requests: write  # For PR comments
  actions: read         # For create-artifact: 'true'
```

## Inputs

| Input | Description | Required | Default |
|-------|-------------|----------|---------|
| `html-path` | Path to HTML files directory | No | `./_build/html` |
| `mode` | Scan mode: `full` or `changed` | No | `full` |
| `silent-codes` | HTTP codes to silently report | No | `403,503` |
| `fail-on-broken` | Fail workflow on broken links | No | `true` |
| `ai-suggestions` | Enable AI-powered suggestions | No | `true` |
| `create-issue` | Create GitHub issue for broken links | No | `false` |
| `issue-title` | Title for created issues | No | `Broken Links Found in Documentation` |
| `create-artifact` | Create workflow artifact | No | `false` |
| `artifact-name` | Name for workflow artifact | No | `link-check-report` |
| `notify` | Users to assign to created issue | No | `''` |
| `timeout` | Timeout per link (seconds) | No | `45` |
| `max-redirects` | Maximum redirects to follow | No | `5` |

## Outputs

| Output | Description |
|--------|-------------|
| `broken-links-found` | Whether broken links were found |
| `broken-link-count` | Number of broken links |
| `redirect-count` | Number of redirects found |
| `link-details` | Detailed broken link information |
| `ai-suggestions` | AI-powered improvement suggestions |
| `issue-url` | URL of created GitHub issue |
| `artifact-path` | Path to created artifact file |

## Best Practices

1. **Weekly Scans**: Use scheduled workflows for comprehensive link checking
2. **PR Validation**: Use changed-file mode for efficient PR validation
3. **Status Code Configuration**: Adjust silent codes based on your links' typical behavior
4. **AI Suggestions**: Review and apply AI suggestions to improve link quality
5. **Issue Management**: Use automatic issue creation for tracking broken links
6. **Performance**: Set appropriate timeouts based on your link destinations

## Troubleshooting

### Common Issues

1. **Timeout Errors**: Increase the `timeout` value for slow-responding sites (the default is 45 seconds)
2. **False Positives**: The action automatically detects major sites that block bots (Netflix, Amazon, etc.)
3. **Rate Limiting**: Add `429` to `silent-codes` for rate-limited sites
4. **Bot Blocking**: Legitimate sites that block automated requests are reported as silent rather than broken
5. **Large Repositories**: Use `changed` mode for PR workflows
319+
### False Positive Mitigation
320+
321+
If legitimate links are being flagged as broken:
322+
323+
1. **Check if it's a major site**: Netflix, Amazon, Facebook, etc. are automatically detected as likely bot-blocked
324+
2. **Increase timeout**: Use `timeout: '60'` for slower sites like tutorials or educational content
325+
3. **Add to silent codes**: If a site consistently returns specific error codes, add them to `silent-codes`
326+
4. **Review AI suggestions**: The action provides constructive fix suggestions rather than suggesting removal

### Debug Output

The action provides detailed logging including:
- Number of files scanned
- Links found per file
- Status codes and errors
- AI suggestion reasoning

## Migration from Lychee

This action can directly replace `lychee` workflows with enhanced functionality:

1. Replace `lycheeverse/lychee-action` with this action
2. Update input parameters (see the before/after comparison under "Replacing Lychee")
3. Add AI suggestions and issue creation features
4. Configure silent status codes as needed

The enhanced AI capabilities provide value beyond basic link checking by suggesting improvements and maintaining link quality over time.
10.5 KB
Binary file not shown.
