A comprehensive system to track when major LLM companies (OpenAI, Google, Anthropic, Perplexity) scrape the Synapticlabs.ai website. The system logs scraper activity, analyzes patterns, and provides analytics through a web dashboard.
- Multi-layered Detection: Identifies LLM scrapers using User-Agent strings, IP ranges, ASN analysis, and reverse DNS verification
- Real-time Tracking: Logs all scraper interactions with detailed metadata
- Analytics Dashboard: Web-based dashboard showing scraping patterns and trends
- Webhook Integration: External API for website integration
- Guided Path Monitoring: Tracks effectiveness of directing scrapers to important content
- Performance Optimized: Batch processing, caching, and efficient database operations
- Backend: Node.js with Express.js
- Database: Supabase (PostgreSQL)
- Hosting: Railway
- Frontend: Vanilla HTML/CSS/JavaScript
- External APIs: ipapi.co for IP geolocation
- Clone the repository

  ```bash
  git clone <repository-url>
  cd llm-scraper-tracker
  ```
- Install dependencies

  ```bash
  npm install
  ```
- Set up environment variables

  ```bash
  cp .env.example .env
  # Edit .env with your configuration
  ```
- Set up Supabase database
  - Create a new Supabase project
  - Run the SQL schema from `scripts/setup-database.sql` in your Supabase SQL editor
  - Update `.env` with your Supabase credentials
- Start the application

  ```bash
  # Development
  npm run dev

  # Production
  npm start
  ```
```env
PORT=3000
NODE_ENV=development
SUPABASE_URL=your_supabase_project_url
SUPABASE_ANON_KEY=your_supabase_anon_key
SUPABASE_SERVICE_ROLE_KEY=your_supabase_service_role_key
IP_API_KEY=optional_ip_api_key
WEBHOOK_SECRET=random_secure_string_for_webhook_auth
```

Run the SQL schema in `scripts/setup-database.sql` to create:
- `llm_activity` - Main tracking table
- `page_scraping_stats` - Aggregated page statistics
- `llm_company_stats` - Daily company summaries
- Indexes and views for performance
- Triggers for automatic statistics updates
- `GET /api/stats` - Comprehensive statistics
- `GET /api/activity` - Recent scraper activity
- `GET /api/companies` - List of tracked companies
- `GET /api/guidance` - Guided path effectiveness
- `GET /api/system` - System status and performance
- `POST /webhook/track` - Single tracking event
- `POST /webhook/batch` - Batch tracking events
- `POST /webhook/test` - Test webhook functionality
- `GET /webhook/status` - Webhook service status
- `GET /dashboard` - Analytics dashboard
- `GET /health` - Health check endpoint
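For a quick smoke test of the read-only endpoints from JavaScript, a small helper can build the query URL. This is an illustrative sketch, not part of the codebase; the base URL and the `days` query parameter mirror the curl example elsewhere in this README:

```javascript
// Build a stats request URL; baseUrl is wherever the tracker is deployed.
function statsUrl(baseUrl, days = 30) {
  const url = new URL('/api/stats', baseUrl);
  url.searchParams.set('days', String(days));
  return url.toString();
}

// Example (requires a running tracker instance):
// fetch(statsUrl('http://localhost:3000', 7)).then(r => r.json()).then(console.log);

console.log(statsUrl('http://localhost:3000')); // http://localhost:3000/api/stats?days=30
```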
The system can detect scrapers from:
- OpenAI (GPTBot, ChatGPT-User, CCBot)
- Anthropic (ClaudeBot, ANTHROPIC-AI)
- Google (Googlebot, Google-Extended, Bard)
- Perplexity AI (PerplexityBot)
- Meta AI (facebookexternalhit, Meta-ExternalAgent)
- Other LLM crawlers (Various AI and automated crawlers)
- User-Agent Analysis: Matches against known LLM crawler patterns
- IP Range Checking: Verifies against known company IP ranges
- ASN Verification: Checks Autonomous System Numbers
- Reverse DNS: Special verification for Google crawlers
- Organization Analysis: Analyzes IP organization names
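The user-agent layer of this stack can be sketched as a simple match against a signature table. The patterns and structure below are illustrative only; the real signatures live in `src/config/llm-signatures.js` and the real engine in `src/services/detector.js`:

```javascript
// Illustrative signature table; not the exact contents of src/config/llm-signatures.js.
const SIGNATURES = [
  { company: 'OpenAI',     pattern: /gptbot|chatgpt-user/i },
  { company: 'Anthropic',  pattern: /claudebot|anthropic-ai/i },
  { company: 'Google',     pattern: /googlebot|google-extended/i },
  { company: 'Perplexity', pattern: /perplexitybot/i },
  { company: 'Meta',       pattern: /facebookexternalhit|meta-externalagent/i },
];

// Return the matching company, or null if the UA looks like a normal browser.
function detectByUserAgent(userAgent) {
  if (!userAgent) return null;
  const hit = SIGNATURES.find(({ pattern }) => pattern.test(userAgent));
  return hit ? hit.company : null;
}

console.log(detectByUserAgent('GPTBot/1.0 (+https://openai.com/gptbot)')); // "OpenAI"
console.log(detectByUserAgent('Mozilla/5.0 (Windows NT 10.0)'));           // null
```

User-agent matching alone is spoofable, which is why the layers above (IP ranges, ASN, reverse DNS) exist as corroborating signals.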
```bash
curl -X POST http://localhost:3000/api/test \
  -H "Content-Type: application/json" \
  -d '{"userAgent": "GPTBot/1.0 (+https://openai.com/gptbot)"}'
```

```javascript
// Send tracking data from your website
fetch('http://your-tracker-url/webhook/track', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: window.location.href,
    userAgent: navigator.userAgent,
    referrer: document.referrer,
    timestamp: new Date().toISOString()
  })
});
```

```bash
curl "http://localhost:3000/api/stats?days=30"
```

Run the core service tests:
```bash
node tests/core.test.js
```

Test individual components:

```bash
# Test detection engine
npm test -- detector

# Test IP analyzer
npm test -- ipAnalyzer

# Test API endpoints
npm test -- api
```
- Connect to Railway

  ```bash
  npm install -g @railway/cli
  railway login
  railway init
  ```
- Set environment variables

  ```bash
  railway variables set SUPABASE_URL=your_url
  railway variables set SUPABASE_ANON_KEY=your_key
  # ... other variables
  ```
- Deploy

  ```bash
  railway up
  ```
- Build and start

  ```bash
  npm install --production
  npm start
  ```
- Use PM2 for production

  ```bash
  npm install -g pm2
  pm2 start src/app.js --name llm-tracker
  pm2 startup
  pm2 save
  ```
```
llm-scraper-tracker/
├── src/
│   ├── app.js                 # Main Express application
│   ├── config/
│   │   ├── database.js        # Supabase configuration
│   │   └── llm-signatures.js  # LLM detection patterns
│   ├── middleware/
│   │   └── tracker.js         # Main tracking middleware
│   ├── services/
│   │   ├── detector.js        # LLM detection engine
│   │   ├── ipAnalyzer.js      # IP address analysis
│   │   └── logger.js          # Database logging service
│   └── routes/
│       ├── api.js             # API endpoints
│       ├── dashboard.js       # Dashboard routes
│       └── webhook.js         # Webhook endpoints
├── scripts/
│   └── setup-database.sql     # Database schema
├── tests/
│   └── core.test.js           # Core service tests
├── cline_docs/                # Memory bank documentation
├── package.json
└── README.md
```
- Input Validation: All webhook inputs are validated
- Rate Limiting: Prevents abuse of tracking endpoints
- CORS Protection: Configured for specific origins
- Helmet Security: Security headers enabled
- Error Handling: No sensitive information in error responses
- Async Operations: All I/O operations are non-blocking
- Batch Processing: Groups database writes for efficiency
- Caching: IP analysis results cached for 24 hours
- Selective Tracking: Skips tracking for static assets
- Connection Pooling: Efficient database connections
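The 24-hour IP cache mentioned above can be sketched as a small TTL map. This is illustrative only; the real logic lives in `src/services/ipAnalyzer.js` and may differ:

```javascript
// Minimal TTL cache sketch for IP analysis results (illustrative values).
const DAY_MS = 24 * 60 * 60 * 1000;
const cache = new Map(); // ip -> { value, expiresAt }

function setCached(ip, value, now = Date.now()) {
  cache.set(ip, { value, expiresAt: now + DAY_MS });
}

// Return the cached analysis, or null if missing or older than 24 hours.
function getCached(ip, now = Date.now()) {
  const entry = cache.get(ip);
  if (!entry || entry.expiresAt <= now) {
    cache.delete(ip); // evict stale entries lazily
    return null;
  }
  return entry.value;
}

setCached('203.0.113.7', { org: 'Example Org' }, 0);
console.log(getCached('203.0.113.7', 1000));       // { org: 'Example Org' }
console.log(getCached('203.0.113.7', DAY_MS + 1)); // null (expired)
```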
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
MIT License - see LICENSE file for details
For issues and questions:
- Check the API documentation
- Review the troubleshooting guide
- Open an issue on GitHub
- Initial release
- Multi-layered LLM detection
- Real-time activity tracking
- Analytics dashboard
- Webhook integration
- Comprehensive API