Intelligent Web Scraping with LLMs - A TypeScript library that combines traditional web scraping with Large Language Models for intelligent, structured data extraction.
- Dual LLM Support: Ollama (free/local) + Google Gemini (cloud)
- Graph-Based Architecture: Composable, reusable node pipelines
- Production-Ready: Built-in caching, retries, rate limiting, and proxy rotation
- Smart Parsing: Automatic HTML→Markdown conversion and intelligent chunking
- Schema Validation: Zod integration for type-safe outputs
- Multiple Formats: JSON, CSV, XML, PDF support
- Browser Automation: Playwright for dynamic content
- RAG Integration: Retrieval-Augmented Generation for better accuracy
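
As a rough illustration of the retry behavior listed under Production-Ready, the sketch below computes an exponential backoff schedule and wraps an async call in a retry loop. The field names (`maxRetries`, `initialDelay`, `maxDelay`, `backoffMultiplier`) match the retry options documented in this README. The formula, the millisecond unit, and the helpers `backoffDelay`, `backoffSchedule`, and `withRetry` are assumptions for illustration, not the library's actual implementation.

```typescript
// Field names mirror the retry options in this README; the formula is an assumption.
interface RetryConfig {
  maxRetries: number;
  initialDelay: number; // assumed to be milliseconds
  maxDelay: number; // assumed to be milliseconds
  backoffMultiplier: number;
}

// Delay before retry `attempt` (0-based):
// initialDelay * backoffMultiplier^attempt, capped at maxDelay.
function backoffDelay(config: RetryConfig, attempt: number): number {
  const raw = config.initialDelay * Math.pow(config.backoffMultiplier, attempt);
  return Math.min(raw, config.maxDelay);
}

// One delay per allowed retry.
function backoffSchedule(config: RetryConfig): number[] {
  return Array.from({ length: config.maxRetries }, (_, i) => backoffDelay(config, i));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: call `fn` once, then retry up to maxRetries times,
// waiting for the backoff delay between attempts.
async function withRetry<T>(fn: () => Promise<T>, config: RetryConfig): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < config.maxRetries) {
        await sleep(backoffDelay(config, attempt));
      }
    }
  }
  throw lastError;
}

const schedule = backoffSchedule({
  maxRetries: 3,
  initialDelay: 1000,
  maxDelay: 10000,
  backoffMultiplier: 2,
});
console.log(schedule); // delays in ms: 1000, 2000, 4000
```

Capping at `maxDelay` keeps a long run of failures from producing unbounded waits.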
npm install cogniscrape

// Quick start with Google Gemini (cloud)
import { SmartScraperGraph } from 'cogniscrape';

const scraper = new SmartScraperGraph({
  prompt: 'Extract all product names and prices',
  source: 'https://example.com/products',
  config: {
    llm: {
      provider: 'gemini',
      model: 'gemini-2.0-flash-exp',
      apiKey: process.env.GEMINI_API_KEY,
    },
    verbose: true,
  },
});

const result = await scraper.run();
console.log(result);

// Quick start with Ollama (free, local)
import { SmartScraperGraph } from 'cogniscrape';
const scraper = new SmartScraperGraph({
  prompt: 'List all article titles and summaries',
  source: 'https://news.example.com',
  config: {
    llm: {
      provider: 'ollama',
      model: 'llama2', // or 'mistral', 'codellama', etc.
      baseUrl: 'http://localhost:11434',
    },
  },
});
const result = await scraper.run();

| Graph | Purpose | Use Case |
|---|---|---|
| `SmartScraperGraph` | Basic scraping | Extract data from a single URL |
| `SmartScraperMultiGraph` | Multi-URL scraping | Scrape multiple sources (parallel/sequential) |
| `SearchGraph` | Internet search + scrape | Search engines + content extraction |
| `DepthSearchGraph` | Deep analysis | Search + reasoning + comprehensive analysis |
| `CSVScraperGraph` | CSV export | Scrape data → export to CSV |
| `JSONScraperGraph` | JSON export | Schema-validated JSON output |

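
The parallel/sequential distinction for multi-URL scraping in the table can be pictured with plain promises. This is a minimal sketch, not the library's code: `scrapeAll`, `ScrapeFn`, and `fakeScrape` are hypothetical names, and the stub stands in for a real fetch-and-extract step.

```typescript
// Hypothetical signature: any function that turns a URL into extracted data.
type ScrapeFn<T> = (url: string) => Promise<T>;

// Run one scrape per URL, either all at once or one after another.
// Results come back in the same order as the input URLs in both modes.
async function scrapeAll<T>(
  urls: string[],
  scrapeOne: ScrapeFn<T>,
  parallel: boolean,
): Promise<T[]> {
  if (parallel) {
    // Promise.all starts every request immediately and preserves input order.
    return Promise.all(urls.map((url) => scrapeOne(url)));
  }
  const results: T[] = [];
  for (const url of urls) {
    // Awaiting inside the loop means the next request starts only after
    // the previous one has settled.
    results.push(await scrapeOne(url));
  }
  return results;
}

// Stub scraper used only for this illustration.
const fakeScrape: ScrapeFn<string> = async (url) => `scraped ${url}`;

scrapeAll(['https://a.example', 'https://b.example'], fakeScrape, true)
  .then((results) => console.log(results));
```

Parallel mode usually finishes sooner but sends every request at once, so it pairs naturally with rate limiting. Sequential mode is gentler on the target site.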
// Multi-URL scraping
import { SmartScraperMultiGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const scraper = new SmartScraperMultiGraph(
  'Extract company names and descriptions',
  [
    'https://company1.com',
    'https://company2.com',
    'https://company3.com',
  ],
  { llm },
  llm,
  true // parallel execution
);

const result = await scraper.run();

// CSV export with a Zod schema
import { CSVScraperGraph, createLLM } from 'cogniscrape';
import { z } from 'zod';

const schema = z.object({
  products: z.array(z.object({
    name: z.string(),
    price: z.number(),
    rating: z.number().optional(),
  })),
});

// LLM instance passed as the fourth constructor argument
const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const scraper = new CSVScraperGraph(
  'Extract all products with their prices',
  'https://shop.example.com',
  {
    llm: {
      provider: 'gemini',
      model: 'gemini-2.0-flash-exp',
      apiKey: process.env.GEMINI_API_KEY,
    },
    schema,
  },
  llm,
  'products.csv'
);

await scraper.run();

// Internet search + scrape
import { SearchGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const searchGraph = new SearchGraph(
  'Latest news about AI developments in 2026',
  {
    llm,
    searchEngine: 'duckduckgo',
    maxDepth: 3,
  },
  llm
);

const result = await searchGraph.run();

interface ScraperConfig {
  llm: LLMConfig;
  verbose?: boolean; // Enable logging
  headless?: boolean; // Headless browser mode
  timeout?: number; // Request timeout (ms)
  cut?: boolean; // Enable HTML minification
  htmlMode?: boolean; // Skip parsing (use raw HTML)

  // Production features
  proxy?: ProxyConfig; // Proxy configuration
  retry?: RetryConfig; // Retry with backoff
  rateLimit?: RateLimitConfig; // Rate limiting
  cache?: CacheConfig; // Response caching

  // Advanced
  schema?: any; // Zod schema for validation
  additionalInfo?: string; // Extra context for LLM
  reasoning?: boolean; // Enable reasoning mode
}

// Proxy configuration
const config = {
  llm: { /* ... */ },
  proxy: {
    enabled: true,
    proxies: [
      'http://proxy1.com:8080',
      'http://proxy2.com:8080',
    ],
  },
};

// Retry with backoff
const config = {
  llm: { /* ... */ },
  retry: {
    maxRetries: 3,
    initialDelay: 1000,
    maxDelay: 10000,
    backoffMultiplier: 2,
  },
};

// Rate limiting
const config = {
  llm: { /* ... */ },
  rateLimit: {
    maxRequests: 10,
    windowMs: 1000,
    minDelay: 100,
  },
};

// Response caching
const config = {
  llm: { /* ... */ },
  cache: {
    enabled: true,
    ttl: 3600000, // 1 hour
    maxSize: 1000,
  },
};

# Run the test suite
npm test

# Install dependencies
npm install
# Build the project
npm run build
# Watch mode
npm run dev
# Run examples
npx ts-node examples/smart-scraper-gemini.ts

LLM models:
- `OllamaModel` - Local LLM support
- `GeminiModel` - Google Gemini integration
- `createLLM(config)` - Factory function

Graphs:
- `SmartScraperGraph` - Basic web scraping
- `SmartScraperMultiGraph` - Multi-URL scraping
- `SearchGraph` - Search + scrape
- `DepthSearchGraph` - Deep search with reasoning
- `CSVScraperGraph` - Export to CSV
- `JSONScraperGraph` - Export to JSON

Nodes:
- `FetchNode` - Fetch content
- `ParseNode` - Parse & chunk
- `GenerateAnswerNode` - LLM answer generation
- `RAGNode` - Retrieval-Augmented Generation
- `SearchNode` - Internet search
- `MergeNode` - Merge results
- `PDFScraperNode` - PDF extraction
- `XMLScraperNode` - XML parsing

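
To make the idea of composable nodes more concrete, here is a minimal sketch of a fetch, parse, generate-answer pipeline. Everything in it is hypothetical: the `PipelineState` and `PipelineNode` shapes, the `runPipeline` helper, and the stub nodes are assumptions for illustration and do not reflect how the library's own `FetchNode`, `ParseNode`, or `GenerateAnswerNode` are implemented.

```typescript
// Shared state passed from node to node. The shape is hypothetical.
interface PipelineState {
  source: string;
  html?: string;
  chunks?: string[];
  answer?: string;
}

// A node is a named step that takes the state and returns an updated copy.
interface PipelineNode {
  name: string;
  execute(state: PipelineState): Promise<PipelineState>;
}

const fetchNode: PipelineNode = {
  name: 'fetch',
  // Stub: a real node would download `state.source` here.
  async execute(state) {
    return { ...state, html: `<p>Content of ${state.source}</p>` };
  },
};

const parseNode: PipelineNode = {
  name: 'parse',
  // Stub: strip tags, then split the text into chunks of at most 40 characters.
  async execute(state) {
    const text = (state.html ?? '').replace(/<[^>]+>/g, '');
    const chunks: string[] = text.match(/.{1,40}/g) ?? [];
    return { ...state, chunks };
  },
};

const answerNode: PipelineNode = {
  name: 'generate-answer',
  // Stub: a real node would send the chunks and the prompt to an LLM.
  async execute(state) {
    return { ...state, answer: `Summary of ${state.chunks?.length ?? 0} chunk(s)` };
  },
};

// Run the nodes in order, feeding each node the previous node's output.
async function runPipeline(
  nodes: PipelineNode[],
  initial: PipelineState,
): Promise<PipelineState> {
  let state = initial;
  for (const node of nodes) {
    state = await node.execute(state);
  }
  return state;
}

runPipeline([fetchNode, parseNode, answerNode], { source: 'https://example.com' })
  .then((finalState) => console.log(finalState.answer));
```

Because every node shares one interface, a pipeline can be rearranged or extended (for example with a merge or RAG step) without changing the runner.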
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details
- Email: bonderiddhish@gmail.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions

Made with ❤️ for the TypeScript community