🕷️ CogniScrape

Intelligent web scraping with LLMs: a TypeScript library that combines traditional web scraping with Large Language Models to extract structured data.

✨ Features

🤖 Dual LLM Support: Ollama (free/local) + Google Gemini (cloud)
📊 Graph-Based Architecture: Composable, reusable node pipelines
🚀 Production-Ready: Built-in caching, retries, rate limiting, and proxy rotation
🎯 Smart Parsing: Automatic HTML→Markdown conversion and intelligent chunking
✅ Schema Validation: Zod integration for type-safe outputs
📝 Multiple Formats: JSON, CSV, XML, PDF support
🌐 Browser Automation: Playwright for dynamic content
🧠 RAG Integration: Retrieval-Augmented Generation for better accuracy

📦 Installation

npm install cogniscrape

🚀 Quick Start

Basic Web Scraping with Gemini

import { SmartScraperGraph } from 'cogniscrape';

const scraper = new SmartScraperGraph({
  prompt: 'Extract all product names and prices',
  source: 'https://example.com/products',
  config: {
    llm: {
      provider: 'gemini',
      model: 'gemini-2.0-flash-exp',
      apiKey: process.env.GEMINI_API_KEY,
    },
    verbose: true,
  },
});

const result = await scraper.run();
console.log(result);

Using Ollama (100% Free & Local)

import { SmartScraperGraph } from 'cogniscrape';

const scraper = new SmartScraperGraph({
  prompt: 'List all article titles and summaries',
  source: 'https://news.example.com',
  config: {
    llm: {
      provider: 'ollama',
      model: 'llama2',  // or 'mistral', 'codellama', etc.
      baseUrl: 'http://localhost:11434',
    },
  },
});

const result = await scraper.run();

🎯 Available Graphs

| Graph | Purpose | Use Case |
| --- | --- | --- |
| `SmartScraperGraph` | Basic scraping | Extract data from a single URL |
| `SmartScraperMultiGraph` | Multi-URL scraping | Scrape multiple sources (parallel/sequential) |
| `SearchGraph` | Internet search + scrape | Search engines + content extraction |
| `DepthSearchGraph` | Deep analysis | Search + reasoning + comprehensive analysis |
| `CSVScraperGraph` | CSV export | Scrape data → export to CSV |
| `JSONScraperGraph` | JSON export | Schema-validated JSON output |
📚 Examples

Multi-URL Scraping

import { SmartScraperMultiGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const scraper = new SmartScraperMultiGraph(
  'Extract company names and descriptions',
  [
    'https://company1.com',
    'https://company2.com',
    'https://company3.com',
  ],
  { llm },
  llm,
  true // parallel execution
);

const result = await scraper.run();

CSV Export with Schema Validation

import { CSVScraperGraph, createLLM } from 'cogniscrape';
import { z } from 'zod';

// Model instance passed to the graph as its fourth argument
const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

// Expected shape of the extracted data
const schema = z.object({
  products: z.array(z.object({
    name: z.string(),
    price: z.number(),
    rating: z.number().optional(),
  })),
});

const scraper = new CSVScraperGraph(
  'Extract all products with their prices',
  'https://shop.example.com',
  {
    llm: {
      provider: 'gemini',
      model: 'gemini-2.0-flash-exp',
      apiKey: process.env.GEMINI_API_KEY,
    },
    schema,
  },
  llm,
  'products.csv'
);

await scraper.run();
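
How the library serializes rows is not shown in this README. As a rough illustration of what CSV export involves, here is a minimal sketch that turns an array of product records into CSV text with RFC 4180-style quoting. `escapeCsvField`, `toCsv`, and the sample rows are hypothetical and not part of the cogniscrape API.

```typescript
// Hypothetical helpers for illustration; not part of the cogniscrape API.
type Row = Record<string, string | number | undefined>;

// Quote a field when it contains a comma, quote, or line break; double inner quotes.
function escapeCsvField(value: string | number | undefined): string {
  const text = value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text: one header line, then one line per row, in column order.
function toCsv(rows: Row[], columns: string[]): string {
  const header = columns.map(escapeCsvField).join(',');
  const lines = rows.map((row) =>
    columns.map((column) => escapeCsvField(row[column])).join(','),
  );
  return [header, ...lines].join('\r\n');
}

// Made-up sample data shaped like the schema above.
const csv = toCsv(
  [
    { name: 'Desk, oak', price: 199.5, rating: 4.7 },
    { name: 'Lamp', price: 25 },
  ],
  ['name', 'price', 'rating'],
);
```

A missing optional field such as `rating` becomes an empty cell, so every row keeps the same number of columns.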

Internet Search Graph

import { SearchGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const searchGraph = new SearchGraph(
  'Latest news about AI developments in 2026',
  {
    llm,
    searchEngine: 'duckduckgo',
    maxDepth: 3,
  },
  llm
);

const result = await searchGraph.run();

⚙️ Configuration Options

interface ScraperConfig {
  llm: LLMConfig;
  verbose?: boolean;          // Enable logging
  headless?: boolean;         // Headless browser mode
  timeout?: number;           // Request timeout (ms)
  cut?: boolean;              // Enable HTML minification
  htmlMode?: boolean;         // Skip parsing (use raw HTML)
  
  // Production features
  proxy?: ProxyConfig;        // Proxy configuration
  retry?: RetryConfig;        // Retry with backoff
  rateLimit?: RateLimitConfig; // Rate limiting
  cache?: CacheConfig;        // Response caching
  
  // Advanced
  schema?: any;               // Zod schema for validation
  additionalInfo?: string;    // Extra context for LLM
  reasoning?: boolean;        // Enable reasoning mode
}

🔧 Production Features

Proxy Rotation

const config = {
  llm: { /* ... */ },
  proxy: {
    enabled: true,
    proxies: [
      'http://proxy1.com:8080',
      'http://proxy2.com:8080',
    ],
  },
};
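
The README does not say how a proxy is chosen for each request. One common approach is round-robin: each request takes the next entry in `proxies` and wraps around at the end. The sketch below assumes that strategy; `ProxyRotator` is a hypothetical name, not a cogniscrape export.

```typescript
// Hypothetical helper for illustration; not part of the cogniscrape API.
class ProxyRotator {
  private index = 0;
  private proxies: string[];

  constructor(proxies: string[]) {
    if (proxies.length === 0) {
      throw new Error('At least one proxy is required');
    }
    this.proxies = proxies;
  }

  // Return the next proxy in round-robin order, wrapping at the end of the list.
  next(): string {
    const proxy = this.proxies[this.index];
    this.index = (this.index + 1) % this.proxies.length;
    return proxy;
  }
}

const rotator = new ProxyRotator([
  'http://proxy1.com:8080',
  'http://proxy2.com:8080',
]);
```

Successive calls to `rotator.next()` alternate between the two proxies.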

Retry with Exponential Backoff

const config = {
  llm: { /* ... */ },
  retry: {
    maxRetries: 3,
    initialDelay: 1000,
    maxDelay: 10000,
    backoffMultiplier: 2,
  },
};
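
To make the four retry settings concrete, here is a minimal sketch of exponential backoff, assuming the delay before each retry is `initialDelay * backoffMultiplier^attempt`, capped at `maxDelay`. The formula and the `backoffDelay` / `withRetry` helpers are assumptions for illustration; the library's own implementation may differ (for example, by adding jitter).

```typescript
// Field names mirror the retry config above; the local interface is illustrative.
interface RetryConfig {
  maxRetries: number;
  initialDelay: number; // ms
  maxDelay: number; // ms
  backoffMultiplier: number;
}

// Delay before retry number `attempt` (0-based), capped at maxDelay.
function backoffDelay(attempt: number, cfg: RetryConfig): number {
  return Math.min(cfg.initialDelay * cfg.backoffMultiplier ** attempt, cfg.maxDelay);
}

// Run `task`; on failure, wait and retry up to cfg.maxRetries times.
async function withRetry<T>(task: () => Promise<T>, cfg: RetryConfig): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt === cfg.maxRetries) break; // no retries left
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt, cfg)));
    }
  }
  throw lastError;
}
```

Under this assumed formula, the values shown above (`initialDelay: 1000`, `backoffMultiplier: 2`, `maxRetries: 3`) give waits of 1000, 2000, and 4000 ms; the 10000 ms cap would only apply from the fifth retry onward.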

Rate Limiting

const config = {
  llm: { /* ... */ },
  rateLimit: {
    maxRequests: 10,
    windowMs: 1000,
    minDelay: 100,
  },
};
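
One plausible reading of these settings is a sliding window: at most `maxRequests` requests in any `windowMs` span, with at least `minDelay` between consecutive requests. The sketch below implements that interpretation. `SlidingWindowLimiter` is hypothetical, and the semantics are an assumption rather than documented library behavior.

```typescript
// Field names mirror the rateLimit config above; semantics are assumed.
interface RateLimitConfig {
  maxRequests: number; // max requests per window
  windowMs: number; // window length in ms
  minDelay: number; // minimum gap between consecutive requests in ms
}

// Hypothetical limiter for illustration; not part of the cogniscrape API.
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private cfg: RateLimitConfig;

  constructor(cfg: RateLimitConfig) {
    this.cfg = cfg;
  }

  // Milliseconds the caller should wait before sending a request at time `now`.
  waitTime(now: number): number {
    // Forget requests that have already left the window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.cfg.windowMs);
    let wait = 0;
    const last = this.timestamps[this.timestamps.length - 1];
    if (last !== undefined) {
      // Enforce the minimum gap since the most recent request.
      wait = Math.max(wait, last + this.cfg.minDelay - now);
    }
    if (this.timestamps.length >= this.cfg.maxRequests) {
      // Window is full: wait until the oldest request leaves it.
      wait = Math.max(wait, this.timestamps[0] + this.cfg.windowMs - now);
    }
    return wait;
  }

  // Record that a request was sent at time `now`.
  record(now: number): void {
    this.timestamps.push(now);
  }
}
```

Passing the clock in as `now` keeps the limiter deterministic and easy to test; a caller would typically pass `Date.now()`.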

Caching

const config = {
  llm: { /* ... */ },
  cache: {
    enabled: true,
    ttl: 3600000, // 1 hour
    maxSize: 1000,
  },
};
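
As an illustration of how `ttl` and `maxSize` could interact, here is a minimal in-memory cache sketch: entries expire `ttl` milliseconds after being stored, and the oldest entry is evicted once `maxSize` is reached. `TtlCache` and the oldest-first eviction policy are assumptions; the README does not state which eviction strategy the library uses.

```typescript
// Field names mirror the cache config above; the local interface is illustrative.
interface CacheConfig {
  enabled: boolean;
  ttl: number; // ms
  maxSize: number; // max number of entries
}

// Hypothetical cache for illustration; not part of the cogniscrape API.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private cfg: CacheConfig;

  constructor(cfg: CacheConfig) {
    this.cfg = cfg;
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key: string, now: number = Date.now()): V | undefined {
    if (!this.cfg.enabled) return undefined;
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Store a value, evicting the oldest entry when the cache is full.
  set(key: string, value: V, now: number = Date.now()): void {
    if (!this.cfg.enabled) return;
    this.entries.delete(key); // re-insert so this key counts as the newest
    if (this.entries.size >= this.cfg.maxSize) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.cfg.ttl });
  }
}
```

A natural cache key for a scraper would be the source URL, possibly combined with the prompt, so repeated runs within the TTL skip the network fetch.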

🧪 Testing

npm test

🛠️ Development

# Install dependencies
npm install

# Build the project
npm run build

# Watch mode
npm run dev

# Run examples
npx ts-node examples/smart-scraper-gemini.ts

📖 API Reference

Models

OllamaModel - Local LLM support
GeminiModel - Google Gemini integration
createLLM(config) - Factory function

Graphs

SmartScraperGraph - Basic web scraping
SmartScraperMultiGraph - Multi-URL scraping
SearchGraph - Search + scrape
DepthSearchGraph - Deep search with reasoning
CSVScraperGraph - Export to CSV
JSONScraperGraph - Export to JSON

Nodes

FetchNode - Fetch content
ParseNode - Parse & chunk
GenerateAnswerNode - LLM answer generation
RAGNode - Retrieval-Augmented Generation
SearchNode - Internet search
MergeNode - Merge results
PDFScraperNode - PDF extraction
XMLScraperNode - XML parsing
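
The node classes' actual signatures are not documented here. To illustrate the general idea of a node pipeline, the sketch below assumes each node receives a shared state object and returns an updated one. The `GraphNode` interface, `runPipeline`, and the stub nodes are hypothetical and do not reflect the real `FetchNode` or `ParseNode` APIs.

```typescript
// Shared state passed from node to node; the keys used here are illustrative.
type State = Record<string, unknown>;

// Hypothetical node shape; not the cogniscrape node interface.
interface GraphNode {
  name: string;
  execute(state: State): Promise<State>;
}

// Run nodes in order, feeding each node the state produced by the previous one.
async function runPipeline(nodes: GraphNode[], initial: State): Promise<State> {
  let state = initial;
  for (const node of nodes) {
    state = await node.execute(state);
  }
  return state;
}

// Stub nodes standing in for a fetch stage and a parse stage.
const fetchStub: GraphNode = {
  name: 'fetch',
  execute: async (s) => ({ ...s, html: '<h1>Hello</h1>' }),
};
const parseStub: GraphNode = {
  name: 'parse',
  // Strip tags with a naive regex; fine for a stub, not for real HTML.
  execute: async (s) => ({ ...s, text: String(s.html).replace(/<[^>]+>/g, '') }),
};
```

Calling `runPipeline([fetchStub, parseStub], { url: 'https://example.com' })` resolves to a state that carries `url`, `html`, and `text`. Because each node only reads and extends the state, nodes can be reordered or reused across different graphs.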

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT License - see LICENSE file for details

📬 Support

📧 Email: bonderiddhish@gmail.com
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Made with ❤️ for the TypeScript community
