Riddhish1/CogniScrape

🕷️ CogniScrape


Intelligent Web Scraping with LLMs - A TypeScript library that combines traditional web scraping with Large Language Models for intelligent, structured data extraction.

✨ Features

  • 🤖 Dual LLM Support: Ollama (free/local) + Google Gemini (cloud)
  • 📊 Graph-Based Architecture: Composable, reusable node pipelines
  • 🚀 Production-Ready: Built-in caching, retries, rate limiting, and proxy rotation
  • 🎯 Smart Parsing: Automatic HTML→Markdown conversion and intelligent chunking
  • ✅ Schema Validation: Zod integration for type-safe outputs
  • 📁 Multiple Formats: JSON, CSV, XML, PDF support
  • 🌐 Browser Automation: Playwright for dynamic content
  • 🧠 RAG Integration: Retrieval-Augmented Generation for better accuracy

📦 Installation

npm install cogniscrape

🚀 Quick Start

Basic Web Scraping with Gemini

import { SmartScraperGraph } from 'cogniscrape';

const scraper = new SmartScraperGraph({
  prompt: 'Extract all product names and prices',
  source: 'https://example.com/products',
  config: {
    llm: {
      provider: 'gemini',
      model: 'gemini-2.0-flash-exp',
      apiKey: process.env.GEMINI_API_KEY,
    },
    verbose: true,
  },
});

const result = await scraper.run();
console.log(result);

Using Ollama (100% Free & Local)

import { SmartScraperGraph } from 'cogniscrape';

const scraper = new SmartScraperGraph({
  prompt: 'List all article titles and summaries',
  source: 'https://news.example.com',
  config: {
    llm: {
      provider: 'ollama',
      model: 'llama2',  // or 'mistral', 'codellama', etc.
      baseUrl: 'http://localhost:11434',
    },
  },
});

const result = await scraper.run();

🎯 Available Graphs

| Graph | Purpose | Use Case |
| --- | --- | --- |
| SmartScraperGraph | Basic scraping | Extract data from a single URL |
| SmartScraperMultiGraph | Multi-URL scraping | Scrape multiple sources (parallel/sequential) |
| SearchGraph | Internet search + scrape | Search engines + content extraction |
| DepthSearchGraph | Deep analysis | Search + reasoning + comprehensive analysis |
| CSVScraperGraph | CSV export | Scrape data → export to CSV |
| JSONScraperGraph | JSON export | Schema-validated JSON output |

📚 Examples

Multi-URL Scraping

import { SmartScraperMultiGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const scraper = new SmartScraperMultiGraph(
  'Extract company names and descriptions',
  [
    'https://company1.com',
    'https://company2.com',
    'https://company3.com',
  ],
  { llm },
  llm,
  true // parallel execution
);

const result = await scraper.run();
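
The final `true` argument above selects parallel execution. As a rough illustration of what that choice means, and not a description of how SmartScraperMultiGraph is implemented internally, the hypothetical helper `runOverUrls` below contrasts the two modes using only standard promises. The `worker` callback and `fakeScrape` are stand-ins for whatever scrapes a single URL.

```typescript
// Hypothetical helper for illustration; not part of cogniscrape.
async function runOverUrls<T>(
  urls: string[],
  worker: (url: string) => Promise<T>,
  parallel: boolean
): Promise<T[]> {
  if (parallel) {
    // Start every request at once; results keep the input order.
    return Promise.all(urls.map((url) => worker(url)));
  }
  // Sequential: wait for each URL to finish before starting the next.
  const results: T[] = [];
  for (const url of urls) {
    results.push(await worker(url));
  }
  return results;
}

// Stand-in worker that pretends to scrape a page.
const fakeScrape = async (url: string): Promise<string> => `scraped:${url}`;

runOverUrls(['https://company1.com', 'https://company2.com'], fakeScrape, true)
  .then((results) => console.log(results));
```

Parallel mode tends to finish sooner but sends all requests at once, so it is usually worth pairing with rate limiting when the URLs share a host.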

CSV Export with Schema Validation

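import { CSVScraperGraph, createLLM } from 'cogniscrape';
import { z } from 'zod';

// Shape the extracted data must satisfy
const schema = z.object({
  products: z.array(z.object({
    name: z.string(),
    price: z.number(),
    rating: z.number().optional(),
  })),
});

// LLM settings shared by the config and the model instance
const llmConfig = {
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
};

// Model instance passed as the fourth argument, created as in the other examples
const llm = createLLM(llmConfig);

const scraper = new CSVScraperGraph(
  'Extract all products with their prices',
  'https://shop.example.com',
  {
    llm: llmConfig,
    schema,
  },
  llm,
  'products.csv' // output file
);

await scraper.run();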
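
For readers curious about the export step, here is a minimal sketch of turning validated rows into CSV text. It is an assumption about what such an export involves, not the code CSVScraperGraph runs; `toCsv`, `escapeCsv`, and the sample rows are hypothetical. Quoting follows the common RFC 4180 convention of wrapping fields that contain commas, quotes, or newlines.

```typescript
// Hypothetical helpers for illustration; not part of cogniscrape.
type Row = Record<string, string | number | undefined>;

function escapeCsv(value: string | number | undefined): string {
  const text = value === undefined ? '' : String(value);
  // Quote the field if it contains a comma, quote, or line break.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(rows: Row[], columns: string[]): string {
  const header = columns.map(escapeCsv).join(',');
  const lines = rows.map((row) =>
    columns.map((column) => escapeCsv(row[column])).join(',')
  );
  return [header, ...lines].join('\n');
}

const rows: Row[] = [
  { name: 'Desk, oak', price: 199.5, rating: 4.5 },
  { name: 'Lamp', price: 25 }, // rating is optional in the schema
];

console.log(toCsv(rows, ['name', 'price', 'rating']));
```

Missing optional fields such as `rating` become empty cells, which keeps every row the same width.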

Internet Search Graph

import { SearchGraph, createLLM } from 'cogniscrape';

const llm = createLLM({
  provider: 'gemini',
  model: 'gemini-2.0-flash-exp',
  apiKey: process.env.GEMINI_API_KEY,
});

const searchGraph = new SearchGraph(
  'Latest news about AI developments in 2026',
  {
    llm,
    searchEngine: 'duckduckgo',
    maxDepth: 3,
  },
  llm
);

const result = await searchGraph.run();

⚙️ Configuration Options

interface ScraperConfig {
  llm: LLMConfig;
  verbose?: boolean;          // Enable logging
  headless?: boolean;         // Headless browser mode
  timeout?: number;           // Request timeout (ms)
  cut?: boolean;              // Enable HTML minification
  htmlMode?: boolean;         // Skip parsing (use raw HTML)
  
  // Production features
  proxy?: ProxyConfig;        // Proxy configuration
  retry?: RetryConfig;        // Retry with backoff
  rateLimit?: RateLimitConfig; // Rate limiting
  cache?: CacheConfig;        // Response caching
  
  // Advanced
  schema?: any;               // Zod schema for validation
  additionalInfo?: string;    // Extra context for LLM
  reasoning?: boolean;        // Enable reasoning mode
}
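
As an illustration, a config object that sets several of these options together might look like the following. The field names come from the interface above and the option shapes shown elsewhere in this README; the specific values are arbitrary examples, and whether every combination is supported is an assumption.

```typescript
// Example values only; field names follow the ScraperConfig interface.
const config = {
  llm: {
    provider: 'ollama',
    model: 'llama2',
    baseUrl: 'http://localhost:11434',
  },
  verbose: true,  // enable logging
  headless: true, // headless browser mode
  timeout: 30000, // request timeout in ms (30 seconds)
  retry: { maxRetries: 3, initialDelay: 1000, maxDelay: 10000, backoffMultiplier: 2 },
  cache: { enabled: true, ttl: 3600000, maxSize: 1000 },
};

console.log(Object.keys(config));
```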

🔧 Production Features

Proxy Rotation

const config = {
  llm: { /* ... */ },
  proxy: {
    enabled: true,
    proxies: [
      'http://proxy1.com:8080',
      'http://proxy2.com:8080',
    ],
  },
};
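
The README does not say how a proxy is chosen for each request. One common strategy is round-robin, sketched below under that assumption; `RoundRobinProxyPool` is a hypothetical class, not part of the library.

```typescript
// Hypothetical sketch of round-robin rotation; cogniscrape may rotate differently.
class RoundRobinProxyPool {
  private next = 0;
  private readonly proxies: readonly string[];

  constructor(proxies: readonly string[]) {
    if (proxies.length === 0) {
      throw new Error('At least one proxy URL is required');
    }
    this.proxies = proxies;
  }

  /** Returns the next proxy URL, wrapping around at the end of the list. */
  pick(): string {
    const proxy = this.proxies[this.next];
    this.next = (this.next + 1) % this.proxies.length;
    return proxy;
  }
}

const pool = new RoundRobinProxyPool([
  'http://proxy1.com:8080',
  'http://proxy2.com:8080',
]);

// Three consecutive requests: the third wraps back to the first proxy.
const picks = [pool.pick(), pool.pick(), pool.pick()];
console.log(picks);
```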

Retry with Exponential Backoff

const config = {
  llm: { /* ... */ },
  retry: {
    maxRetries: 3,
    initialDelay: 1000,
    maxDelay: 10000,
    backoffMultiplier: 2,
  },
};
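
To make the four retry fields concrete, the sketch below computes the wait before each retry as `initialDelay * backoffMultiplier^attempt`, capped at `maxDelay`. This is the usual meaning of exponential backoff; that the library uses exactly this formula, that delays are in milliseconds, and that `maxRetries` counts retries after the first attempt are assumptions. `backoffDelays` and `withRetry` are hypothetical helpers.

```typescript
// Hypothetical helpers for illustration; not part of cogniscrape.
interface RetryOptions {
  maxRetries: number;
  initialDelay: number;      // ms before the first retry
  maxDelay: number;          // upper bound for any single wait, in ms
  backoffMultiplier: number; // growth factor between retries
}

/** Wait time before each retry, in order. */
function backoffDelays(opts: RetryOptions): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < opts.maxRetries; attempt++) {
    const raw = opts.initialDelay * Math.pow(opts.backoffMultiplier, attempt);
    delays.push(Math.min(raw, opts.maxDelay));
  }
  return delays;
}

/** Runs `task`, retrying on failure with the delays computed above. */
async function withRetry<T>(
  task: () => Promise<T>,
  opts: RetryOptions,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  const delays = backoffDelays(opts);
  let lastError: unknown;
  // One initial attempt plus up to `maxRetries` retries.
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}

const delays = backoffDelays({
  maxRetries: 3,
  initialDelay: 1000,
  maxDelay: 10000,
  backoffMultiplier: 2,
});
console.log(delays); // [1000, 2000, 4000]
```

With the values from the config above, the cap of 10000 ms only takes effect from the fifth retry onward (1000 × 2⁴ = 16000).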

Rate Limiting

const config = {
  llm: { /* ... */ },
  rateLimit: {
    maxRequests: 10,
    windowMs: 1000,
    minDelay: 100,
  },
};
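
One plausible reading of these fields is: at most `maxRequests` requests per `windowMs` milliseconds, with at least `minDelay` milliseconds between consecutive requests. The sliding-window sketch below implements that reading. It is an assumption about the semantics, not the library's limiter, and `SlidingWindowLimiter` is a hypothetical class. The clock is passed in so the behavior is easy to follow.

```typescript
// Hypothetical sketch; cogniscrape's own limiter may work differently.
class SlidingWindowLimiter {
  private sent: number[] = []; // timestamps (ms) of accepted requests
  private readonly maxRequests: number;
  private readonly windowMs: number;
  private readonly minDelay: number;

  constructor(maxRequests: number, windowMs: number, minDelay: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.minDelay = minDelay;
  }

  /**
   * Returns 0 and records the request if it may be sent at time `now`;
   * otherwise returns how many milliseconds the caller should wait.
   */
  tryAcquire(now: number): number {
    // Forget requests that have left the window.
    this.sent = this.sent.filter((t) => now - t < this.windowMs);

    let wait = 0;
    if (this.sent.length > 0) {
      const sinceLast = now - this.sent[this.sent.length - 1];
      if (sinceLast < this.minDelay) wait = this.minDelay - sinceLast;
    }
    if (this.sent.length >= this.maxRequests) {
      // Window is full: wait until the oldest request expires.
      wait = Math.max(wait, this.sent[0] + this.windowMs - now);
    }
    if (wait === 0) this.sent.push(now);
    return wait;
  }
}

// Small limits so the effect is visible: 2 requests per second, 100 ms apart.
const limiter = new SlidingWindowLimiter(2, 1000, 100);
const waits = [0, 50, 100, 300].map((now) => limiter.tryAcquire(now));
console.log(waits); // [0, 50, 0, 700]
```

The request at 50 ms is too close to the previous one, and the request at 300 ms finds the window full until the first request expires at 1000 ms.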

Caching

const config = {
  llm: { /* ... */ },
  cache: {
    enabled: true,
    ttl: 3600000, // 1 hour
    maxSize: 1000,
  },
};
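
The `ttl` and `maxSize` options suggest a time-limited cache with a bounded number of entries. The sketch below shows one way such a cache can behave, assuming `ttl` is in milliseconds and that the oldest entry is evicted when the cache is full; the library's eviction policy is not documented here, and `TtlCache` is a hypothetical class.

```typescript
// Hypothetical sketch; not cogniscrape's cache implementation.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttl: number;
  private readonly maxSize: number;

  constructor(ttl: number, maxSize: number) {
    this.ttl = ttl;
    this.maxSize = maxSize;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.delete(key); // re-inserting makes this key the newest
    if (this.entries.size >= this.maxSize) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttl });
  }
}

const cache = new TtlCache<string>(3600000, 2); // 1 hour TTL, room for 2 pages
cache.set('https://a.example', '<html>A</html>', 0);
cache.set('https://b.example', '<html>B</html>', 0);
cache.set('https://c.example', '<html>C</html>', 0); // evicts a.example

console.log(cache.get('https://a.example', 1));       // undefined (evicted)
console.log(cache.get('https://b.example', 1));       // <html>B</html>
console.log(cache.get('https://b.example', 3600000)); // undefined (expired)
```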

🧪 Testing

npm test

🛠️ Development

# Install dependencies
npm install

# Build the project
npm run build

# Watch mode
npm run dev

# Run examples
npx ts-node examples/smart-scraper-gemini.ts

📖 API Reference

Models

  • OllamaModel - Local LLM support
  • GeminiModel - Google Gemini integration
  • createLLM(config) - Factory function

Graphs

  • SmartScraperGraph - Basic web scraping
  • SmartScraperMultiGraph - Multi-URL scraping
  • SearchGraph - Search + scrape
  • DepthSearchGraph - Deep search with reasoning
  • CSVScraperGraph - Export to CSV
  • JSONScraperGraph - Export to JSON

Nodes

  • FetchNode - Fetch content
  • ParseNode - Parse & chunk
  • GenerateAnswerNode - LLM answer generation
  • RAGNode - Retrieval-Augmented Generation
  • SearchNode - Internet search
  • MergeNode - Merge results
  • PDFScraperNode - PDF extraction
  • XMLScraperNode - XML parsing
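
The node list hints at how a graph is assembled: each node does one step and hands its result to the next. The sketch below illustrates that idea with a shared state object. The `PipelineNode` interface, `runPipeline`, and the three stub nodes are all hypothetical and synchronous for brevity; the real node classes likely have a different, asynchronous API.

```typescript
// Hypothetical interfaces for illustration; the library's node API may differ.
type State = Record<string, unknown>;

interface PipelineNode {
  name: string;
  execute(state: State): State; // returns the keys this node adds or updates
}

function runPipeline(nodes: PipelineNode[], initial: State): State {
  // Each node sees the accumulated state and merges its output into it.
  return nodes.reduce<State>(
    (state, node) => ({ ...state, ...node.execute(state) }),
    initial
  );
}

// Stand-in for a fetch step: returns fixed HTML instead of hitting the network.
const fetchStub: PipelineNode = {
  name: 'fetch',
  execute: () => ({ html: '<h1>Desk</h1><p>Price: 199</p>' }),
};

// Stand-in for a parse step: strips tags and collapses whitespace.
const parseStub: PipelineNode = {
  name: 'parse',
  execute: (state) => ({
    text: String(state['html']).replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim(),
  }),
};

// Stand-in for an answer step: a real node would call the LLM here.
const answerStub: PipelineNode = {
  name: 'answer',
  execute: (state) => ({ answer: `Page text: ${String(state['text'])}` }),
};

const finalState = runPipeline([fetchStub, parseStub, answerStub], {
  prompt: 'Extract the product and price',
});
console.log(finalState['answer']); // Page text: Desk Price: 199
```

Because every step reads from and writes to the same state, steps can be reordered or swapped without changing the runner.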

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT License - see LICENSE file for details

📬 Support


Made with ❤️ for the TypeScript community
