# DocHarvest


A powerful CLI tool for scraping documentation websites and converting them to well-structured markdown files.

[Features](#features) • [Installation](#installation) • [Usage](#usage) • [Examples](#examples) • [Configuration](#configuration) • [Tech Stack](#tech-stack) • [License](#license)


## Features

- **Powerful Scraping**: Handles modern websites with client-side rendering using Puppeteer.
- 📚 **Smart Content Extraction**: Intelligently identifies and extracts main content areas.
- 🖼️ **Image Processing**: Downloads and properly links images in markdown.
- 📊 **Progress Tracking**: Real-time progress bar with statistics.
- 🎨 **Beautiful Output**: Generates clean, well-formatted markdown with proper headings, links, and code blocks.
- 🔍 **Depth Control**: Configure how deep the crawler should go.
- 🛠️ **Highly Configurable**: Extensive options for customizing the scraping process.

## Installation

### From GitHub

Clone the repository and install its dependencies:

```sh
# Clone the repository
git clone https://github.com/joelm-code/docharvest.git

# Navigate to the project directory
cd docharvest

# Install dependencies
npm install

# Link the package globally (optional)
npm link
```

Linking makes the `docharvest` command available globally on your system.

## Usage

DocHarvest can be used in two modes: interactive CLI mode or command-line mode.

### Interactive Mode

Run the command without arguments to enter interactive mode:

```sh
docharvest
```

A series of prompts will guide you through configuring your scraping job.

### Command-Line Mode

For automation or scripting, use command-line arguments:

```sh
docharvest <url> [options]
```

### Basic Options

| Option | Description |
| --- | --- |
| `-o, --output <dir>` | Output directory (default: `./docs`) |
| `-d, --delay <ms>` | Delay between requests in milliseconds (default: `1000`) |
| `-m, --max-depth <n>` | Maximum crawl depth (default: `3`; `0` for unlimited) |
| `-i, --images` | Download and include images (default: `true`) |
| `--debug` | Enable debug mode with verbose logging |

### Advanced Options

| Option | Description |
| --- | --- |
| `--no-images` | Do not download images |
| `--headless` | Run the browser in headless mode (default: `true`) |
| `--no-headless` | Run the browser in non-headless mode |
| `--timeout <ms>` | Request timeout in milliseconds (default: `30000`) |
| `--retries <n>` | Number of retries for failed requests (default: `3`) |
| `--concurrency <n>` | Maximum concurrent requests (default: `1`) |
| `--save-html` | Save raw HTML files |

## Examples

### Basic Usage

```sh
# Scrape a documentation site with default settings
docharvest https://docs.example.com
```

### Advanced Usage

```sh
# Scrape with custom settings
docharvest https://docs.example.com -o ./my-docs -d 2000 -m 5 --no-images

# Scrape a large site with unlimited depth
docharvest https://docs.example.com -m 0 --timeout 60000 --retries 5
```

### Programmatic Usage

You can also use DocHarvest programmatically in your Node.js applications:

```js
const { Scraper } = require('docharvest');

async function scrapeDocumentation() {
  const scraper = new Scraper({
    scraping: {
      delay: 1500,
      maxDepth: 3,
      downloadImages: true
    },
    output: {
      directory: './docs'
    }
  });

  await scraper.init('https://docs.example.com', './my-docs');
  await scraper.start();

  console.log('Documentation scraped successfully!');
}

scrapeDocumentation().catch(console.error);
```

## Configuration

DocHarvest is highly configurable. Here are the main configuration sections:

### Scraping Options

Controls how the scraper behaves when crawling the website.

```js
{
  delay: 1000,           // Delay between requests in milliseconds
  maxDepth: 3,           // Maximum depth to crawl (0 = unlimited)
  downloadImages: true,  // Whether to download images
  timeout: 30000,        // Request timeout in milliseconds
  maxRetries: 3,         // Maximum retries for failed requests
  concurrency: 1,        // Maximum concurrent requests
  respectRobotsTxt: true // Whether to respect robots.txt
}
```

### Browser Options

Controls the Puppeteer browser instance.

```js
{
  headless: true,        // Whether to run in headless mode
  width: 1280,           // Browser viewport width
  height: 800,           // Browser viewport height
  args: [                // Browser launch arguments
    '--no-sandbox',
    '--disable-setuid-sandbox',
    // ...
  ]
}
```

### Output Options

Controls how the markdown files are generated.

```js
{
  directory: './docs',      // Default output directory
  extension: '.md',         // File extension for markdown files
  saveRawHtml: false        // Whether to save raw HTML
}
```
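The three sections above can be combined into a single options object like the one passed to the `Scraper` constructor in Programmatic Usage. A minimal sketch follows; the key names are taken from the sections above, but note that only `scraping` and `output` appear in the Programmatic Usage example, so the `browser` key is an assumption:

```javascript
// Combined DocHarvest configuration (sketch).
// The `browser` section is assumed to be accepted alongside
// `scraping` and `output`; verify against your installed version.
const config = {
  scraping: {
    delay: 1000,            // Delay between requests in milliseconds
    maxDepth: 3,            // Maximum depth to crawl (0 = unlimited)
    downloadImages: true,   // Whether to download images
    respectRobotsTxt: true  // Whether to respect robots.txt
  },
  browser: {
    headless: true,         // Run Puppeteer without a visible window
    width: 1280,            // Browser viewport width
    height: 800             // Browser viewport height
  },
  output: {
    directory: './docs',    // Default output directory
    extension: '.md',       // File extension for markdown files
    saveRawHtml: false      // Whether to save raw HTML
  }
};
```

You would then pass this object to `new Scraper(config)` and call `init()` and `start()` as shown in the Programmatic Usage example.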

## Tech Stack

DocHarvest is built with the following technologies:

- **Node.js**: JavaScript runtime
- **Puppeteer**: Headless Chrome browser for rendering JavaScript-heavy pages
- **Cheerio**: Fast and flexible HTML parsing
- **Turndown**: HTML-to-Markdown converter
- **Commander**: Command-line interface
- **Inquirer**: Interactive CLI prompts
- **Chalk**: Terminal string styling
- **Ora**: Elegant terminal spinners
- **p-retry**: Retry failed promises
- **fs-extra**: Enhanced file system methods
- **cli-progress**: Progress bars for CLI applications
- **figlet**: ASCII art text generation

## License

MIT © Joel Mascarenhas

---

Made with ❤️ by Joel Mascarenhas
