# DocHarvest


A powerful CLI tool for scraping documentation websites and converting them to well-structured markdown files.

[Features](#features) • [Installation](#installation) • [Usage](#usage) • [Examples](#examples) • [Configuration](#configuration) • [Tech Stack](#tech-stack) • [License](#license)


## Features

- **Powerful Scraping**: Handles modern websites with client-side rendering using Puppeteer.
- 📚 **Smart Content Extraction**: Intelligently identifies and extracts main content areas.
- 🖼️ **Image Processing**: Downloads and properly links images in markdown.
- 📊 **Progress Tracking**: Real-time progress bar with statistics.
- 🎨 **Beautiful Output**: Generates clean, well-formatted markdown with proper headings, links, and code blocks.
- 🔍 **Depth Control**: Configure how deep the crawler should go.
- 🛠️ **Highly Configurable**: Extensive options for customizing the scraping process.

## Installation

### From GitHub

Clone the repository and install its dependencies:

```sh
# Clone the repository
git clone https://github.com/joelm-code/docharvest.git

# Navigate to the project directory
cd docharvest

# Install dependencies
npm install

# Link the package globally (optional)
npm link
```

Linking makes the `docharvest` command available globally on your system.

## Usage

DocHarvest can be used in two modes: interactive CLI mode or command-line mode.

### Interactive Mode

Run the command without arguments to enter interactive mode:

```sh
docharvest
```

A series of prompts will guide you through configuring your scraping job.

### Command-Line Mode

For automation or scripting, use command-line arguments:

```sh
docharvest <url> [options]
```

### Basic Options

| Option | Description |
| --- | --- |
| `-o, --output <dir>` | Output directory (default: `./docs`) |
| `-d, --delay <ms>` | Delay between requests in milliseconds (default: `1000`) |
| `-m, --max-depth <n>` | Maximum crawl depth (default: `3`; `0` for unlimited) |
| `-i, --images` | Download and include images (default: `true`) |
| `--debug` | Enable debug mode with verbose logging |

### Advanced Options

| Option | Description |
| --- | --- |
| `--no-images` | Do not download images |
| `--headless` | Run the browser in headless mode (default: `true`) |
| `--no-headless` | Run the browser in non-headless mode |
| `--timeout <ms>` | Request timeout in milliseconds (default: `30000`) |
| `--retries <n>` | Number of retries for failed requests (default: `3`) |
| `--concurrency <n>` | Maximum concurrent requests (default: `1`) |
| `--save-html` | Save raw HTML files |

## Examples

### Basic Usage

```sh
# Scrape a documentation site with default settings
docharvest https://docs.example.com
```

### Advanced Usage

```sh
# Scrape with custom settings
docharvest https://docs.example.com -o ./my-docs -d 2000 -m 5 --no-images

# Scrape a large site with unlimited depth
docharvest https://docs.example.com -m 0 --timeout 60000 --retries 5
```

### Programmatic Usage

You can also use DocHarvest programmatically in your Node.js applications:

```js
const { Scraper } = require('docharvest');

async function scrapeDocumentation() {
  const scraper = new Scraper({
    scraping: {
      delay: 1500,
      maxDepth: 3,
      downloadImages: true
    },
    output: {
      directory: './docs'
    }
  });

  await scraper.init('https://docs.example.com', './my-docs');
  await scraper.start();

  console.log('Documentation scraped successfully!');
}

scrapeDocumentation().catch(console.error);
```

## Configuration

DocHarvest is highly configurable. Here are the main configuration sections:

### Scraping Options

Controls how the scraper behaves when crawling the website.

```js
{
  delay: 1000,           // Delay between requests in milliseconds
  maxDepth: 3,           // Maximum depth to crawl (0 = unlimited)
  downloadImages: true,  // Whether to download images
  timeout: 30000,        // Request timeout in milliseconds
  maxRetries: 3,         // Maximum retries for failed requests
  concurrency: 1,        // Maximum concurrent requests
  respectRobotsTxt: true // Whether to respect robots.txt
}
```

### Browser Options

Controls the Puppeteer browser instance.

```js
{
  headless: true,        // Whether to run in headless mode
  width: 1280,           // Browser viewport width
  height: 800,           // Browser viewport height
  args: [                // Browser launch arguments
    '--no-sandbox',
    '--disable-setuid-sandbox',
    // ...
  ]
}
```

### Output Options

Controls how the markdown files are generated.

```js
{
  directory: './docs',      // Default output directory
  extension: '.md',         // File extension for markdown files
  saveRawHtml: false        // Whether to save raw HTML
}
```
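The three sections above can be combined into a single options object like the one passed to the `Scraper` constructor in Programmatic Usage. A minimal sketch follows; the key names are taken from the sections above, but note that only `scraping` and `output` appear in the Programmatic Usage example, so the `browser` key is an assumption:

```javascript
// Combined DocHarvest configuration (sketch).
// The `browser` section is assumed to be accepted alongside
// `scraping` and `output`; verify against your installed version.
const config = {
  scraping: {
    delay: 1000,            // Delay between requests in milliseconds
    maxDepth: 3,            // Maximum depth to crawl (0 = unlimited)
    downloadImages: true,   // Whether to download images
    respectRobotsTxt: true  // Whether to respect robots.txt
  },
  browser: {
    headless: true,         // Run Puppeteer without a visible window
    width: 1280,            // Browser viewport width
    height: 800             // Browser viewport height
  },
  output: {
    directory: './docs',    // Default output directory
    extension: '.md',       // File extension for markdown files
    saveRawHtml: false      // Whether to save raw HTML
  }
};
```

You would then pass this object to `new Scraper(config)` and call `init()` and `start()` as shown in the Programmatic Usage example.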

## Tech Stack

DocHarvest is built with the following technologies:

- **Node.js**: JavaScript runtime
- **Puppeteer**: Headless Chrome browser for rendering JavaScript-heavy pages
- **Cheerio**: Fast and flexible HTML parsing
- **Turndown**: HTML-to-Markdown converter
- **Commander**: Command-line interface
- **Inquirer**: Interactive CLI prompts
- **Chalk**: Terminal string styling
- **Ora**: Elegant terminal spinners
- **p-retry**: Retry failed promises
- **fs-extra**: Enhanced file system methods
- **cli-progress**: Progress bars for CLI applications
- **figlet**: ASCII art text generation

## License

MIT © Joel Mascarenhas

---

Made with ❤️ by Joel Mascarenhas
