# DocHarvest

A powerful CLI tool for scraping documentation websites and converting them to well-structured markdown files.
Features • Installation • Usage • Examples • Configuration • Tech Stack • License
## Features

- ✨ **Powerful Scraping**: Handles modern websites with client-side rendering using Puppeteer.
- 📚 **Smart Content Extraction**: Intelligently identifies and extracts main content areas.
- 🖼️ **Image Processing**: Downloads images and links them correctly in the generated markdown.
- 📊 **Progress Tracking**: Real-time progress bar with statistics.
- 🎨 **Beautiful Output**: Generates clean, well-formatted markdown with proper headings, links, and code blocks.
- 🔍 **Depth Control**: Configure how deep the crawler should go.
- 🛠️ **Highly Configurable**: Extensive options for customizing the scraping process.
## Installation

Clone the repository and install dependencies:

```bash
# Clone the repository
git clone https://github.com/joelm-code/docharvest.git

# Navigate to the project directory
cd docharvest

# Install dependencies
npm install

# Link the package globally (optional)
npm link
```

This will make the `docharvest` command available globally on your system.
## Usage

DocHarvest can be used in two modes: interactive mode or command-line mode.

### Interactive Mode

Simply run the command without arguments to enter interactive mode:

```bash
docharvest
```

This will guide you through a series of prompts to configure your scraping job.
### Command-Line Mode

For automation or scripting, use command-line arguments:

```bash
docharvest <url> [options]
```

| Option | Description |
|---|---|
| `-o, --output <dir>` | Output directory (default: `./docs`) |
| `-d, --delay <ms>` | Delay between requests in milliseconds (default: `1000`) |
| `-m, --max-depth <n>` | Maximum crawl depth (default: `3`; `0` for unlimited) |
| `-i, --images` | Download and include images (default: `true`) |
| `--no-images` | Do not download images |
| `--debug` | Enable debug mode with verbose logging |
| `--headless` | Run browser in headless mode (default: `true`) |
| `--no-headless` | Run browser in non-headless mode |
| `--timeout <ms>` | Request timeout in milliseconds (default: `30000`) |
| `--retries <n>` | Number of retries for failed requests (default: `3`) |
| `--concurrency <n>` | Maximum concurrent requests (default: `1`) |
| `--save-html` | Save raw HTML files |
## Examples

```bash
# Scrape a documentation site with default settings
docharvest https://docs.example.com

# Scrape with custom settings
docharvest https://docs.example.com -o ./my-docs -d 2000 -m 5 --no-images

# Scrape a large site with unlimited depth
docharvest https://docs.example.com -m 0 --timeout 60000 --retries 5
```

### Programmatic Usage

You can also use DocHarvest programmatically in your Node.js applications:
```javascript
const { Scraper } = require('docharvest');

async function scrapeDocumentation() {
  const scraper = new Scraper({
    scraping: {
      delay: 1500,
      maxDepth: 3,
      downloadImages: true
    },
    output: {
      directory: './docs'
    }
  });

  await scraper.init('https://docs.example.com', './my-docs');
  await scraper.start();

  console.log('Documentation scraped successfully!');
}

scrapeDocumentation().catch(console.error);
```

## Configuration

DocHarvest is highly configurable. Here are the main configuration sections:
### Scraping

Controls how the scraper behaves when crawling the website.

```javascript
{
  delay: 1000,             // Delay between requests in milliseconds
  maxDepth: 3,             // Maximum depth to crawl (0 = unlimited)
  downloadImages: true,    // Whether to download images
  timeout: 30000,          // Request timeout in milliseconds
  maxRetries: 3,           // Maximum retries for failed requests
  concurrency: 1,          // Maximum concurrent requests
  respectRobotsTxt: true   // Whether to respect robots.txt
}
```

### Browser

Controls the Puppeteer browser instance.
```javascript
{
  headless: true,   // Whether to run in headless mode
  width: 1280,      // Browser viewport width
  height: 800,      // Browser viewport height
  args: [           // Browser launch arguments
    '--no-sandbox',
    '--disable-setuid-sandbox',
    // ...
  ]
}
```

### Output

Controls how the markdown files are generated.
```javascript
{
  directory: './docs',   // Default output directory
  extension: '.md',      // File extension for markdown files
  saveRawHtml: false     // Whether to save raw HTML
}
```

## Tech Stack

DocHarvest is built with the following technologies:
- Node.js: JavaScript runtime
- Puppeteer: Headless Chrome browser for rendering JavaScript-heavy pages
- Cheerio: Fast and flexible HTML parsing
- Turndown: HTML to Markdown converter
- Commander: Command-line interface
- Inquirer: Interactive CLI prompts
- Chalk: Terminal string styling
- Ora: Elegant terminal spinners
- p-retry: Retry failed promises
- fs-extra: Enhanced file system methods
- cli-progress: Progress bars for CLI applications
- figlet: ASCII art text generation
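The stack above boils down to a three-stage pipeline: render the page (Puppeteer), locate the main content (Cheerio), and convert it to markdown (Turndown). The dependency-free sketch below mimics only the last stage on a tiny HTML string with naive regex rules; it is an illustration of the idea, not DocHarvest's implementation, which handles real-world HTML through the libraries listed above.

```javascript
// Sketch: a toy HTML-to-markdown converter illustrating what Turndown does
// for real. Naive regexes only -- do not use this on actual web pages.
function htmlToMarkdown(html) {
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/g, '# $1\n')                 // headings
    .replace(/<h2[^>]*>(.*?)<\/h2>/g, '## $1\n')
    .replace(/<a href="([^"]*)"[^>]*>(.*?)<\/a>/g, '[$2]($1)')  // links
    .replace(/<code>(.*?)<\/code>/g, '`$1`')                    // inline code
    .replace(/<p[^>]*>(.*?)<\/p>/g, '$1\n')                     // paragraphs
    .replace(/<[^>]+>/g, '')                                    // strip leftovers
    .trim();
}

const md = htmlToMarkdown(
  '<h1>Getting Started</h1><p>See the <a href="/api">API docs</a> for <code>init()</code>.</p>'
);
console.log(md);
// → # Getting Started
//   See the [API docs](/api) for `init()`.
```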
## License

MIT © Joel Mascarenhas
Made with ❤️ by Joel Mascarenhas
