JavaScript Web Scraping

This document contains a list of libraries and resources for web scraping in JavaScript.

Libraries

Note: All selected libraries are actively maintained.

Network

HTTP Clients

fetch: A browser-compatible implementation of the fetch() function, built into Node.js
axios: A Promise-based HTTP client for the browser and Node.js [Axios Proxy Integration]
node-fetch: A light-weight module that brings the Fetch API to Node.js [node-fetch Proxy Integration]
undici: An HTTP/1.1 client, written from scratch for Node.js
superagent: A small, progressive client-side HTTP request library and Node.js module with the same API, supporting many high-level HTTP client features [SuperAgent Proxy Integration]
urllib: A library to request HTTP(s) URLs in a complex world
node-libcurl: libcurl bindings for Node.js
got: A human-friendly and powerful HTTP request library for Node.js
bent: A functional HTTP client for Node.js with async/await
needle: A nimble, streamable HTTP client for Node.js with proxy, iconv, cookie, deflate, and multipart support
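
As a rough illustration of the first entry in this list, the sketch below downloads a page's HTML with the built-in fetch() function. The helper name `fetchHtml`, the `fetchImpl` parameter, and the User-Agent string are assumptions made for this example and do not come from any of the listed libraries. A global fetch() requires Node.js 18 or later.

```javascript
// Minimal sketch: download a page's HTML with the fetch() built into Node.js 18+.
// `fetchHtml` and the User-Agent value are hypothetical, chosen for illustration.
// `fetchImpl` defaults to the global fetch and can be swapped for a stub in tests.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    headers: { 'User-Agent': 'example-scraper/1.0' },
  });
  // fetch() only rejects on network errors, so HTTP error statuses are checked here.
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```

A call such as `fetchHtml('https://example.com').then((html) => console.log(html.length))` would print the size of the downloaded document. The other clients in this list cover the same ground with their own APIs.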

WebSockets

ws: A simple-to-use, blazing fast, and thoroughly tested WebSocket client and server for Node.js
WebSocket-Node: A WebSocket implementation for Node.js (Draft -08 through the final RFC 6455)

Low Level

node:net: A built-in Node.js module that provides an asynchronous network API for creating stream-based TCP or IPC servers and clients
multicast-dns: A low-level multicast DNS implementation in pure JavaScript
node-ip: IP address tools for Node.js

Other

wreck: HTTP client utilities for the hapi web framework
proxy-chain: A Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining

Parsers

HTML/XML Parsers

cheerio: A fast, flexible, and elegant library for parsing and manipulating HTML and XML
fast-xml-parser: A library to validate, parse, and build XML rapidly without C/C++-based libraries or callbacks
node-html-parser: A very fast HTML parser, generating a simplified DOM, with basic element query support
html-dom-parser: An HTML to DOM parser
parse5: An HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant
htmlparser2: A fast and forgiving HTML and XML parser
sax-js: A SAX-style parser for JavaScript

URL Parsers

node:url: A built-in Node.js module that provides utilities for URL resolution and parsing
query-string: A library to parse and stringify URL query strings
URI.js: A JavaScript URL mutation library
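
Scraped links are often relative, so a common first step is resolving them against the URL of the page they came from. The sketch below does this with the WHATWG `URL` class, which is a global in Node.js and is also exported by node:url. The helper name `toAbsoluteUrl` and the example.com addresses are made up for illustration.

```javascript
// Minimal sketch: resolve scraped hrefs against the page URL and read a query parameter.
// `toAbsoluteUrl` is a hypothetical helper name; the URLs are sample values.
function toAbsoluteUrl(href, pageUrl) {
  // The second argument to the URL constructor is the base for relative references.
  return new URL(href, pageUrl).href;
}

const pageUrl = 'https://example.com/products/list?page=2';

console.log(toAbsoluteUrl('../about', pageUrl));          // https://example.com/about
console.log(toAbsoluteUrl('/item/42?ref=list', pageUrl)); // https://example.com/item/42?ref=list
console.log(new URL(pageUrl).searchParams.get('page'));   // 2
```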

CSV Parsers

csv-parse: A parser converting CSV text input into arrays or objects
fast-csv: A CSV parser and formatter for Node.js

PDF Parsers

pdf-parse: A pure JavaScript cross-platform module to extract texts from PDFs
pdf2json: A library that converts binary PDF to JSON and text, for server-side PDF processing and command-line use

HTTP Parsers

http-parser-js: A pure JS HTTP parser for Node.js

Email Parsers

email-reply-parser: A Node.js library for parsing plain text email content
email-forward-parser: A library that parses forwarded emails and extracts original content
node-address-rfc2822: A parser for RFC2822 (Header) format email addresses
smtp-address-parser: A library to parse an SMTP (RFC-5321) address

Markdown Parsers

markdown-it: A Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
marked: A Markdown parser and compiler. Built for speed
micromark: A small, safe, and great CommonMark-compliant (optionally GFM) Markdown parser

YAML Parsers

yaml: A YAML parser and stringifier for JavaScript
js-yaml: A JavaScript YAML parser and dumper. Very fast
yaml-eslint-parser: A YAML parser that produces output compatible with ESLint

SQL Parsers

node-sql-parser: Parse simple SQL statements into an abstract syntax tree (AST) with the visited tableList and convert it back to SQL
js-sql-parser: An SQL parser written with jison. Parses SQL into abstract syntax tree (AST) and stringifies back to SQL

Office File Parsers

xlsx: A spreadsheet data parser and writer
exceljs: A library to read, manipulate, and write spreadsheet data and styles to XLSX and JSON
docx: A library to easily generate and modify .docx files with JS/TS with a nice declarative API. Works for Node.js and on the browser

Other

robots-parser: A Node.js robots.txt parser with support for wildcard (*) matching
sitemapper: A parser for XML Sitemaps to be used with robots.txt and web crawlers
ip-address: A library for parsing and manipulating IPv4 and IPv6 addresses in JavaScript
feed-extractor: The simplest way to read and normalize RSS/Atom/JSON feed data
sanitize-html: A library to clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
parse-css: A standards-based CSS parser
js-xss: A library to sanitize untrusted HTML (to prevent XSS) with a configuration specified by a whitelist

Web Scraping

Frameworks

crawlee: A web scraping and browser automation library for Node.js, usable from JavaScript and TypeScript, for building reliable crawlers. Extracts data for AI, LLMs, RAG, or GPTs and downloads HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation
node-crawler: A web crawler/spider for Node.js + server-side jQuery
ayakashi: A next generation web scraping framework
webster: A reliable high-level web crawling & scraping framework for Node.js

Anti-Bot Bypass

node-curl-impersonate: A library that allows you to use curl-impersonate natively
cloudflare-scraper: A package to bypass Cloudflare's protection
unblocker: A web proxy for evading internet censorship, and general-purpose Node.js library for proxying and rewriting remote webpages

Proxy Integration

Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]

CAPTCHA Solving

CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]
nopecha-nodejs: An automated CAPTCHA solver for Node.js
2captcha: A wrapper around the 2Captcha API
2captcha-javascript: A JavaScript library for easy integration with the API of the 2Captcha CAPTCHA-solving service, to bypass reCAPTCHA, hCaptcha, FunCaptcha, and GeeTest and to solve other CAPTCHAs

User-Agent Spoofing

user-agents: A JavaScript library for generating random user agents with data that's updated daily

Web Automation

Browser Automation Frameworks

puppeteer: A JavaScript library that provides a high-level API to control Chrome or Firefox over the DevTools Protocol or WebDriver BiDi [Puppeteer Proxy Integration]
playwright: A framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API [Playwright Proxy Integration]
selenium: A browser automation framework and ecosystem [Selenium Proxy Integration]
cypress: A fast, easy, and reliable testing tool for anything that runs in a browser
webdriverio: A next-gen browser and mobile automation test framework for Node.js
wendigo: A proper monster for front-end automated testing

Tools and Plugins

puppeteer-extra-plugin-stealth: A plugin for puppeteer-extra and playwright-extra to prevent detection
puppeteer-extra-plugin-anonymize-ua: A plugin for puppeteer-extra and playwright-extra to anonymize the User-Agent on all pages
@extra/proxy-router: A plugin for playwright-extra and puppeteer-extra to route proxies dynamically
puppeteer-extra-plugin-block-resources: A plugin for playwright-extra and puppeteer-extra to block resources (images, media, css, etc.)

Other

playwright-extra: A modular plugin framework for playwright to enable cool plugins through a clean interface
puppeteer-extra: A library to teach puppeteer new tricks through plugins

Data Export

JSON

JSON: A built-in JavaScript namespace that contains static methods for parsing values from and converting values to JavaScript Object Notation
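
As an illustration of how the built-in `JSON` namespace fits into a scraper, the sketch below serializes a list of scraped records and parses them back. The `products` array is made-up sample data, not output from a real site.

```javascript
// Minimal sketch: serialize scraped records to a JSON string and parse them back.
// The `products` array is hypothetical sample data.
const products = [
  { name: 'Widget', price: 9.99, scrapedAt: new Date('2024-01-01T00:00:00Z') },
  { name: 'Gadget', price: 24.5, scrapedAt: new Date('2024-01-02T00:00:00Z') },
];

// The third argument pretty-prints the output with two-space indentation.
const json = JSON.stringify(products, null, 2);

// Date objects are serialized as ISO 8601 strings, so they come back as strings.
const restored = JSON.parse(json);
console.log(restored[0].scrapedAt); // 2024-01-01T00:00:00.000Z
```

To persist the result, the string can be written to disk, for example with `fs.writeFileSync('products.json', json)` from node:fs. The file name here is only an example.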

CSV

csv-generate: A flexible generator of random CSV strings and JavaScript objects implementing the Node.js stream.Readable API

Other

protobuf.js: Protocol buffers for JavaScript & TypeScript
jBinary: A high-level API for working with binary data

Data Processing

Character Encoding

encoding.js: A library to convert and detect character encoding in JavaScript
chardet: A character encoding detection tool for Node.js
iconv-lite: A library to convert character encodings in pure JavaScript

Date and Time

dayjs: A 2kB immutable date-time library alternative to Moment.js with the same modern API
date-fns: A modern JavaScript date utility library
luxon: A library for working with dates and times in JS

Prices

money.js: A tiny (1kb) JavaScript currency conversion library, for web & Node.js
currency.js: A JavaScript library for handling currencies

Phone Numbers

libphonenumber-js: A simpler (and smaller) rewrite of Google Android's libphonenumber library in JavaScript
phone: A library to validate and reformat the mobile phone number to the E.164 standard

Slugs

slug: A library that slugifies even UTF-8 characters
unique-slug: A library to generate a unique character string suitable for use in files and URLs

Languages

remove-accents: A library that removes the accents from a string, converting them to their non-accented corresponding characters
nodejieba: A library that provides Chinese word segmentation for Node.js

Other

Task Scheduling

node-schedule: A cron-like and not-cron-like job scheduler for Node
node-cron: A simple cron-like job scheduler for Node.js
bree: A Node.js and JavaScript job task scheduler with worker threads, cron, Date, and human syntax
cron: A robust tool for running jobs (functions or commands) on schedules defined using the cron syntax

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

HTTP Client: Axios, node-fetch, fetch, or SuperAgent
HTML Parser: Cheerio

All-In-One Web Scraping Framework

Crawlee

Dynamic Web Pages

Playwright, Puppeteer, Selenium, or Cypress

Guides and Tutorials

General Guides

Web Scraping With JavaScript and Node.js Guide
Web Scraping With Next.JS in 2024
Web Scraping with Crawlee: Step-By-Step Tutorial
Using Cheerio NPM for Web Scraping
Playwright Web Scraping - 2024 Guide
Web Scraping with Puppeteer - 2024 Guide
HTTP Requests in Node.js with Fetch API

Proxies

How To Set a Proxy in Axios: Definitive Guide
How To Set a Proxy in SuperAgent
How to Use Proxy in Node-Fetch Guide
How to Use Proxy Servers in Node.js

User Agent Setting

Puppeteer User Agent Guide: Setting and Changing
Node.js User Agent Guide: Setting and Changing

Anti-Bot Bypass

Avoiding Bot Detection with Playwright Stealth
Avoid Getting Blocked With Puppeteer Stealth
How to Bypass CAPTCHAs with Playwright
Using Node Unblocker for Web Scraping

Comparisons

JavaScript vs Python for Web Scraping Comparison
C# vs JavaScript for Web Scraping
JavaScript vs Rust for Web Scraping
Cheerio vs. Puppeteer for Web Scraping
Puppeteer vs. Selenium - Which One to Choose?
Scrapy vs Puppeteer for Web Scraping
Playwright vs. Selenium Comparison 2024
Puppeteer vs Playwright for Web Scraping
