JavaScript Web Scraping

This document contains a list of libraries and resources for web scraping in JavaScript.

Libraries

Note: All selected libraries are actively maintained.

Network

HTTP Clients

  • fetch: A browser-compatible implementation of the fetch() function, built into Node.js
  • axios: A Promise-based HTTP client for the browser and Node.js [Axios Proxy Integration]
  • node-fetch: A light-weight module that brings the Fetch API to Node.js [node-fetch Proxy Integration]
  • undici: An HTTP/1.1 client, written from scratch for Node.js
  • superagent: A small progressive client-side HTTP request library, and Node.js module with the same API, supporting many high-level HTTP client features [SuperAgent Proxy Integration]
  • urllib: A library to request HTTP(s) URLs in a complex world
  • node-libcurl: libcurl bindings for Node.js
  • got: A human-friendly and powerful HTTP request library for Node.js
  • bent: A functional HTTP client for Node.js with async/await
  • needle: A nimble, streamable HTTP client for Node.js. With proxy, iconv, cookie, deflate & multipart support
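
As a minimal sketch of the built-in option from the list above, the snippet below wraps the global fetch() (available without imports in Node.js 18 and later) in a small helper that rejects on non-2xx responses and aborts slow requests. The helper name `fetchText`, the 10-second default timeout, and the example URL are illustrative assumptions, not part of any listed library.

```javascript
// Hypothetical helper: download a response body as text with the built-in fetch().
async function fetchText(url, timeoutMs = 10_000) {
  // AbortSignal.timeout() aborts the request if it runs longer than timeoutMs.
  const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });

  // fetch() only rejects on network-level failures, so HTTP error
  // statuses (4xx/5xx) have to be checked explicitly.
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Example usage (placeholder URL):
// fetchText('https://example.com').then((html) => console.log(html.length));
```

The third-party clients in this list cover the same ground with extra conveniences such as retries, proxy agents, or streaming, so a wrapper like this is mainly useful when avoiding dependencies matters.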

WebSockets

  • ws: A simple to use, blazing fast and thoroughly tested WebSocket client and server for Node.js
  • WebSocket-Node: A WebSocket implementation for Node.js (Draft -08 through the final RFC 6455)

Low Level

  • node:net: A built-in Node.js module that provides an asynchronous network API for creating stream-based TCP or IPC servers and clients
  • multicast-dns: A low-level multicast DNS implementation in pure JavaScript
  • node-ip: IP address tools for Node.js

Other

  • wreck: HTTP client utilities for the hapi web framework
  • proxy-chain: A Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining

Parsers

HTML/XML Parsers

  • cheerio: A fast, flexible, and elegant library for parsing and manipulating HTML and XML
  • fast-xml-parser: A library to validate, parse, and build XML rapidly, without C/C++ based libraries or callbacks
  • node-html-parser: A very fast HTML parser, generating a simplified DOM, with basic element query support
  • html-dom-parser: An HTML to DOM parser
  • parse5: An HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant
  • htmlparser2: A fast and forgiving HTML and XML parser
  • sax-js: A SAX-style parser for JavaScript

URL Parsers

  • node:url: A built-in Node.js module that provides utilities for URL resolution and parsing
  • query-string: A library to parse and stringify URL query strings
  • URI.js: A JavaScript URL mutation library
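
A common scraping task that the built-in URL support handles on its own is turning relative links into absolute ones. The sketch below uses the global `URL` class (the same WHATWG URL implementation that node:url exports); the function name `resolveLinks` and the sample href values are assumptions made for illustration.

```javascript
// Hypothetical helper: resolve href values found on a page against the page's own URL.
function resolveLinks(hrefs, pageUrl) {
  // new URL(relative, base) applies standard URL resolution rules.
  return hrefs.map((href) => new URL(href, pageUrl).href);
}

const pageUrl = 'https://example.com/catalog/page/2?sort=price';

// Sample href values as they might appear in scraped <a> tags.
const links = resolveLinks(
  ['../item/42', '/about', 'https://other.example/path', '?page=3'],
  pageUrl
);
console.log(links);

// Query parameters are exposed through URLSearchParams.
const parsed = new URL(pageUrl);
console.log(parsed.hostname);                 // example.com
console.log(parsed.searchParams.get('sort')); // price
```

Resolving against the page URL, rather than concatenating strings, keeps `..` segments, root-relative paths, and query-only links correct.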

CSV Parsers

  • csv-parse: A parser converting CSV text input into arrays or objects
  • fast-csv: A CSV parser and formatter for Node.js

PDF Parsers

  • pdf-parse: A pure JavaScript cross-platform module to extract texts from PDFs
  • pdf2json: A library that converts binary PDF to JSON and text, for server-side PDF processing and command-line use

HTTP Parsers

Email Parsers

Markdown Parsers

  • markdown-it: A Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
  • marked: A Markdown parser and compiler. Built for speed
  • micromark: A small, safe, and great commonmark (optionally gfm) compliant Markdown parser

YAML Parsers

  • yaml: A YAML parser and stringifier for JavaScript
  • js-yaml: A JavaScript YAML parser and dumper. Very fast
  • yaml-eslint-parser: A YAML parser that produces output compatible with ESLint

SQL Parsers

  • node-sql-parser: Parse simple SQL statements into an abstract syntax tree (AST) with the visited tableList and convert it back to SQL
  • js-sql-parser: An SQL parser written with jison. Parses SQL into abstract syntax tree (AST) and stringifies back to SQL

Office File Parsers

  • xlsx: A spreadsheet data parser and writer
  • exceljs: A library to read, manipulate, and write spreadsheet data and styles to XLSX and JSON
  • docx: A library to easily generate and modify .docx files with JS/TS with a nice declarative API. Works for Node.js and on the browser

Other

  • robots-parser: A Node.js robots.txt parser with support for wildcard (*) matching
  • sitemapper: A parser for XML Sitemaps to be used with robots.txt and web crawlers
  • ip-address: A library for parsing and manipulating IPv4 and IPv6 addresses in JavaScript
  • feed-extractor: Simplest way to read & normalize RSS/ATOM/JSON feed data
  • sanitize-html: A library to clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
  • parse-css: A standards-based CSS parser
  • js-xss: A library to sanitize untrusted HTML (to prevent XSS) with a configuration specified by a whitelist

Web Scraping

Frameworks

  • crawlee: A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation
  • node-crawler: A web crawler/spider for Node.js + server-side jQuery
  • ayakashi: A next generation web scraping framework
  • webster: A reliable high-level web crawling & scraping framework for Node.js

Anti-Bot Bypass

Proxy Integration

  • Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]

CAPTCHA Solving

  • CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]
  • nopecha-nodejs: An automated CAPTCHA solver for Node.js
  • 2captcha: A wrapper around the 2Captcha API
  • 2captcha-javascript: A JavaScript library for easy integration with the 2Captcha CAPTCHA-solving API to bypass reCAPTCHA, hCaptcha, FunCaptcha, GeeTest, and other CAPTCHAs

User-Agent Spoofing

  • user-agents: A JavaScript library for generating random user agents with data that's updated daily
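
To show where a generated user agent ends up, here is a minimal sketch that picks a random string and attaches it as the User-Agent request header using the built-in `Headers` class (Node.js 18+). The hard-coded `USER_AGENTS` list and the helper names are illustrative assumptions; a library such as user-agents would replace the static list with regularly refreshed data.

```javascript
// Illustrative sample strings only; real scrapers need an up-to-date list.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0',
];

// Hypothetical helper: pick one entry uniformly at random.
function randomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Hypothetical helper: build request headers with a randomly chosen User-Agent.
function buildHeaders() {
  return new Headers({
    'User-Agent': randomUserAgent(),
    Accept: 'text/html',
  });
}

// Example usage with the built-in fetch() (placeholder URL):
// fetch('https://example.com', { headers: buildHeaders() });
```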

Web Automation

Browser Automation Frameworks

Tools and Plugins

Other

  • playwright-extra: A modular plugin framework for playwright to enable cool plugins through a clean interface
  • puppeteer-extra: A library to teach puppeteer new tricks through plugins

Data Export

JSON

  • JSON: A built-in JavaScript namespace that contains static methods for parsing values from and converting values to JavaScript Object Notation
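
As a short sketch of exporting scraped records with the built-in JSON namespace, the snippet below serializes an array of objects and restores it with a reviver that turns the date field back into a `Date`. The record shape and the field name `scrapedAt` are assumptions made for this example.

```javascript
// Sample scraped records (hypothetical shape).
const records = [
  { title: 'Example product', price: 19.99, scrapedAt: new Date('2024-01-01T00:00:00Z') },
  { title: 'Another product', price: 5, scrapedAt: new Date('2024-01-02T12:30:00Z') },
];

// The third argument pretty-prints with 2-space indentation.
// Date values are serialized as ISO 8601 strings via Date.prototype.toJSON().
const json = JSON.stringify(records, null, 2);
console.log(json);

// JSON has no date type, so a reviver converts the ISO string back on parse.
const restored = JSON.parse(json, (key, value) =>
  key === 'scrapedAt' ? new Date(value) : value
);
console.log(restored[0].scrapedAt instanceof Date); // true
```

The resulting string can be written to disk with `fs.writeFileSync` from node:fs or sent to any storage layer that accepts text.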

CSV

Other

  • protobuf.js: Protocol buffers for JavaScript & TypeScript
  • jBinary: A high-level API for working with binary data

Data Processing

Character Encoding

  • encoding.js: A library to convert and detect character encoding in JavaScript
  • chardet: A character encoding detection tool for Node.js
  • iconv-lite: A library to convert character encodings in pure JavaScript

Date and Time

  • dayjs: A 2kB immutable date-time library alternative to Moment.js with the same modern API
  • date-fns: A modern JavaScript date utility library
  • luxon: A library for working with dates and times in JS

Prices

  • money.js: A tiny (1kb) JavaScript currency conversion library, for web & Node.js
  • currency.js: A JavaScript library for handling currencies

Phone Numbers

  • libphonenumber-js: A simpler (and smaller) rewrite of Google Android's libphonenumber library in JavaScript
  • phone: A library to validate and reformat the mobile phone number to the E.164 standard

Slugs

  • slug: A library that slugifies even UTF-8 characters
  • unique-slug: A library to generate a unique character string suitable for use in files and URLs

Languages

  • remove-accents: A library that removes the accents from a string, converting them to their non-accented corresponding characters
  • nodejieba: A library that provides Chinese word segmentation for Node.js

Other

Task Scheduling

  • node-schedule: A cron-like and not-cron-like job scheduler for Node
  • node-cron: A simple cron-like job scheduler for Node.js
  • bree: A Node.js and JavaScript job task scheduler with worker threads, cron, Date, and human syntax
  • cron: A robust tool for running jobs (functions or commands) on schedules defined using the cron syntax

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

  • HTTP Client: Axios, node-fetch, fetch, or SuperAgent
  • HTML Parser: Cheerio

All-In-One Web Scraping Framework

  • Crawlee

Dynamic Web Pages

  • Playwright, Puppeteer, Selenium, or Cypress

Guides and Tutorials

General Guides

Proxies

User Agent Setting

Anti-Bot Bypass

Comparisons