Rust Web Scraping

This document contains a list of crates and resources for web scraping in Rust.

Libraries

Note: All selected crates are either widely used or actively maintained.

Network

HTTP Clients

  • reqwest: An ergonomic, batteries-included HTTP Client for Rust
  • ureq: A simple, safe HTTP client
  • curl-rust: Rust bindings to libcurl
  • attohttpc: A lightweight HTTP/1.1 client for Rust
  • actix-web: A powerful, pragmatic, and extremely fast web framework for Rust; its companion crate awc provides the HTTP client
  • isahc: A practical HTTP client that is fun to use

WebSockets

  • rust-websocket: A WebSocket (RFC6455) library written in Rust
  • tungstenite-rs: A lightweight stream-based WebSocket implementation for Rust
  • websocket.rs: A WebSocket implementation for both client and server

Low Level

  • hyper: A low-level, protective, and efficient HTTP library, meant to be a building block for libraries and applications
  • tiny-http: A low-level HTTP server library in Rust
  • libpnet: A crate for cross-platform, low-level networking in Rust
  • pcap: A Rust language crate for accessing the packet sniffing capabilities of libpcap
  • rustls: A modern TLS library in Rust

Other

  • hyper-util: A collection of utilities to do common things with hyper
  • reqwest-middleware: A wrapper around reqwest to allow for client middleware chains

Parsers

HTML/XML Parsers

  • scraper: HTML parsing and querying with CSS selectors
  • html5ever: A high-performance browser-grade HTML5 parser
  • select.rs: A Rust library to extract useful data from HTML documents, suitable for web scraping
  • quick-xml: A high-performance XML reader and writer for Rust
  • roxmltree: A crate to represent an XML document as a read-only tree

URL Parsers

HTTP Parsers

CSV Parsers

PDF Parsers

  • pdf-extract: A Rust library for extracting content from PDFs

Email Parsers

  • mailparse: A Rust library to parse mail files

Markdown Parsers

  • pulldown-cmark: An efficient, reliable parser for CommonMark, a standard dialect of Markdown
  • markdown-rs: A CommonMark compliant markdown parser in Rust with ASTs and extensions

YAML Parsers

SQL Parsers

Office File Parsers

  • calamine: A pure Rust Excel/OpenDocument SpreadSheets file reader

Other

  • pest: A general-purpose parser written in Rust with a focus on accessibility, correctness, and performance
  • rust-cssparser: A Rust implementation of CSS Syntax Level 3
  • ammonia: A crate to repair and secure untrusted HTML
  • ttf-parser: A high-level, safe, zero-allocation TrueType font parser
  • robotstxt: A native Rust port of Google's robots.txt parser and matcher C++ library
  • rss: A library for serializing the RSS web content syndication format
  • collie: A minimal feed reader just for you

Web Scraping

Frameworks

  • spider: A web crawler and scraper for Rust
  • dyer: A Rust crate designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed

Proxy Integration

  • Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]

CAPTCHA Solving

  • CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]
  • challenge-bypass-ristretto: A Rust implementation of the Privacy Pass cryptographic protocol using the Ristretto group

Web Automation

Browser Automation Frameworks

Tools and Plugins

Data Processing

Serialization

  • serde: A serialization framework for Rust

Character Encoding

Text

  • rust-lexical: A Rust crate that provides fast numeric to- and from-string conversion routines

Date and Time

  • chrono: A date and time library for Rust
  • time: A date and time library for Rust that is fully interoperable with the standard library
  • httpdate: HTTP date parsing and formatting

Phone Numbers

  • rust-phonenumber: A library for parsing, formatting and validating international phone numbers

Human Names

  • human-name: A Rust library for parsing and comparing human names

Slugs

  • slug-rs: A small library for generating ASCII slugs from Unicode strings

Other

Multiprocessing

  • tokio: A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, etc.
  • rayon: A data parallelism library for Rust
  • async-task: A task abstraction for building executors
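
As a rough illustration of the data-parallel pattern these crates support, the sketch below spreads per-page work across scoped threads using only the standard library. It is a stand-in, not code from tokio, rayon, or async-task; `count_links` and `count_links_parallel` are hypothetical names, and the pages are assumed to be already fetched.

```rust
use std::thread;

/// Hypothetical per-page work: count opening anchor tags in a fetched body.
/// (A naive substring match, used here only to give the workers something to do.)
fn count_links(html: &str) -> usize {
    html.matches("<a ").count()
}

/// Splits `pages` into chunks and processes each chunk on its own scoped thread.
/// Results come back in the same order as the input pages.
fn count_links_parallel(pages: &[String], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so that `chunks` never receives 0.
    let chunk_size = ((pages.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn every worker first, then join them in order.
        let handles: Vec<_> = pages
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|p| count_links(p)).collect::<Vec<usize>>())
            })
            .collect();

        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Scoped threads let the workers borrow `pages` without cloning. For larger jobs, a work-stealing pool such as rayon's, or an async runtime such as tokio for I/O-bound fetching, is likely the better fit.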

Task Scheduling

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

  • HTTP Client: reqwest, ureq, or curl-rust
  • HTML Parser: scraper, html5ever, select.rs, or quick-xml
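
The split between the two stages can be sketched without any dependencies. In the example below, `Fetcher`, `CannedFetcher`, `extract_title`, and `scrape_title` are hypothetical names chosen for illustration, not APIs from any crate listed here. The fetcher returns a fixed page so the sketch runs offline, and the title extraction is a naive substring search standing in for a real HTML parser.

```rust
/// Stage 1: anything that can turn a URL into a response body.
/// In a real stack, an HTTP client crate would sit behind this trait.
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

/// Stand-in client that returns a fixed page instead of touching the network.
struct CannedFetcher;

impl Fetcher for CannedFetcher {
    fn fetch(&self, url: &str) -> Result<String, String> {
        if url.starts_with("http://") || url.starts_with("https://") {
            Ok("<html><head><title>Example Domain</title></head><body></body></html>".to_string())
        } else {
            Err(format!("unsupported URL: {url}"))
        }
    }
}

/// Stage 2: naive stand-in for an HTML parser.
/// It only finds a plain `<title>` element; a real parser also copes with
/// attributes, entities, and malformed markup.
fn extract_title(html: &str) -> Option<String> {
    let start = html.find("<title>")? + "<title>".len();
    let end = start + html[start..].find("</title>")?;
    Some(html[start..end].trim().to_string())
}

/// Glue: fetch the page, then hand the body to the parsing stage.
fn scrape_title(fetcher: &impl Fetcher, url: &str) -> Result<Option<String>, String> {
    let body = fetcher.fetch(url)?;
    Ok(extract_title(&body))
}
```

With these definitions, `scrape_title(&CannedFetcher, "https://example.com")` returns `Ok(Some("Example Domain".to_string()))`. Keeping the client behind a trait means either stage can be swapped for one of the crates above without touching the other.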

All-In-One Web Scraping Framework

  • spider

Dynamic Web Pages

All-In-One Browser Automation Framework

  • rust-headless-chrome, thirtyfour, or chromiumoxide

Guides and Tutorials

General Guides

Proxies

Comparisons