Perl Web Scraping

This document contains a list of libraries and resources for web scraping in Perl.

Libraries

Note: All selected libraries are either actively maintained or widely used.

Network

General

LWP: A collection of Perl modules that provides a simple, consistent application programming interface to the World-Wide Web
Plack: A PSGI toolkit and server adapters
Mojo: A Perl real-time web framework

HTTP Clients

LWP::UserAgent: A class implementing a web user agent
HTTP::Tiny: A small, simple, correct HTTP/1.1 client
Mojo::UserAgent: A non-blocking I/O HTTP and WebSocket user agent
WWW::Mechanize: A library for handy web browsing in a Perl object
WWW::Curl::UserAgent: A web user agent based on libcurl
REST::Client: A simple client for interacting with RESTful HTTP/HTTPS resources
HTTP::Async: A library to process multiple HTTP requests in parallel without blocking
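
As a rough illustration of how one of the clients above might be used, the following minimal sketch fetches a page with HTTP::Tiny, which ships with core Perl. The helper name `fetch_page`, the user-agent string, and the 10-second timeout are assumptions made for this example, not part of any listed library.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: return the response body for $url, or undef on failure.
sub fetch_page {
    my ($url) = @_;

    # The agent string and timeout are illustrative values, not recommendations.
    my $http = HTTP::Tiny->new(
        agent   => 'example-scraper/0.1',
        timeout => 10,
    );

    # get() does not die on HTTP or connection errors; it returns a hash
    # reference whose "success" field is true only for 2xx responses.
    my $res = $http->get($url);

    return $res->{success} ? $res->{content} : undef;
}
```

Calling `fetch_page('https://example.com/')` would return the page body as a string, or `undef` if the request failed. Checking `success` rather than wrapping the call in `eval` reflects how HTTP::Tiny reports internal errors: as a response with status 599.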

WebSockets

Mojo::WebSocket: A library that implements the WebSocket protocol as described in RFC 6455

Low Level

Net::HTTP: A low-level HTTP connection client

Other

CGI: A library to handle Common Gateway Interface requests and responses

Parsers

HTML Parsers

HTML::TreeBuilder: A parser that builds an HTML syntax tree
HTML::Parser: A collection of modules that parse and extract information from HTML documents
Mojo::DOM: A minimalistic HTML/XML DOM parser with CSS selectors
HTML::HTML5::Parser: A Perl library to parse HTML reliably
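
To give a sense of what CSS-selector-based parsing looks like, here is a small sketch using Mojo::DOM (part of the Mojolicious distribution, which must be installed separately). The helper name `extract_links` and the shape of the returned hash are assumptions chosen for this example.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the page title and all link targets
# found in an HTML string.
sub extract_links {
    my ($html) = @_;

    my $dom = Mojo::DOM->new($html);

    # at() returns the first matching element, or undef if none matches.
    my $title = $dom->at('title');

    # find() returns a Mojo::Collection; map() calls attr('href') on each
    # element and each() flattens the collection into a plain list.
    my @hrefs = $dom->find('a[href]')->map(attr => 'href')->each;

    return {
        title => $title ? $title->text : undef,
        links => \@hrefs,
    };
}
```

The selector `a[href]` skips anchors without an `href` attribute, so the returned list contains only real link targets.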

XML Parsers

XML::Parser: A Perl module for parsing XML documents
XML::LibXML: Perl bindings for libxml2

URL Parsers

URI::Info: A library to extract various information from a URI (URL)

HTTP Parsers

HTTP::Entity::Parser: A PSGI compliant HTTP entity parser
HTTP::Parser::XS: A fast, primitive HTTP request parser
Plack::HTTPParser: A library to parse HTTP headers
HTTP::Body: An HTTP body parser
HTTP::Parser: A library to parse HTTP/1.1 requests into HTTP::Request/Response objects
HTTP::Link::Parser: A library to parse HTTP "Link" headers
HTTP::MultiPartParser: A low-level API for processing MultiPart MIME data streams

Email Parsers

Email::Address::XS: A library to parse and format RFC 5322 email addresses and groups

Markdown Parsers

Markdown::Parser: A Markdown parser only

SQL Parsers

SQL::Parser: SQL parsing and processing engine

Other

HTML::TreeBuilder::XPath: A library to add XPath support to HTML::TreeBuilder
HTML::TreeBuilder::LibXML: An HTML::TreeBuilder- and XPath-compatible interface backed by libxml
WWW::RobotRules: A module that parses robots.txt files as specified in "A Standard for Robot Exclusion"
XML::RSS::LibXML: A library that uses XML::LibXML (libxml2) for parsing RSS
WWW::Sitemap::XML: A library to read and write sitemap XML files as defined at Sitemaps.org

Web Scraping

Frameworks

Scrappy: A powerful web spidering, scraping, and crawling framework
Web::Query: Yet another scraping library like jQuery
WWW::Scraper: A framework for scraping results from search engines

Proxy Integration

Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]
HTTP::Proxy: A pure Perl HTTP proxy
Net::Proxy: A framework for proxying network connections in many ways

CAPTCHA Solving

CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]

Web Automation

Browser Automation Frameworks

WWW::Mechanize::Chrome: A library to automate the Chrome browser
Playwright: Perl bindings for Playwright
Selenium::Edge: Perl bindings to the Selenium WebDriver server
Firefox::Marionette: A library to automate the Firefox browser with the Marionette protocol
WWW::Selenium: Perl Client for the Selenium Remote Control test tool

Tools and Plugins

Firefox::Marionette::Extension::Stealth: A Stealth extension for Firefox::Marionette

Other

Chrome::DevToolsProtocol: An asynchronous dispatcher for the DevTools protocol
JavaScript::SpiderMonkey: A Perl interface to the SpiderMonkey JavaScript engine

Data Export

JSON

JSON: A Perl implementation of a JSON encoder/decoder
JSON::PP: A pure Perl JSON decoder/encoder
JSON::XS: JSON serialising/deserialising, done correctly and fast
Geo::JSON: A Perl OO interface for GeoJSON
JSON::Syck: A YAML-based implementation of JSON parsing and generation
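
As a minimal sketch of exporting scraped data, the example below serializes a record with JSON::PP, which is bundled with core Perl. The helper name `to_json_record` is hypothetical, and enabling `canonical` (sorted keys) is a choice made here so the output is reproducible, not something the module requires.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: serialize a hash reference to a compact JSON string.
sub to_json_record {
    my ($record) = @_;

    # utf8      => emit UTF-8 encoded bytes, ready to write to a file
    # canonical => sort hash keys so the output is stable between runs
    my $json = JSON::PP->new->utf8->canonical;

    return $json->encode($record);
}
```

Because the accessor-style options return the encoder object, they can be chained; adding `->pretty` to the chain would produce indented, human-readable output instead.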

CSV

Text::CSV: Comma-separated values manipulator (using XS or PurePerl)

YAML

YAML::XS: A Perl YAML serialization module using XS and libyaml
YAML: A YAML Perl module
YAML::PP: A YAML 1.2 processor in Perl
YAML::Syck: A fast and lightweight YAML loader and dumper
YAML::Tiny: A library to read/write YAML files with as little code as possible

Other

Excel::Writer::XLSX: A Perl module to create Excel XLSX files
Pod::Markdown: A library to convert POD to Markdown

Data Processing

Character Encoding

PerlIO::encoding: A built-in library to open a filehandle with a transparent encoding filter
Encoding::FixLatin: A library that takes mixed encoding input and produces UTF-8 output
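
The transparent encoding filter mentioned above can be sketched with an `:encoding(...)` layer on `open`. This example uses in-memory filehandles so it runs without touching the disk; the helper names `encode_via_layer` and `decode_via_layer` are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write a character string through a UTF-8 encoding
# layer and return the resulting bytes.
sub encode_via_layer {
    my ($text) = @_;
    my $buf = '';
    open(my $out, '>:encoding(UTF-8)', \$buf) or die "open failed: $!";
    print {$out} $text;
    close($out) or die "close failed: $!";   # close flushes the layer
    return $buf;
}

# Hypothetical helper: read bytes back through the same layer and return
# the decoded character string.
sub decode_via_layer {
    my ($bytes) = @_;
    open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
    local $/;                                # slurp the whole buffer
    my $text = <$in>;
    close($in);
    return $text;
}
```

With a real file, the same layer string would be passed to `open` with a filename in place of the scalar reference, and the rest of the code would be unchanged.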

Date and Time

DateTime: A date and time object for Perl
Mojo::Date: A library that implements HTTP date and time functions, based on RFC 7230, RFC 7231 and RFC 3339
HTTP::Date: A module that provides functions for dealing with the date formats used by the HTTP protocol
Date::Manip::Date: A library that provides methods for working with dates
APR::Date: Perl API for APR date manipulating functions
Date::Calc: A library for Gregorian calendar date calculations

Units of Measurement

Class::Measure::Length: A library to create, compare, and convert units of measurement

Phone Numbers

Number::Phone: A large suite of perl modules for parsing and dealing with phone numbers

URLs and Network Addresses

Mojo::URL: A library that implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs
URI: A library for uniform resource identifiers (absolute and relative)
URI::XS: A fast URI framework, compatible with classic URI, with C++ interface
URL::Encode: A library for encoding and decoding of application/x-www-form-urlencoded encoding

Table of Contents

Libraries

Network

General

HTTP Clients

WebSockets

Low Level

Other

Parsers

HTML Parsers

XML Parsers

URL Parsers

HTTP Parsers

Email Parsers

Markdown Parsers

SQL Parsers

Other

Web Scraping

Frameworks

Proxy Integration

CAPTCHA Solving

Web Automation

Browser Automation Frameworks

Tools and Plugins

Other

Data Export

JSON

CSV

YAML

Other

Data Processing

Character Encoding

Date and Time

Units of Measurement

Phone Numbers

URLs and Network Addresses

Languages

Other

Multiprocessing

Task Scheduling

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

All-In-One Web Scraping Framework

Dynamic Web Pages

All-In-One Browser Automation Framework

Guides and Tutorials