This document contains a list of libraries and resources for web scraping in Perl.
- Libraries
- Popular Web Scraping Stacks
- Guides and Tutorials
Note: All selected libraries are either actively maintained or widely used.
- LWP: A collection of Perl modules that provides a simple, consistent application programming interface to the World-Wide Web
- Plack: A PSGI toolkit and server adapters
- Mojo: A Perl real-time web framework
- LWP::UserAgent: A class implementing a web user agent
- HTTP::Tiny: A small, simple, correct HTTP/1.1 client
- Mojo::UserAgent: A non-blocking I/O HTTP and WebSocket user agent
- WWW::Mechanize: A library for handy web browsing in a Perl object
- WWW::Curl::UserAgent: A web user agent based on libcurl
- REST::Client: A simple client for interacting with RESTful HTTP/HTTPS resources
- HTTP::Async: A library to process multiple HTTP requests in parallel without blocking
- Mojo::WebSocket: A library that implements the WebSocket protocol as described in RFC 6455
- Net::HTTP: A low-level HTTP connection client
- CGI: A library to handle Common Gateway Interface requests and responses
- HTML::TreeBuilder: A parser that builds an HTML syntax tree
- HTML::Parser: A collection of modules that parse and extract information from HTML documents
- Mojo::DOM: A minimalistic HTML/XML DOM parser with CSS selectors
- HTML::HTML5::Parser: A Perl library to parse HTML reliably
- XML::Parser: A Perl module for parsing XML documents
- XML::LibXML: Perl bindings for libxml2
- URI::Info: A library to extract various information from a URI (URL)
- HTTP::Entity::Parser: A PSGI compliant HTTP entity parser
- HTTP::Parser::XS: A fast, primitive HTTP request parser
- Plack::HTTPParser: A library to parse HTTP headers
- HTTP::Body: An HTTP body parser
- HTTP::Parser: A library to parse HTTP/1.1 requests into HTTP::Request/Response objects
- HTTP::Link::Parser: A library to parse HTTP "Link" headers
- HTTP::MultiPartParser: A low-level API for processing MultiPart MIME data streams
- Email::Address::XS: A library to parse and format RFC 5322 email addresses and groups
- Markdown::Parser: A Markdown parser only
- SQL::Parser: SQL parsing and processing engine
- HTML::TreeBuilder::XPath: A library to add XPath support to HTML::TreeBuilder
- HTML::TreeBuilder::LibXML: An HTML::TreeBuilder and XPath compatible interface with libxml
- WWW::RobotRules: A module that parses robots.txt files as specified in "A Standard for Robot Exclusion"
- XML::RSS::LibXML: A library that uses XML::LibXML (libxml2) for parsing RSS
- WWW::Sitemap::XML: A library to read and write sitemap XML files as defined at Sitemaps.org
- Scrappy: A powerful web spidering, scraping, creeping, and crawling framework
- Web::Query: Yet another scraping library like jQuery
- WWW::Scraper: A framework for scraping results from search engines
- Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library (Bright Data's solution)
- HTTP::Proxy: A pure Perl HTTP proxy
- Net::Proxy: A framework for proxying network connections in many ways
- CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more (Bright Data's solution)
- WWW::Mechanize::Chrome: A library to automate the Chrome browser
- Playwright: Perl bindings for Playwright
- Selenium::Edge: Perl bindings to the Selenium Webdriver server
- Firefox::Marionette: A library to automate the Firefox browser with the Marionette protocol
- WWW::Selenium: Perl Client for the Selenium Remote Control test tool
- Firefox::Marionette::Extension::Stealth: A Stealth extension for Firefox::Marionette
- Chrome::DevToolsProtocol: An asynchronous dispatcher for the DevTools protocol
- JavaScript::SpiderMonkey: A Perl interface to the SpiderMonkey JavaScript engine
- JSON: A Perl implementation of a JSON encoder/decoder
- JSON::PP: A pure Perl JSON decoder/encoder
- JSON::XS: JSON serialising/deserialising, done correctly and fast
- Geo::JSON: A Perl OO interface for GeoJSON
- JSON::Syck: A YAML-based implementation of JSON parsing and generation
- Text::CSV: Comma-separated values manipulator (using XS or PurePerl)
- YAML::XS: A Perl YAML serialization module using XS and libyaml
- YAML: A YAML Perl module
- YAML::PP: A YAML 1.2 processor in Perl
- YAML::Syck: A fast and lightweight YAML loader and dumper
- YAML::Tiny: A library to read/write YAML files with as little code as possible
- Excel::Writer::XLSX: A Perl module to create Excel XLSX files
- Pod::Markdown: A library to convert POD to Markdown
- PerlIO::encoding: A built-in library to open a filehandle with a transparent encoding filter
- Encoding::FixLatin: A library that takes mixed encoding input and produces UTF-8 output
- DateTime: A date and time object for Perl
- Mojo::Date: A library that implements HTTP date and time functions, based on RFC 7230, RFC 7231 and RFC 3339
- HTTP::Date: A module that provides functions that deal with the date formats used by the HTTP protocol
- Date::Manip::Date: A library that provides methods for working with dates
- APR::Date: Perl API for APR date manipulating functions
- Date::Calc: A library for Gregorian calendar date calculations
- Class::Measure::Length: A library to create, compare, and convert units of measurement
- Number::Phone: A large suite of Perl modules for parsing and dealing with phone numbers
- Mojo::URL: A library that implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs
- URI: A library for uniform resource identifiers (absolute and relative)
- URI::XS: A fast URI framework, compatible with classic URI, with C++ interface
- URL::Encode: A library for encoding and decoding application/x-www-form-urlencoded data
- Text::Unaccent: A library to remove accents from a string
- MCE: A many-core engine for Perl providing parallel processing capabilities
- Parallel::ForkManager: A simple parallel processing fork manager
- Parallel::Runner: An object to manage running things in parallel processes
- Schedule::Cron: A cron-like scheduler for Perl subroutines
- HTTP Client: LWP::UserAgent, HTTP::Tiny, Mojo::UserAgent, or WWW::Mechanize
- HTML Parser: HTML::TreeBuilder, HTML::Parser, or Mojo::DOM
- Scrappy or Web::Query
- WWW::Mechanize::Chrome, Playwright, or Firefox::Marionette
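
As a rough sketch of the HTTP-client layer in the stacks above, the following uses HTTP::Tiny for the request and JSON::PP to print a summary; both modules ship with core Perl (since 5.14). The function name `fetch_summary`, the user-agent string, and the fields of the summary record are illustrative assumptions, not part of either library. An HTML parser from the list, such as Mojo::DOM or HTML::TreeBuilder, would take over once the response body is available.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP;

# Hypothetical helper: fetch a URL and return a small summary record.
# $http can be any object with an HTTP::Tiny-compatible get() method,
# which returns a hash reference with success, status, and content keys.
sub fetch_summary {
    my ($http, $url) = @_;
    my $res = $http->get($url);
    return {
        url    => $url,
        ok     => $res->{success} ? JSON::PP::true() : JSON::PP::false(),
        status => $res->{status} + 0,                # numify so JSON emits a number
        bytes  => length($res->{content} // ''),     # size of the response body
    };
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $http = HTTP::Tiny->new(
        agent   => 'example-scraper/0.1',            # assumed user-agent string
        timeout => 10,
    );
    my $summary = fetch_summary($http, $ARGV[0]);
    print JSON::PP->new->canonical->pretty->encode($summary);
}
```

Run it with a URL as the first argument; without one, the script only defines the function. Passing the client in as a parameter keeps the helper easy to exercise with a stub object instead of a live request.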