Perl Web Scraping

This document contains a list of libraries and resources for web scraping in Perl.

Libraries

Note: All selected libraries are either actively maintained or widely used.

Network

General

  • LWP: A collection of Perl modules that provides a simple, consistent application programming interface to the World-Wide Web
  • Plack: A PSGI toolkit and server adapters
  • Mojo: A Perl real-time web framework

HTTP Clients

WebSockets

  • Mojo::WebSocket: A library that implements the WebSocket protocol as described in RFC 6455

Low Level

  • Net::HTTP: A low-level HTTP connection client

Other

  • CGI: A library to handle Common Gateway Interface requests and responses

Parsers

HTML Parsers

XML Parsers

URL Parsers

  • URI::Info: A library to extract various information from a URI (URL)

HTTP Parsers

Email Parsers

Markdown Parsers

SQL Parsers

Other

Web Scraping

Frameworks

  • Scrappy: A powerful web spidering, scraping, and crawling framework
  • Web::Query: Yet another scraping library like jQuery
  • WWW::Scraper: A framework for scraping results from search engines

Proxy Integration

  • Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]
  • HTTP::Proxy: A pure Perl HTTP proxy
  • Net::Proxy: A framework for proxying network connections in many ways

CAPTCHA Solving

  • CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]

Web Automation

Browser Automation Frameworks

Tools and Plugins

Other

Data Export

JSON

  • JSON: A Perl implementation of a JSON encoder/decoder
  • JSON::PP: A pure Perl JSON decoder/encoder
  • JSON::XS: JSON serialising/deserialising, done correctly and fast
  • Geo::JSON: A Perl OO interface for GeoJSON
  • JSON::Syck: A YAML-based implementation of JSON parsing and generation
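As a rough illustration of exporting scraped records, the sketch below uses JSON::PP (bundled with Perl since 5.14) from the list above. The `records_to_json` helper and the sample record are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper: serialize an array ref of records to a JSON string.
# canonical(1) sorts hash keys so repeated runs produce stable output.
sub records_to_json {
    my ($records) = @_;
    return JSON::PP->new->canonical(1)->encode($records);
}

# Made-up sample data standing in for scraped results.
my @records = (
    { title => 'Example', url => 'https://example.com/' },
);
print records_to_json(\@records), "\n";
```

JSON::XS exposes the same object-oriented interface, so swapping the module name is usually enough when speed matters.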

CSV

  • Text::CSV: A comma-separated values manipulator (using XS or pure Perl)
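A minimal sketch of writing scraped rows as CSV with Text::CSV (a CPAN module, not bundled with Perl). The `rows_to_csv` helper and the sample rows are hypothetical; lines are joined manually so the example does not depend on the `eol` setting.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical helper: turn a header row plus data rows into CSV text.
sub rows_to_csv {
    my ($header, $rows) = @_;
    my $csv = Text::CSV->new({ binary => 1 })
        or die "Cannot create Text::CSV object: " . Text::CSV->error_diag;
    my @lines;
    for my $row ($header, @$rows) {
        # combine() quotes fields that contain separators, quotes, or spaces.
        $csv->combine(@$row) or die "CSV error: " . $csv->error_diag;
        push @lines, $csv->string;
    }
    return join("\n", @lines) . "\n";
}

# Made-up sample data standing in for scraped results.
print rows_to_csv(
    [ 'name', 'price' ],
    [ [ 'Widget, large', '9.50' ], [ 'Gadget', '3.00' ] ],
);
```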

YAML

  • YAML::XS: A Perl YAML serialization module using XS and libyaml
  • YAML: A YAML Perl module
  • YAML::PP: A YAML 1.2 processor in Perl
  • YAML::Syck: A fast and lightweight YAML loader and dumper
  • YAML::Tiny: A library to read/write YAML files with as little code as possible

Other

Data Processing

Character Encoding

  • PerlIO::encoding: A built-in library to open a filehandle with a transparent encoding filter
  • Encoding::FixLatin: A library that takes mixed encoding input and produces UTF-8 output

Date and Time

  • DateTime: A date and time object for Perl
  • Mojo::Date: A library that implements HTTP date and time functions, based on RFC 7230, RFC 7231 and RFC 3339
  • HTTP::Date: A module that provides functions that deal with the date formats used by the HTTP protocol
  • Date::Manip::Date: A library that provides methods for working with dates
  • APR::Date: A Perl API for APR date manipulation functions
  • Date::Calc: A library for Gregorian calendar date calculations

Units of Measurement

Phone Numbers

  • Number::Phone: A large suite of Perl modules for parsing and dealing with phone numbers

URLs and Network Addresses

  • Mojo::URL: A library that implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs
  • URI: A library for uniform resource identifiers (absolute and relative)
  • URI::XS: A fast URI framework, compatible with classic URI, with C++ interface
  • URL::Encode: A library for encoding and decoding application/x-www-form-urlencoded data
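Scraped pages often contain relative links. The sketch below shows one possible way to resolve them with the URI module listed above; the `absolute_links` helper and the example URLs are hypothetical.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve each href against the page URL and keep
# only http(s) links, dropping schemes such as mailto:.
sub absolute_links {
    my ($base, @hrefs) = @_;
    my @out;
    for my $href (@hrefs) {
        my $uri    = URI->new_abs($href, $base)->canonical;
        my $scheme = $uri->scheme // '';
        next unless $scheme eq 'http' || $scheme eq 'https';
        push @out, $uri->as_string;
    }
    return @out;
}

# Placeholder inputs for illustration.
my @links = absolute_links(
    'https://example.com/blog/post',
    '/about', 'page2', 'mailto:someone@example.com',
);
print "$_\n" for @links;
```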

Languages

Other

Multiprocessing

  • MCE: A many-core engine for Perl providing parallel processing capabilities
  • Parallel::ForkManager: A simple parallel processing fork manager
  • Parallel::Runner: An object to manage running things in parallel processes
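A minimal sketch of spreading work across child processes with Parallel::ForkManager (a CPAN module). The `process_in_parallel` helper is hypothetical, and the workload is a placeholder; a real scraper would fetch and parse each URL inside the child.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical helper: run $work on each item in up to $max_procs child
# processes and return a hash ref of item => result built in the parent.
sub process_in_parallel {
    my ($items, $work, $max_procs) = @_;
    my %results;
    my $pm = Parallel::ForkManager->new($max_procs);

    # Runs in the parent each time a child exits; $data is the reference
    # that the child passed to finish().
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident, $signal, $core_dump, $data) = @_;
        $results{$ident} = $$data if defined $data;
    });

    for my $item (@$items) {
        $pm->start($item) and next;   # parent: continue with the next item
        my $value = $work->($item);   # child: do the work
        $pm->finish(0, \$value);      # child: exit and send the result back
    }
    $pm->wait_all_children;
    return \%results;
}

# Placeholder workload (string length) so the sketch runs offline.
my @urls    = ('https://example.com/a', 'https://example.com/bb');
my $lengths = process_in_parallel(\@urls, sub { length $_[0] }, 2);
printf "%s => %d\n", $_, $lengths->{$_} for sort keys %$lengths;
```

Results travel back to the parent through serialization, so each child should return small, plain data structures rather than open handles or parser objects.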

Task Scheduling

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

  • HTTP Client: LWP::UserAgent, HTTP::Tiny, Mojo::UserAgent, or WWW::Mechanize
  • HTML Parser: HTML::TreeBuilder, HTML::Parser, or Mojo::DOM
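One way this stack might look in practice, pairing HTTP::Tiny with Mojo::DOM from the lists above. This is a sketch under assumptions: the `fetch_html` and `extract_links` helpers are hypothetical, and the parser runs on an inline sample so the example works without network access.

```perl
use strict;
use warnings;
use HTTP::Tiny;   # HTTP client bundled with Perl
use Mojo::DOM;    # HTML parser from the Mojolicious distribution (CPAN)

# Hypothetical helper: fetch a page body, dying on an HTTP failure.
# HTTP::Tiny returns raw bytes; decode them if the page is not ASCII.
sub fetch_html {
    my ($url) = @_;
    my $res = HTTP::Tiny->new(timeout => 10)->get($url);
    die "GET $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return $res->{content};
}

# Hypothetical helper: pull the page title and all link targets out of HTML.
sub extract_links {
    my ($html) = @_;
    my $dom      = Mojo::DOM->new($html);
    my $title_el = $dom->at('title');
    my $title    = $title_el ? $title_el->text : '';
    my @links    = $dom->find('a[href]')->map(attr => 'href')->each;
    return { title => $title, links => \@links };
}

# Inline sample; in a real run, pass fetch_html($url) instead.
my $sample = '<html><head><title>Demo</title></head><body>'
           . '<a href="/a">A</a><a href="/b">B</a><a>no href</a></body></html>';
my $page = extract_links($sample);
print "$page->{title}: @{ $page->{links} }\n";
```

Keeping the fetch and parse steps in separate subs makes it easy to swap in LWP::UserAgent or HTML::TreeBuilder without touching the other half.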

All-In-One Web Scraping Framework

  • Scrappy or Web::Query

Dynamic Web Pages

All-In-One Browser Automation Framework

  • WWW::Mechanize::Chrome, Playwright, or Firefox::Marionette

Guides and Tutorials