Go Web Scraping

This document contains a list of libraries and resources for web scraping in Go.

Libraries

Note: All selected libraries are either widely used or actively maintained.

Network

HTTP Clients

  • net/http: A built-in Go package that provides HTTP client and server implementations
  • fasthttp: A fast HTTP implementation for Go
  • resty: A simple HTTP and REST client library for Go
  • req: A simple Go HTTP client with Black Magic
  • requests: HTTP requests for Gophers
  • heimdall: An enhanced HTTP client for Go
  • go-retryablehttp: A retryable HTTP client in Go
  • retryablehttp-go: A package that provides a familiar HTTP client interface with automatic retries and exponential backoff
  • sling: A Go HTTP client library for creating and sending API requests
  • gorequest: A simplified HTTP client (inspired by Node.js's SuperAgent)

WebSockets

  • gorilla: A fast, well-tested and widely used WebSocket implementation for Go
  • websocket: A minimal and idiomatic WebSocket library for Go

Low Level

  • net: A built-in Go package that provides a portable interface for network I/O, including TCP/IP, UDP, domain name resolution, and Unix domain sockets
  • gots: A Go library for MPEG transport stream handling

Other

  • caddy: A fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS

Parsers

HTML/XML Parsers

  • goquery: A package that brings a syntax and a set of features similar to jQuery to the Go language
  • encoding/xml: A built-in Go package that implements a simple XML 1.0 parser that understands XML namespaces
  • net/html: A supplementary Go package (golang.org/x/net/html) that implements an HTML5-compliant tokenizer and parser
  • xml-stream-parser: An XML stream parser for Go
  • pagser: A simple, extensible, and configurable library that parses and deserializes HTML pages into structs, based on goquery and struct tags, for Go crawlers
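To illustrate the standard-library encoding/xml parser from the list above, here is a small sketch that extracts page URLs from a sitemap document. The `urlSet` type, the `sitemapLocs` helper, and the sample sitemap are hypothetical and exist only for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// urlSet mirrors the part of a sitemap this sketch cares about:
// a list of <url> elements, each with a <loc> child.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// sitemapLocs returns the <loc> value of every <url> entry in data.
func sitemapLocs(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		locs = append(locs, strings.TrimSpace(u.Loc))
	}
	return locs, nil
}

func main() {
	// Sample input; a real scraper would download this from the target site.
	const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>`

	locs, err := sitemapLocs([]byte(sample))
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, loc := range locs {
		fmt.Println(loc)
	}
}
```

Struct tags without a namespace match elements in any namespace, which is why the sitemap's default `xmlns` needs no special handling here.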

URL Parsers

  • net/url: A built-in Go package that parses URLs and implements query escaping
  • urlquery: A URL query string encoder and parser based on Go

Date and Time Parsers

  • dateparse: A library to parse many date strings without knowing format in advance

JSON Parsers

  • jsonparser: One of the fastest alternative JSON parsers for Go that does not require a schema

PDF Parsers

  • pdfcpu: A PDF processor written in Go
  • pdf: A PDF reader in Golang

Email Parsers

  • net/mail: A built-in Go package that implements parsing of mail messages

Markdown Parsers

  • markdown: A Markdown parser and HTML renderer for Go
  • blackfriday: A Markdown processor for Go
  • goldmark: A Markdown parser written in Go. Easy to extend, standard (CommonMark) compliant, well structured

SQL Parsers

Other

  • grobotstxt: A native Go port of Google's robots.txt parser and matcher library
  • gofeed: A library to parse RSS, Atom and JSON feeds in Go
  • go-flags: A Go command line option parser
  • toml: A TOML parser for Golang with reflection

Web Scraping

Frameworks

  • colly: An elegant scraper and crawler framework for Golang
  • surf: A library for stateful programmatic web browsing in Go
  • gospider: A fast web spider written in Go
  • ferret: A library for declarative web scraping
  • pholcus: A distributed high-concurrency crawler software written in pure golang

Proxy Integration

  • Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]
  • goproxy: An HTTP proxy library for Go

CAPTCHA Solving

  • CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]
  • captcha: A package that implements generation and verification of image and audio CAPTCHAs

Web Automation

Browser Automation Frameworks

  • chromedp: A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol
  • rod: A Chrome DevTools Protocol driver for web automation and scraping
  • selenium: A Selenium/Webdriver client for Go
  • playwright-go: A browser automation library to control Chromium, Firefox and WebKit with a single API. A port of Playwright for Go

Tools and Plugins

Other

  • robotgo: A Go native cross-platform RPA and GUI automation library

Data Export

Serialization

  • mus-go: A set of serialization primitives for Golang

JSON

  • encoding/json: A built-in Go package that implements encoding and decoding of JSON as defined in RFC 7159

CSV

  • encoding/csv: A built-in Go package that reads and writes comma-separated values (CSV) files
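The sketch below shows one way the two standard-library encoders above could export the same scraped records. The `Product` type and the `toJSON` and `toCSV` helpers are hypothetical names introduced for this example.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strconv"
)

// Product is a hypothetical record type for scraped items.
type Product struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// toJSON encodes items as a JSON array.
func toJSON(items []Product) (string, error) {
	b, err := json.Marshal(items)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// toCSV encodes items as CSV with a header row.
// The writer quotes fields that contain commas or quotes.
func toCSV(items []Product) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"name", "price"}); err != nil {
		return "", err
	}
	for _, p := range items {
		rec := []string{p.Name, strconv.FormatFloat(p.Price, 'f', 2, 64)}
		if err := w.Write(rec); err != nil {
			return "", err
		}
	}
	// Flush pushes buffered rows to buf; Error reports any write failure.
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	items := []Product{{Name: "Widget, large", Price: 9.5}}

	j, err := toJSON(items)
	if err != nil {
		fmt.Println("json export failed:", err)
		return
	}
	fmt.Println(j)

	c, err := toCSV(items)
	if err != nil {
		fmt.Println("csv export failed:", err)
		return
	}
	fmt.Print(c)
}
```

In a real scraper the `bytes.Buffer` would typically be replaced by an `os.File`, since both encoders write to any `io.Writer`.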

Other

  • unioffice: A pure Go library for creating and processing Office Word (.docx), Excel (.xlsx) and PowerPoint (.pptx) documents
  • yaml: YAML support for the Go language

Data Processing

Text

  • x/text: Supplementary Go libraries (golang.org/x/text) for text processing, many involving Unicode
  • strings: A built-in Go package that implements simple functions to manipulate UTF-8 encoded strings

Character Encoding

  • encoding: A built-in Go package that defines interfaces shared by other packages that convert data to and from byte-level and textual representations
  • utf8: A built-in Go package that implements functions and constants to support text encoded in UTF-8

Date and Time

  • time: A built-in Go package that provides functionality for measuring and displaying time
  • goment: A Go time library inspired by Moment.js
  • now: A time toolkit for golang
  • carbon: A simple, semantic and developer-friendly golang package for time
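Scraped pages rarely agree on a date format. As a rough sketch using only the built-in time package, the example below tries a short list of layouts in order. The `parseDate` helper and the chosen layouts are assumptions; the dateparse library listed under Parsers covers far more formats.

```go
package main

import (
	"fmt"
	"time"
)

// layouts lists the date formats this sketch tries, in order.
var layouts = []string{
	time.RFC3339,
	"2006-01-02",
	"Jan 2, 2006",
	"2 January 2006",
}

// parseDate returns the first successful parse of s against layouts.
func parseDate(s string) (time.Time, error) {
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, nil
		}
	}
	return time.Time{}, fmt.Errorf("unrecognized date format: %q", s)
}

func main() {
	inputs := []string{"2024-03-05T10:00:00Z", "Mar 5, 2024", "5 March 2024", "yesterday"}
	for _, raw := range inputs {
		t, err := parseDate(raw)
		if err != nil {
			fmt.Println(err)
			continue
		}
		// Normalize every recognized date to one output format.
		fmt.Println(t.Format("2006-01-02"))
	}
}
```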

Phone Numbers

  • phonenumbers: The GoLang port of Google's libphonenumber library
  • phonenumber: A library that, with a given country and phone number, validates and formats the mobile phone number to E.164 standard

Slugs

  • slug: A URL-friendly slug generator with multi-language support

Other

Multiprocessing

  • async: A safe way to execute functions asynchronously, recovering them in case of panic. It also provides an error stack that makes the causes of failures easier to find

Task Scheduling

  • gocron: A Golang job scheduling package
  • go-quartz: A minimalist and zero-dependency scheduling library for Go
  • cron: A cron library for Go

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

  • HTTP Client: net/http, req, or go-retryablehttp
  • HTML Parser: goquery

All-In-One Web Scraping Framework

  • colly

Dynamic Web Pages

All-In-One Browser Automation Framework

  • chromedp, rod, or playwright-go

Guides and Tutorials

General Guides

Proxies

Comparisons