# Crawling

[RFC 9309](https://datatracker.ietf.org/doc/html/rfc9309) defines crawlers as automated clients.

Some web servers may reject requests that omit the `User-Agent` header or that use common defaults such as `'curl/7.79.1'`.

In **undici**, the default user agent is `'undici'`. Since undici is integrated into Node.js core as the implementation of `fetch()`, requests made via `fetch()` use `'node'` as the default user agent.

It is recommended to specify a **custom `User-Agent` header** when implementing crawlers. Providing a descriptive user agent allows servers to correctly identify the client and reduces the likelihood of requests being denied.

A user agent string should include sufficient detail to identify the crawler and provide contact information. For example:

```
AcmeCo Crawler - acme.co - crawler@acme.co
```

When adding contact details, avoid using personal identifiers such as your own name or a private email address, especially in a professional or employment context. Instead, use a role-based or organizational contact (e.g., crawler@acme.co) to protect individual privacy while still enabling communication.

If a crawler behaves unexpectedly, for example due to misconfiguration or implementation errors, server administrators can use the information in the user agent to contact the operator and coordinate an appropriate resolution.

The `User-Agent` header can be set on individual requests or applied globally by configuring a custom dispatcher.

**Example: setting a `User-Agent` per request**

```js
import { fetch } from 'undici'

const headers = {
  'User-Agent': 'AcmeCo Crawler - acme.co - crawler@acme.co'
}

const res = await fetch('https://example.com', { headers })
```
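
**Example: setting a `User-Agent` globally with a custom dispatcher**

One possible approach, sketched below under the assumption of a recent undici version that supports `Dispatcher.compose()`, is to compose a small dispatch interceptor onto an `Agent` and register it with `setGlobalDispatcher()`. The `userAgentInterceptor` and `isPlainObject` helpers and the `USER_AGENT` constant are illustrative names, and the interceptor deliberately touches only plain-object headers, leaving other header representations (such as the ones `fetch()` builds internally) untouched.

```js
import { Agent, setGlobalDispatcher, request } from 'undici'

const USER_AGENT = 'AcmeCo Crawler - acme.co - crawler@acme.co'

// Returns true only for plain object literals, so other header
// representations (arrays, iterables) are left untouched below.
function isPlainObject (value) {
  return value !== null && typeof value === 'object' &&
    Object.getPrototypeOf(value) === Object.prototype
}

// A dispatch interceptor: it receives the next dispatch function and returns
// a wrapped one that injects the crawler User-Agent when none is set.
function userAgentInterceptor (dispatch) {
  return function dispatchWithUserAgent (opts, handler) {
    if (opts.headers === undefined || opts.headers === null) {
      opts.headers = { 'user-agent': USER_AGENT }
    } else if (isPlainObject(opts.headers) &&
      !('user-agent' in opts.headers) && !('User-Agent' in opts.headers)) {
      opts.headers = { 'user-agent': USER_AGENT, ...opts.headers }
    }
    return dispatch(opts, handler)
  }
}

// Every request that uses the global dispatcher now carries the header.
setGlobalDispatcher(new Agent().compose(userAgentInterceptor))

const { statusCode } = await request('https://example.com')
console.log(statusCode)
```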

## Best Practices for Crawlers

When developing a crawler, the following practices are recommended in addition to setting a descriptive `User-Agent` header (a combined sketch appears after the list):

* **Respect `robots.txt`**
  Follow the directives defined in the target site’s `robots.txt` file, including disallowed paths and optional crawl-delay settings (see [W3C guidelines](https://www.w3.org/wiki/Write_Web_Crawler)).

* **Rate limiting**
  Regulate request frequency to avoid imposing excessive load on servers. Introduce delays between requests or limit the number of concurrent requests. The W3C suggests at least one second between requests.

* **Error handling**
  Implement retry logic with exponential backoff for transient failures, and stop requests when persistent errors occur (e.g., HTTP 403 or 429).

* **Monitoring and logging**
  Track request volume, response codes, and error rates to detect misbehavior and address issues proactively.

* **Contact information**
  Always include valid and current contact details in the `User-Agent` string so that administrators can reach the crawler operator if necessary.
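
The sketch below ties several of these practices together for requests made with undici's `fetch()`. It is an illustration rather than a production implementation: `fetchDisallowedPrefixes`, `politeFetch`, and `crawl` are hypothetical helper names, the robots.txt handling is a deliberate simplification of RFC 9309, and the one-second delay and retry counts are example values.

```js
import { fetch } from 'undici'

const USER_AGENT = 'AcmeCo Crawler - acme.co - crawler@acme.co'
const headers = { 'User-Agent': USER_AGENT }

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Download robots.txt and collect the Disallow prefixes in groups that apply
// to every user agent ('*'). This is a deliberate simplification of RFC 9309.
async function fetchDisallowedPrefixes (origin) {
  const res = await fetch(new URL('/robots.txt', origin), { headers })
  if (!res.ok) return []
  const disallowed = []
  let appliesToUs = false
  for (const line of (await res.text()).split('\n')) {
    const [field, ...rest] = line.split(':')
    const value = rest.join(':').split('#')[0].trim()
    if (field.trim().toLowerCase() === 'user-agent') {
      appliesToUs = value === '*'
    } else if (field.trim().toLowerCase() === 'disallow' && appliesToUs && value !== '') {
      disallowed.push(value)
    }
  }
  return disallowed
}

// Fetch one URL, retrying transient failures with exponential backoff and
// giving up immediately on persistent errors such as HTTP 403 or 429.
async function politeFetch (url, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const res = await fetch(url, { headers })
    if (res.ok) return res
    if (res.status === 403 || res.status === 429) {
      throw new Error(`Blocked with HTTP ${res.status}; stopping`)
    }
    await sleep(baseDelayMs * 2 ** attempt) // 1s, 2s, 4s, ...
  }
  throw new Error(`Giving up on ${url} after ${retries + 1} attempts`)
}

// Crawl a list of paths on a single origin, skipping disallowed paths and
// waiting at least one second between requests, as suggested by the W3C.
async function crawl (origin, paths) {
  const disallowed = await fetchDisallowedPrefixes(origin)
  for (const path of paths) {
    if (disallowed.some((prefix) => path.startsWith(prefix))) continue
    const res = await politeFetch(new URL(path, origin))
    console.log(res.status, path)
    await sleep(1000)
  }
}

await crawl('https://example.com', ['/', '/about'])
```

Requests are issued sequentially here; a crawler that fetches pages concurrently would additionally need to cap the number of in-flight requests per host.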

## References and Further Reading

* [RFC 9309: The Robots Exclusion Protocol](https://datatracker.ietf.org/doc/html/rfc9309)
* [W3C Wiki: Write Web Crawler](https://www.w3.org/wiki/Write_Web_Crawler)
* [Ethical Web Crawling (WWW 2010 Conference Paper)](https://archives.iw3c2.org/www2010/proceedings/www/p1101.pdf)