
Commit 6d912de

docs: add crawling best practices (#4590)
1 parent 83db0fc commit 6d912de


3 files changed (+79 / -0 lines)

docs/best-practices/crawling.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Crawling

[RFC 9309](https://datatracker.ietf.org/doc/html/rfc9309) defines crawlers as automated clients.

Some web servers may reject requests that omit the `User-Agent` header or that use common defaults such as `'curl/7.79.1'`.

In **undici**, the default user agent is `'undici'`. Because undici is bundled into Node.js core as the implementation of the global `fetch()`, requests made through Node's built-in `fetch()` use `'node'` as the default user agent instead.

It is recommended to specify a **custom `User-Agent` header** when implementing crawlers. Providing a descriptive user agent allows servers to correctly identify the client and reduces the likelihood of requests being denied.

A user agent string should include sufficient detail to identify the crawler and provide contact information. For example:

```
AcmeCo Crawler - acme.co - [email protected]
```

When adding contact details, avoid personal identifiers such as your own name or a private email address, especially in a professional or employment context. Instead, use a role-based or organizational contact (e.g., [email protected]) to protect individual privacy while still enabling communication.

If a crawler behaves unexpectedly, for example due to misconfiguration or implementation errors, server administrators can use the information in the user agent to contact the operator and coordinate an appropriate resolution.

The `User-Agent` header can be set on individual requests or applied globally by configuring a custom dispatcher; both approaches are shown below.

**Example: setting a `User-Agent` per request**

```js
import { fetch } from 'undici'

const headers = {
  'User-Agent': 'AcmeCo Crawler - acme.co - [email protected]'
}

const res = await fetch('https://example.com', { headers })
```
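
**Example: setting a `User-Agent` globally with a custom dispatcher**

One possible approach, sketched below, composes an interceptor onto an `Agent` and registers it as the global dispatcher. This assumes undici's `Dispatcher#compose` interceptor API and that requests pass their headers as a plain object (as `undici.request` does); it is a sketch, not the only way to do this.

```js
import { Agent, request, setGlobalDispatcher } from 'undici'

const userAgent = 'AcmeCo Crawler - acme.co - [email protected]'

// Sketch: an interceptor that injects the crawler's User-Agent into every
// request dispatched through this agent. It assumes opts.headers is a plain
// object (or undefined); adapt it if your requests use other header formats.
const agent = new Agent().compose((dispatch) => (opts, handler) => {
  opts.headers = { 'user-agent': userAgent, ...opts.headers }
  return dispatch(opts, handler)
})

setGlobalDispatcher(agent)

// Requests that rely on the global dispatcher now carry the custom header.
const { body } = await request('https://example.com')
console.log(await body.text())
```

Note that `fetch()` builds its own internal header list before dispatching, so when crawling with `fetch()` the per-request approach above is the most direct way to control the `User-Agent`.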

## Best Practices for Crawlers

When developing a crawler, the following practices are recommended in addition to setting a descriptive `User-Agent` header:

* **Respect `robots.txt`**
  Follow the directives defined in the target site’s `robots.txt` file, including disallowed paths and optional crawl-delay settings (see [W3C guidelines](https://www.w3.org/wiki/Write_Web_Crawler)). A minimal check is sketched after this list.

* **Rate limiting**
  Regulate request frequency to avoid imposing excessive load on servers. Introduce delays between requests or limit the number of concurrent requests; the W3C suggests at least one second between requests (see the sketch after this list).

* **Error handling**
  Implement retry logic with exponential backoff for transient failures, and stop requesting when persistent errors occur (e.g., HTTP 403 or 429). A backoff sketch follows this list.

* **Monitoring and logging**
  Track request volume, response codes, and error rates to detect misbehavior and address issues proactively.

* **Contact information**
  Always include valid and current contact details in the `User-Agent` string so that administrators can reach the crawler operator if necessary.
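
Below is a minimal `robots.txt` check, assuming hypothetical `loadDisallowRules()` and `isAllowed()` helpers and a very naive parser that only reads `Disallow` rules from the `User-agent: *` group. A production crawler should use a complete RFC 9309 parser (handling `Allow`, wildcards, and per-agent groups).

```js
import { fetch } from 'undici'

// Sketch only: collect Disallow rules that apply to all agents ('*').
async function loadDisallowRules (origin, headers) {
  const res = await fetch(new URL('/robots.txt', origin), { headers })
  if (!res.ok) return []

  const rules = []
  let applies = false
  for (const line of (await res.text()).split('\n')) {
    const [field, ...rest] = line.split(':')
    const value = rest.join(':').trim()
    if (/^user-agent$/i.test(field.trim())) applies = value === '*'
    else if (applies && /^disallow$/i.test(field.trim()) && value) rules.push(value)
  }
  return rules
}

// Check a path against the collected rules before requesting it.
function isAllowed (path, disallowRules) {
  return !disallowRules.some((rule) => path.startsWith(rule))
}

const headers = { 'User-Agent': 'AcmeCo Crawler - acme.co - [email protected]' }
const rules = await loadDisallowRules('https://example.com', headers)
console.log(isAllowed('/docs', rules))
```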
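
For rate limiting, the simplest form is to pause between sequential requests. A small sketch (the URL list is illustrative) that uses Node's promise-based timers to wait at least one second between requests:

```js
import { setTimeout as delay } from 'node:timers/promises'
import { fetch } from 'undici'

const headers = { 'User-Agent': 'AcmeCo Crawler - acme.co - [email protected]' }

// Illustrative list of pages to visit.
const urls = ['https://example.com/', 'https://example.com/docs']

// Crawl sequentially, pausing at least one second between requests.
for (const url of urls) {
  const res = await fetch(url, { headers })
  console.log(url, res.status)
  await delay(1000)
}
```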
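
For error handling, retries with exponential backoff can be wrapped around each request. A minimal sketch; the retry limit, delays, and status-code choices are illustrative, not undici defaults:

```js
import { setTimeout as delay } from 'node:timers/promises'
import { fetch } from 'undici'

// Sketch: retry transient failures (HTTP 5xx, network errors) with exponential
// backoff, and give up immediately on persistent errors such as 403 or 429.
async function fetchWithBackoff (url, options, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, options)
      if (res.status === 403 || res.status === 429) throw new Error(`blocked: ${res.status}`)
      if (res.status < 500) return res
      // 5xx responses fall through to the backoff delay below.
    } catch (err) {
      if (attempt === retries || String(err.message).startsWith('blocked:')) throw err
    }
    await delay(2 ** attempt * 1000) // 1s, 2s, 4s, ...
  }
  throw new Error(`giving up on ${url}`)
}
```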

## References and Further Reading

* [RFC 9309: The Robots Exclusion Protocol](https://datatracker.ietf.org/doc/html/rfc9309)
* [W3C Wiki: Write Web Crawler](https://www.w3.org/wiki/Write_Web_Crawler)
* [Ethical Web Crawling (WWW 2010 Conference Paper)](https://archives.iw3c2.org/www2010/proceedings/www/p1101.pdf)

docs/docsify/sidebar.md

Lines changed: 1 addition & 0 deletions
@@ -42,3 +42,4 @@
* [Client Certificate](/docs/best-practices/client-certificate.md "Connect using a client certificate")
* [Writing Tests](/docs/best-practices/writing-tests.md "Using Undici inside tests")
* [Mocking Request](/docs/best-practices/mocking-request.md "Using Undici inside tests")
* [Crawling](/docs/best-practices/crawling.md "Crawling")

test/fetch/user-agent.js

Lines changed: 20 additions & 0 deletions
@@ -25,3 +25,23 @@ test('user-agent defaults correctly', async (t) => {
  t.assert.strictEqual(nodeBuildJSON.userAgentHeader, 'node')
  t.assert.strictEqual(undiciJSON.userAgentHeader, 'undici')
})

test('set user-agent for fetch', async (t) => {
  const server = http.createServer({ joinDuplicateHeaders: true }, (req, res) => {
    res.end(JSON.stringify({ userAgentHeader: req.headers['user-agent'] }))
  })
  t.after(closeServerAsPromise(server))

  server.listen(0)
  await events.once(server, 'listening')
  const url = `http://localhost:${server.address().port}`
  const [nodeBuildJSON, undiciJSON] = await Promise.all([
    nodeBuild.fetch(url, { headers: { 'user-agent': 'AcmeCo Crawler - acme.co - [email protected]' } }).then((body) => body.json()),
    undici.fetch(url, {
      headers: { 'user-agent': 'AcmeCo Crawler - acme.co - [email protected]' }
    }).then((body) => body.json())
  ])

  t.assert.strictEqual(nodeBuildJSON.userAgentHeader, 'AcmeCo Crawler - acme.co - [email protected]')
  t.assert.strictEqual(undiciJSON.userAgentHeader, 'AcmeCo Crawler - acme.co - [email protected]')
})
