Commit f5a7dfb

📦 v2.0.0
📦 NPM: [`@2.0.0`](https://www.npmjs.com/package/spiderable-middleware)
☄️ Packosphere [`@2.0.0`](https://packosphere.com/ostrio/spiderable-middleware)

__Major Changes:__

- ⚠️ Removed `request-libcurl` dependency, replaced with `https` module
- ⚠️ Removed legacy `url` usage, replaced with `new URL` constructor
- ⚠️ Old `requestOptions` doesn't match `http` module options

__Changes:__

- ✨ new `sanitizeUrls` option (see docs)
- ✨ `requestOptions` now passed to `http` module `.request` method
- 📔 Docs update to match changes

__Dependencies__:

- 📦 `[removed]` `[email protected]`
- 📦 `[dev]` `[email protected]`, *was `v4.3.6`*
- 📦 `[dev]` `[email protected]`, *was `v4.18.1`*
- 📦 `[dev]` `[email protected]`, *was `v10.0.0`*
1 parent ff4eab1 commit f5a7dfb

6 files changed

+421
-3601
lines changed

.versions

Lines changed: 34 additions & 33 deletions
@@ -1,51 +1,52 @@
-babel-compiler@7.9.0
+babel-compiler@7.10.5
-callback-hook@1.4.0
-ddp-client@2.5.0
+callback-hook@1.5.1
+ddp-client@2.6.1
-ddp-server@2.5.0
+ddp-server@2.7.0
-local-test:ostrio:spiderable-middleware@1.6.6
-meteor@1.10.0
-minimongo@1.8.0
-modules@0.18.0
-mongo@1.15.0
+local-test:ostrio:spiderable-middleware@2.0.0
+meteor@1.11.4
+minimongo@1.9.3
+modules@0.20.0
+mongo@1.16.8
-npm-mongo@4.3.1
+npm-mongo@4.17.2
-ostrio:spiderable-middleware@1.6.6
+ostrio:spiderable-middleware@2.0.0
[remaining changed entries not recoverable]

README.md

Lines changed: 3 additions & 2 deletions
@@ -23,7 +23,7 @@ Google, Facebook, Twitter, Yahoo, and Bing and all other crawlers and search eng
 - [Meteor.js usage](https://github.com/veliovgroup/spiderable-middleware/blob/master/docs/meteor.md)
 - [Return genuine status code](https://github.com/veliovgroup/spiderable-middleware#return-genuine-status-code)
 - [Speed-up rendering](https://github.com/veliovgroup/spiderable-middleware#speed-up-rendering)
-- [Detect request from Prerendering engine during runtime](https://github.com/veliovgroup/spiderable-middleware#detect-request-from-pre-rendering-engine-during-runtime)
+- [Detect request from Prerendering engine during runtime](https://github.com/veliovgroup/spiderable-middleware#user-content-detect-request-from-pre-rendering-engine-during-runtime)
 - [JavaScript redirects](https://github.com/veliovgroup/spiderable-middleware#javascript-redirects)
 - [AMP Support](https://github.com/veliovgroup/spiderable-middleware#amp-support)
 - [Rendering Endpoints](https://github.com/veliovgroup/spiderable-middleware#rendering-endpoints)
@@ -202,13 +202,14 @@ __Note__: *Only 4 redirects are allowed during one request after 4 redirects ses
 - `opts.serviceURL` {*String*} - Valid URL to Spiderable endpoint (local or foreign). Default: `https://render.ostr.io`. Can be set via environment variables: `SPIDERABLE_SERVICE_URL` or `PRERENDER_SERVICE_URL`
 - `opts.rootURL` {*String*} - Valid root URL of your website. Can be set via an environment variable: `ROOT_URL`
 - `opts.auth` {*String*} - [Optional] Auth string in next format: `user:pass`. Can be set via an environment variables: `SPIDERABLE_SERVICE_AUTH` or `PRERENDER_SERVICE_AUTH`. Default `null`
+- `opts.sanitizeUrls` {*Boolean*} - [Optional] Sanitize URLs in order to "fix" badly composed URLs. Default `false`
 - `opts.botsUA` {*[String]*} - [Optional] An array of strings (case insensitive) with additional User-Agent names of crawlers you would like to intercept. See default [bot's names](https://github.com/veliovgroup/spiderable-middleware/blob/master/lib/index.js#L119). Set to `['.*']` to match all browsers and robots, to serve static pages to all users/visitors
 - `opts.ignoredHeaders` {*[String]*} - [Optional] An array of strings (case insensitive) with HTTP header names to exclude from response. See default [list of ignored headers](https://github.com/veliovgroup/spiderable-middleware/blob/master/lib/index.js#L121). Set to `['.*']` to ignore all headers
 - `opts.ignore` {*[String]*} - [Optional] An array of strings (case __sensitive__) with ignored routes. Note: it's based on first match, so route `/users` will cause ignoring of `/part/users/part`, `/users/_id` and `/list/of/users`, but not `/user/_id` or `/list/of/blocked-users`. Default `null`
 - `opts.only` {*[String|RegExp]*} - [Optional] An array of strings (case __sensitive__) or regular expressions (*could be mixed*). Define __exclusive__ route rules for pre-rendering. Could be used with `opts.onlyRE` rules. __Note:__ To define "safe" rules as {*RegExp*} it should start with `^` and end with `$` symbols, examples: `[/^\/articles\/?$/, /^\/article\/[A-z0-9]{16}\/?$/]`
 - `opts.onlyRE` {*RegExp*} - [Optional] Regular Expression with __exclusive__ route rules for pre-rendering. Could be used with `opts.only` rules
 - `opts.timeout` {*Number*} - [Optional] Number, proxy-request timeout to rendering endpoint in milliseconds. Default: `180000`
-- `opts.requestOptions` {*Object*} - [Optional] Options for request module (like: `timeout`, `debug`, `proxy`), for all available options see [`request-libcurl` API docs](https://github.com/veliovgroup/request-extra#request-options)
+- `opts.requestOptions` {*Object*} - [Optional] Options for request module (like: `timeout`, `lookup`, `insecureHTTPParser`), for all available options see [`http` API docs](https://nodejs.org/docs/latest-v14.x/api/http.html#http_http_request_url_options_callback)
 - `opts.debug` {*Boolean*} - [Optional] Enable debug and extra logging, default: `false`

 __Note:__ *Setting* `.onlyRE` *and/or* `.only` *rules are highly recommended. Otherwise, all routes, including randomly generated by bots will be subject of Pre-rendering and may cause unexpectedly higher usage.*
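
As a rough illustration of the options touched by this README change, the sketch below builds an options object for v2.0.0. The `rootURL`, `auth`, and route pattern are placeholder values chosen for the example, and the commented-out lines assume the package is installed locally; they are not taken from the commit.

```javascript
// Hypothetical configuration for spiderable-middleware v2.0.0.
// All concrete values below are placeholders for illustration.
const spiderableOptions = {
  rootURL: 'https://example.com',        // placeholder site URL
  serviceURL: 'https://render.ostr.io',  // default endpoint named in the README
  auth: 'user:pass',                     // placeholder credentials, "user:pass" format
  sanitizeUrls: true,                    // new in v2.0.0: collapse repeated slashes
  only: [/^\/articles\/?$/],             // "safe" rule: anchored with ^ and $
  timeout: 180000,                       // README default, in milliseconds
  // As of v2.0.0 these are Node.js `http.request` options,
  // no longer `request-libcurl` options such as `debug` or `proxy`.
  requestOptions: { timeout: 30000 }
};

// Assuming the package is installed, the object would be passed to the constructor:
// const Spiderable = require('spiderable-middleware');
// const spiderable = new Spiderable(spiderableOptions);
```

Anyone upgrading from 1.x would need to review their existing `requestOptions`, since the commit message warns that the old keys no longer match.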

lib/index.js

Lines changed: 100 additions & 79 deletions
@@ -1,10 +1,10 @@
-var url = require('url');
-var request = require('request-libcurl');
+var https = require('https');

 if (typeof window !== 'undefined') {
   throw new Error('Running `spiderable-middleware` in Browser environment isn\'t allowed! Please make sure `spiderable-middleware` NPM package is imported and used only in Node.js environment.');
 }

+var keepAliveAgent = new https.Agent({ keepAlive: true });
 var strs = {
   pipe: '|',
   empty: '',
@@ -34,20 +34,6 @@ var _warn = function warn (...args) {
   console.warn.call(console, '[WARN] [Spiderable-Middleware]', ...args);
 };

-request.defaultOptions.debug = false;
-request.defaultOptions.headers = { 'User-Agent': 'spiderable-middleware/1.6.4', Accept: '*/*' };
-request.defaultOptions.noStorage = true;
-request.defaultOptions.rawBody = true;
-request.defaultOptions.retry = true;
-request.defaultOptions.retries = 3;
-request.defaultOptions.retryDelay = 128;
-request.defaultOptions.timeout = 102400;
-request.defaultOptions.wait = true;
-request.defaultOptions.badStatuses = [502, 503, 504, 599];
-request.defaultOptions.isBadStatus = function (statusCode, badStatuses) {
-  return badStatuses.includes(statusCode);
-};
-
 module.exports = (function () {
   function Spiderable(_opts) {
     var opts = {};
@@ -65,6 +51,7 @@ module.exports = (function () {
     this.timeout = opts.timeout || 180000;
     this.staticExt = opts.staticExt || re.staticExt;
     this.serviceURL = opts.serviceURL || process.env.SPIDERABLE_SERVICE_URL || process.env.PRERENDER_SERVICE_URL || 'https://render.ostr.io';
+    this.sanitizeUrls = opts.sanitizeUrls || false;
     this.ignoredHeaders = opts.ignoredHeaders || Spiderable.prototype.ignoredHeaders;
     this.requestOptions = opts.requestOptions || {};

@@ -148,8 +135,22 @@ module.exports = (function () {
       return next();
     }

-    var urlObj = url.parse(req.url, true);
-    if ((urlObj.query && urlObj.query._escaped_fragment_ !== void 0) || this.botsRE.test(req.headers[strs.ua] || strs.empty)) {
+    var path = req.url;
+    if (this.sanitizeUrls) {
+      path = path.replace(/\/+/g, '/');
+    }
+
+    var urlObj;
+    try {
+      urlObj = new URL(path, this.rootURL);
+    } catch (e) {
+      // BAD URL IS PASSED!
+      // IGNORING AND PASSING DOWN TO THE APP
+      return next();
+    }
+
+    var escapedFragment = urlObj.searchParams.has('_escaped_fragment_') ? urlObj.searchParams.get('_escaped_fragment_') : false;
+    if (escapedFragment !== false || this.botsRE.test(req.headers[strs.ua] || strs.empty)) {
       var hasIgnored = false;
       var hasOnly = false;
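
The URL handling introduced in this hunk can be tried in isolation. The sketch below is a minimal standalone version, assuming a hypothetical helper name `parseRequestUrl` and a placeholder root URL; the commit itself performs these steps inline in the middleware.

```javascript
// `parseRequestUrl` is a hypothetical helper; it mirrors the steps in the hunk above.
function parseRequestUrl(path, rootURL, sanitizeUrls) {
  if (sanitizeUrls) {
    // Collapse runs of slashes into one, as the `sanitizeUrls` option does
    path = path.replace(/\/+/g, '/');
  }

  let urlObj;
  try {
    // WHATWG URL constructor: resolves `path` against the site's root URL
    urlObj = new URL(path, rootURL);
  } catch (e) {
    // Unparseable input: the middleware calls next() here; the sketch returns null
    return null;
  }

  // `false` means the parameter is absent; an empty string means present but empty
  const escapedFragment = urlObj.searchParams.has('_escaped_fragment_')
    ? urlObj.searchParams.get('_escaped_fragment_')
    : false;

  return { pathname: urlObj.pathname, escapedFragment: escapedFragment };
}

const parsed = parseRequestUrl('//articles///42?_escaped_fragment_=', 'https://example.com', true);
console.log(parsed.pathname, JSON.stringify(parsed.escapedFragment));
```

One reason sanitizing can matter: without it, a path beginning with `//` is read by `new URL` as a protocol-relative reference, so its first segment is interpreted as a host name rather than as part of the path.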

@@ -193,96 +194,121 @@ module.exports = (function () {
       }

       var reqUrl = this.rootURL;
-
-      urlObj.path = urlObj.path.replace(re.trailingSlash, strs.empty).replace(re.beginningSlash, strs.empty);
-      if (urlObj.query && typeof urlObj.query._escaped_fragment_ === 'string' && urlObj.query._escaped_fragment_.length) {
-        urlObj.pathname += '/' + urlObj.query._escaped_fragment_.replace(re.beginningSlash, strs.empty);
+      urlObj.pathname = urlObj.pathname.replace(re.beginningSlash, strs.empty);
+      if (typeof escapedFragment === 'string' && escapedFragment.length) {
+        urlObj.pathname += '/' + (escapedFragment.replace(re.beginningSlash, strs.empty));
       }

-      reqUrl += '/' + urlObj.pathname;
-      reqUrl = reqUrl.replace(/([^:]\/)\/+/g, '$1');
+      reqUrl += '/' + urlObj.pathname.replace(re.beginningSlash, strs.empty);
+      // reqUrl = reqUrl.replace(/([^:]\/)\/+/g, '$1');
       reqUrl = (this.serviceURL + '/?url=' + encodeURIComponent(reqUrl));

       if (req.headers[strs.ua]) {
         reqUrl += '&bot=' + encodeURIComponent(req.headers[strs.ua]);
       }

-      var opts = Object.assign({}, this.requestOptions, {
-        uri: reqUrl,
-        auth: this.auth || false,
-        debug: this.debug
-      });
+      var reqHeaders = {
+        'User-Agent': 'spiderable-middleware/2.0.0',
+        Accept: '*/*',
+      };
+
+      if (this.auth) {
+        reqHeaders.Authorization = 'Basic ' + Buffer.from(this.auth).toString('base64');
+      }
+
+      var payload = Object.assign({
+        method: method.toUpperCase(),
+        headers: reqHeaders,
+        agent: keepAliveAgent,
+      }, this.requestOptions);

       try {
-        var usedHeaders = [];
         var _headersRE = this.headersRE;
-        var serviceReq = request(opts, function (error, resp) {
-          if (error) {
-            // DO NOT THROW AN ERROR ABOUT ABORTED REQUESTS
-            if (!req.aborted && error.statusCode !== 499) {
-              _warn('Error while connecting to external service:', error);
-              next();
+        var url = new URL(reqUrl);
+
+        var serviceReq = https.request(url, payload, function (resp) {
+          for (var _hName in resp.headers) {
+            if (resp.headers[_hName]) {
+              var hName = _hName.toLowerCase();
+              if (!res.headersSent && !_headersRE.test(hName)) {
+                res.setHeader(hName, resp.headers[_hName]);
+              }
             }
-          } else {
-            if (resp.statusCode === 401 || resp.statusCode === 403) {
-              _warn('Can\'t authenticate! Please check you "auth" parameter and other settings.');
+          }
+
+          if (resp.statusCode === 401 || resp.statusCode === 403) {
+            _warn('Can\'t authenticate! Please check your "auth" parameter and other settings.');
+          }
+
+          if (method === strs.head) {
+            res.writeHead(resp.statusCode);
+            res.end();
+            return;
+          }
+
+          if (!res.headersSent) {
+            res.writeHead(resp.statusCode);
+          }
+
+          resp.on('data', function (data) {
+            if (!res.finished && !res.writableEnded) {
+              res.write(data);
             }
+          });

-            if (method === strs.head) {
-              res.end();
+          resp.on('end', function (data) {
+            if (!res.finished && !res.writableEnded) {
+              res.end(data);
             }
-          }
+          });
         });

-        serviceReq.onHeader(function (header) {
-          var h = header.toString('utf8');
-          if (h.includes(strs.semicolon)) {
-            h = h.split(strs.semicolon);
-            h[0] = h[0].trim().toLowerCase();
-            h[1] = h[1].replace(re.newLine, strs.empty).trim();
-
-            if (!res.headersSent && h[1].length && !usedHeaders.includes(h[0])) {
-              if (h[0] === 'status') {
-                var status = h[1].match(re.digit);
-                if (status && status[0]) {
-                  res.statusCode = status[0];
-                  usedHeaders.push(h[0]);
-                }
-              } else if (!_headersRE.test(h[0])) {
-                try {
-                  res.setHeader(h[0], h[1]);
-                } catch (e) {
-                  _warn('.setHeader() Error:', e);
-                }
-                usedHeaders.push(h[0]);
-              }
+        var onEnd = function (error) {
+          if (error) {
+            // DO NOT THROW AN ERROR ABOUT ABORTED REQUESTS
+            if (!req.writableEnded && !req.aborted && !req.destroyed && error?.statusCode !== 499) {
+              _warn('Error while connecting to external service:', error);
+              next();
+              return;
             }
           }
-        });

-        if (method !== strs.head) {
-          serviceReq.pipe(res);
-        }
+          if (!res.headersSent) {
+            res.writeHead(200);
+          }
+          if (!res.finished && !res.writableEnded) {
+            res.end();
+          }
+          if (!serviceReq.writableEnded && !serviceReq.aborted && !serviceReq.destroyed) {
+            serviceReq.end();
+          }
+        };
+
+        serviceReq.on('abort', onEnd);
+        serviceReq.on('error', onEnd);
+        serviceReq.on('timeout', onEnd);
+        serviceReq.setNoDelay(true);
+        serviceReq.setTimeout(this.timeout, onEnd);

         req.on('error', function (error) {
           _warn('[REQ] ["error" event] Unexpected error:', error);
-          serviceReq.abort();
+          serviceReq.destroy();
           next();
         });

         res.on('error', function (error) {
           _warn('[RES] ["error" event] Unexpected error:', error);
-          serviceReq.abort();
+          serviceReq.destroy();
           next();
         });

         req.on('aborted', function () {
           // No need to log this event as nothing bad happened
           // this simply means host which sent this request
           // has aborted the connection or got disconnected
-          // _warn('[REQ] ["aborted" event]:', arguments); // TODO: comment out
+          _warn('[REQ] ["aborted" event]:', arguments); // TODO: comment out
           req.aborted = true;
-          serviceReq.abort();
+          serviceReq.destroy();
           try {
             res.end();
           } catch (e) {
@@ -296,13 +322,8 @@ module.exports = (function () {
         // SET TIMEOUT AS A PRECAUTION
         res.setTimeout(this.timeout);

-        // res.on('close', function () {
-        // _warn('[RES] ["close" event]:', arguments);
-        // serviceReq.abort();
-        // next();
-        // });
-
-        serviceReq.send();
+        // SEND REQUEST TO PRERENDERING ENDPOINT
+        serviceReq.end();
       } catch (e) {
         _warn('Exception while connecting to external service:', e);
         next();
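
To make the new request setup easier to follow, here is a minimal sketch of how the options for `https.request` are assembled in this commit. The helper name `buildPayload` and the sample credentials are assumptions made for illustration; the commit builds the same object inline.

```javascript
const https = require('https');

// Shared agent so connections to the rendering endpoint are reused, as in the diff
const keepAliveAgent = new https.Agent({ keepAlive: true });

// `buildPayload` is a hypothetical helper name, not part of the library
function buildPayload(method, auth, requestOptions) {
  const headers = {
    'User-Agent': 'spiderable-middleware/2.0.0',
    Accept: '*/*',
  };

  if (auth) {
    // `auth` is a "user:pass" string; HTTP Basic auth sends it base64-encoded
    headers.Authorization = 'Basic ' + Buffer.from(auth).toString('base64');
  }

  // User-supplied requestOptions are applied last, so they override the defaults
  return Object.assign({
    method: method.toUpperCase(),
    headers: headers,
    agent: keepAliveAgent,
  }, requestOptions);
}

const payload = buildPayload('get', 'user:pass', { timeout: 5000 });
// The result would then be passed as: https.request(new URL(reqUrl), payload, callback)
```

Because `Object.assign` is shallow, a `headers` key inside `requestOptions` would replace the default headers wholesale, including the `Authorization` header derived from `auth`.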
