This repository was archived by the owner on Nov 10, 2023. It is now read-only.

Commit 52b8dcc

author: kalise
commit message: first commit
0 parents, commit 52b8dcc

File tree: 1,101 files changed, +785,738 −0 lines


.gitignore (+1 line)

.DS_Store

README.md (+166 lines)

# wscraper

wscraper.js is a web scraper agent written in node.js and based on [cheerio.js][0], a fast, flexible, and lean implementation of core jQuery.
It is built on top of [request.js][1] and inspired by [http-agent.js][2].

## Usage

There are two ways to use wscraper: HTTP agent mode and local mode.

### HTTP Agent mode

In HTTP agent mode, pass the agent a host, a list of URLs to visit, and a scraping JS script. For each URL, the agent makes a request, gets the response, runs the scraping script, and returns the result of the scraping. Valid usage is:

```js
// scrape a single page from a web site
var agent = wscraper.createAgent();
agent.start('google.com', '/finance', script);

// scrape multiple pages from a website
agent.start('google.com', ['/', '/finance', '/news'], script);
```
The URLs should be passed as an array of strings. If only one page needs to be scraped, the URL can be passed as a single string. Null or empty URLs are treated as the root '/'. Suppose you want to scrape the stock prices of Apple, Cisco, and Microsoft from the http://google.com/finance website.
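The path handling described above (leading slash optional, null or empty falling back to the root) can be sketched as a plain function; `toUrl` below is a hypothetical helper mirroring what the agent does internally, not part of the wscraper API:

```javascript
// Hypothetical helper mirroring the agent's internal path-to-URL normalization:
// a missing leading '/' is inserted, and null/empty paths fall back to root '/'.
function toUrl(host, path) {
  path = path || '/'; // null or empty URLs are treated as root '/'
  if (path.indexOf('/') === 0) {
    return 'http://' + host + path;
  }
  return 'http://' + host + '/' + path;
}

console.log(toUrl('google.com', '/finance')); // http://google.com/finance
console.log(toUrl('google.com', 'news'));     // http://google.com/news
console.log(toUrl('google.com', null));       // http://google.com/
```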

```js
// load node.js libraries
var util = require('util');
var wscraper = require('wscraper');
var fs = require('fs');

// load the scraping script from a file
var script = fs.readFileSync('/scripts/googlefinance.js', 'utf8');

var companies = ['/finance?q=apple', '/finance?q=cisco', '/finance?q=microsoft'];

// create a web scraper agent instance
var agent = wscraper.createAgent();

agent.on('start', function (n) {
  util.log('[wscraper.js] agent has started; ' + n + ' path(s) to visit');
});

agent.on('done', function (url, result) {
  util.log('[wscraper.js] data from ' + url);
  // display the results
  util.log('[wscraper.js] current stock price is ' + result.price + ' USD');
  // next item to process, if any
  agent.next();
});

agent.on('stop', function (n) {
  util.log('[wscraper.js] agent has ended; ' + n + ' path(s) remained to visit');
});

agent.on('abort', function (e) {
  util.log('[wscraper.js] getting a FATAL ERROR [' + e + ']');
  util.log('[wscraper.js] agent has aborted');
  process.exit();
});

// run the web scraper agent
agent.start('www.google.com', companies, script);
```

The scraping script should be pure client-side JavaScript, including jQuery selectors; see [cheerio.js][0] for details. It should return a valid JavaScript object.
The scraping script is passed as a string and is usually read from a file. You can scrape different websites without changing a line of the main code: just write different JavaScript scripts.
The scraping script is executed in a sandbox, using a separate VM context, so script errors are caught without crashing the main code.

At the time of writing, the google.com/finance website reports financial data of public companies as in the following HTML snippet:

```html
...
<div id="price-panel" class="id-price-panel goog-inline-block">
  <div>
    <span class="pr">
      <span id="ref_22144_l">656.06</span>
    </span>
  </div>
</div>
...
```
Using jQuery selectors, we design the scraping script "googlefinance.js" to find the current value of a company's stock and return it as text:

```js
/*
  googlefinance.js

  $ -> the DOM document to be parsed
  result -> the object containing the result of the parsing
*/

result = {};
price = $('div.id-price-panel').find('span.pr').children().text();
result.price = price;

// result.price is '656.06'
```

### Local mode

Sometimes you need to scrape local HTML files without making a request to a remote server. Wscraper can be used as an inline scraper: it takes an HTML string and a JS scraping script, runs the script, and returns the result of the scraping. Valid usage is:

```js
var scraper = wscraper.createScraper();
scraper.run(html, script);
```

As a trivial example, suppose you want to replace the class name of `<div>` elements that only contain an image with a given class. Create a scraper:
```js
// load node.js libraries
var util = require('util');
var fs = require('fs');
var wscraper = require('wscraper');

// load your html page
var html = fs.readFileSync('/index.html', 'utf8');

// load the scraping script from a file
var script = fs.readFileSync('/scripts/replace.js', 'utf8');

// create the scraper
var scraper = wscraper.createScraper();

scraper.on('done', function (result) {
  // do something with the result
  util.log(result);
});

scraper.on('abort', function (e) {
  util.log('Getting an error in parsing: ' + e);
});

// run the scraper
scraper.run(html, script);
```
Using jQuery selectors, we design the scraping script "replace.js" to find the `<div>` elements containing images with class="MyPhotos" and replace each of them with a `<div>` element having class="Hidden" and no image inside.

```js
/*
  replace.js

  $ -> the DOM document to be parsed
  result -> the object containing the result of the parsing
  (use JSON.stringify(result) if you need the result as a JSON string)
*/

result = {};
var imgs = $('img.MyPhotos').toArray();
$.each(imgs, function (index, elem) {
  var newdiv = $('<div class="Hidden"></div>');
  $(elem).parent().replaceWith(newdiv);
});

result.replaced = $.html() || '';
```

Happy scraping!

### Author: kalise © 2012, MIT Licensed

[0]: https://github.com/MatthewMueller/cheerio
[1]: https://github.com/mikeal/request
[2]: https://github.com/indexzero/http-agent

lib/wscraper.js (+165 lines)
/*
 * wscraper.js: a web scraper agent based on cheerio.js, a fast, flexible, and lean implementation of core jQuery;
 * built on top of request.js;
 * inspired by http-agent.js;
 *
 * (C) 2012 Kalise
 * MIT LICENSE
 *
 */

var fs = require('fs'),
    util = require('util'),
    EventEmitter = require('events').EventEmitter,
    vm = require('vm'),
    request = require('request'),
    cheerio = require('cheerio'),
    Iconv = require('iconv').Iconv;

exports.createAgent = function () {
  return new WebScraper();
};

var WebScraper = function () {
  EventEmitter.call(this);
  this.host = '';
  this.paths = [];
  this.script = '';
  this.sandbox = {
    $: '',      // $ -> the DOM document to be parsed
    result: {}  // result -> the object containing the result of the parsing
  };
  this.running = false;
  this.unvisited = [];
  this.options = {
    uri: '',
    method: 'GET',
    headers: { 'accept-charset': 'UTF-8', 'accept': 'text/html' },
    encoding: null
  };
};

util.inherits(WebScraper, EventEmitter);

WebScraper.prototype.start = function (host, paths, script) {
  if (!this.running) {
    this.running = true;
    this.host = host || 'localhost';
    if ((paths instanceof Array) && paths.length) {
      this.paths = paths;
    }
    if (typeof paths === 'string') {
      this.paths[0] = paths;
    }
    this.script = script || '';
    // in javascript, assigning an array or an object to a variable makes a reference to the value,
    // so we use slice(0) to make a copy of the array
    this.unvisited = this.paths.slice(0);
    this.emit('start', this.paths.length);
    this.next();
  }
  else util.log('[wscraper.js] agent is still running, use agent.stop() before starting it again');
};

WebScraper.prototype.stop = function () {
  if (this.running) {
    this.running = false;
    this.emit('stop', this.unvisited.length);
  }
  else util.log('[wscraper.js] agent is not running, use agent.start() before stopping it');
};

WebScraper.prototype.next = function () {
  if (this.running) {
    if (this.unvisited.length > 0) {
      var path = this.unvisited.shift();
      var url = '';
      if (path.indexOf('/') == 0) {
        url = 'http://' + this.host + path;
      } else {
        url = 'http://' + this.host + '/' + path;
      }
      util.log('[wscraper.js] sending a request to: ' + url);
      this.options.uri = url;
      var self = this;
      request(self.options, function (error, response, body) {
        // currently only the 200 OK status code is accepted as valid for web scraping
        // TODO: handle 3XX (redirection) status codes
        if (error || response.statusCode != 200) {
          self.emit('abort', 'error or bad response from ' + url);
          return;
        }
        var data = body || {};
        // check response.headers['content-type'] to detect the encoding used by the server
        // TODO: support all conversions supported by iconv.js
        var encoding = 'UTF-8';
        if (response.headers['content-type'].match('charset=ISO-8859-1')) {
          encoding = 'ISO-8859-1';
        }
        if (encoding != 'UTF-8') { // convert the data stream from ISO-8859-1 to UTF-8
          var iconv = new Iconv(encoding, 'UTF-8');
          data = iconv.convert(body);
        }
        // load the data in the sandbox
        self.sandbox.$ = cheerio.load(data.toString());
        try {
          // run the script in the sandbox
          vm.runInNewContext(self.script.toString(), self.sandbox);
        } catch (e) {
          self.emit('abort', e); // catch any error thrown by the script
          return;
        }
        if (self.sandbox.result) {
          self.emit('done', url, self.sandbox.result);
        } else {
          self.emit('abort', 'parsing script returned a null value!');
        }
      });
    }
    else {
      this.stop();
    }
  }
  else util.log('[wscraper.js] agent is not running, start it by calling agent.start()');
};

// use the Scraper object without making any http request
exports.createScraper = function () {
  return new Scraper();
};

var Scraper = function () {
  EventEmitter.call(this);
  this.html = '';
  this.script = '';
  this.sandbox = {
    $: '',      // $ -> the DOM document to be parsed
    result: {}  // result -> the object containing the result of the parsing
  };
};

util.inherits(Scraper, EventEmitter);

Scraper.prototype.run = function (html, script) {
  this.html = html || '';
  this.script = script || '';
  this.emit('run');
  this.sandbox.$ = cheerio.load(this.html.toString());
  // run the loaded script in a sandbox
  try {
    vm.runInNewContext(this.script.toString(), this.sandbox);
  } catch (e) {
    this.emit('abort', e);
    return;
  }
  // emit the "done" event and pass the result to the callback function
  if (this.sandbox.result) {
    this.emit('done', this.sandbox.result);
  } else {
    this.emit('abort', 'parsing script returned a null value!');
  }
};
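The `Iconv` branch in `next()` above converts ISO-8859-1 response bodies to UTF-8. For reference, modern node can sketch the same conversion with the built-in `Buffer` alone, since its 'latin1' encoding maps byte-for-byte onto U+0000–U+00FF; this is an illustration, not the library's actual code path:

```javascript
// Sketch: ISO-8859-1 (latin1) bytes -> UTF-8 bytes without the iconv addon.
var latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" in ISO-8859-1
var text = latin1Bytes.toString('latin1');               // decode latin1 into a JS string
var utf8Bytes = Buffer.from(text, 'utf8');               // re-encode the string as UTF-8

console.log(text);             // café
console.log(utf8Bytes.length); // 5 ('é' takes two bytes in UTF-8)
```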

node_modules/cheerio/.npmignore (+6 lines, generated file not rendered)

node_modules/cheerio/.travis.yml (+4 lines, generated file not rendered)

0 commit comments