The goal of this project is to create an extensible system for extracting data from web pages. It currently uses Selenium WebDriver (via php-webdriver), QueryPath, and a configuration file that specifies which components to extract and how to output the results.
The "job" configuration file defines every aspect of a run: the system (database, infrastructure), the web site, and the data you wish to extract.
It is written in XML and has the following options:
- Child element "site" must be defined
- Child element "steps" is recommended, as it drives the actions
Database
Currently a single MySQL database is accepted. If elements are defined, the XML will be imported into the database->table per the specifications in the configuration file.
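The import step can be pictured roughly as follows. This is a minimal sketch in Python rather than the project's PHP, and it uses an in-memory SQLite database from the standard library as a stand-in for MySQL so that it runs anywhere. The `records`/`record` XML layout, the `products` table, and the `import_records` helper are all assumptions made for illustration, not the project's actual schema or API.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical output XML: one <record> per scraped item.
OUTPUT_XML = """
<records>
  <record><title>Blue Widget</title><price>9.99</price></record>
  <record><title>Red Widget</title><price>12.50</price></record>
</records>
"""

def import_records(conn, table, columns, xml_text):
    """Insert one row per <record>, taking each column value from the
    child element of the same name. Returns the number of rows inserted."""
    # Table and column names cannot be bound as SQL parameters, so only
    # accept plain identifiers before interpolating them into the query.
    for name in [table, *columns]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    rows = [
        tuple(record.findtext(col, default="") for col in columns)
        for record in ET.fromstring(xml_text).iter("record")
    ]
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    conn.execute("CREATE TABLE products (title TEXT, price TEXT)")
    print(import_records(conn, "products", ["title", "price"], OUTPUT_XML))
    print(conn.execute("SELECT title, price FROM products").fetchall())
```

With a real MySQL server the connection and placeholder style would differ, but the shape of the step (parse the XML, map child elements to columns, insert rows) would stay the same.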
Actions
- Click
- Type
- Captcha
Elements
- Input - CSS selectors used by QueryPath to pull data from a web page
- Output - Element name used in the output XML
Samples are included in the /examples folder.
The definitions in the configuration determine how the output is formatted (element names).
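As a rough illustration of how Input/Output pairs shape the result, the sketch below maps extracted values to output element names. It is written in Python purely for illustration (the project itself is PHP). The element names in the job snippet (`job`, `elements`, `element`, `input`, `output`), the `record` wrapper, and the `build_output` helper are hypothetical, not the project's actual schema; see the /examples folder for real configurations. Extraction is stubbed out with a dictionary keyed by CSS selector, where the real system would use QueryPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical job fragment: each element pairs a CSS selector (input)
# with the element name to use in the output XML (output).
JOB_XML = """
<job>
  <elements>
    <element><input>h1.title</input><output>title</output></element>
    <element><input>span.price</input><output>price</output></element>
  </elements>
</job>
"""

def build_output(job_xml, extracted):
    """Return output XML whose element names come from the job config.

    `extracted` maps CSS selector -> text, standing in for the values
    that QueryPath would pull from the page.
    """
    job = ET.fromstring(job_xml)
    record = ET.Element("record")  # assumed wrapper element name
    for element in job.iter("element"):
        selector = element.findtext("input")
        name = element.findtext("output")
        child = ET.SubElement(record, name)
        child.text = extracted.get(selector, "")
    return ET.tostring(record, encoding="unicode")

if __name__ == "__main__":
    page_values = {"h1.title": "Blue Widget", "span.price": "9.99"}
    print(build_output(JOB_XML, page_values))
```

Renaming an `output` value in the job file is then enough to rename the corresponding element in the result, without touching the selector.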
GET THE CODE
git clone git@github.com:kjenney/php-webminer.git
Add the dependency to your composer.json (https://packagist.org/packages/kjenney/php-webminer):
{
    "require": {
        "kjenney/php-webminer": "dev-master"
    }
}
BUILD WITH DEPENDENCIES
Download the composer.phar
curl -sS https://getcomposer.org/installer | php
Install the library.
php composer.phar install
Install PHP5 extensions (apt-get on Debian/Ubuntu, yum on RHEL/CentOS)
apt-get install php5-tidy
yum install php-tidy
apt-get install php5-mysqlnd
Install Tesseract (optional)
apt-get install tesseract-ocr
All you need as the server for this client is the selenium-server-standalone-#.jar file provided here: http://www.seleniumhq.org/download/
Download and run that file, replacing # with the current server version:
java -jar selenium-server-standalone-#.jar
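Once the server is running, it can be handy to confirm that it is reachable before starting a job. The sketch below is one way to do that, written in Python for illustration. It assumes the server was started locally with default settings, where Selenium 2/3 standalone servers listen on port 4444 and expose a status endpoint at /wd/hub/status; adjust `STATUS_URL` if your setup differs. The `selenium_is_up` helper is a hypothetical name, not part of this project.

```python
import urllib.error
import urllib.request

# Assumed default address of a locally started standalone server.
STATUS_URL = "http://localhost:4444/wd/hub/status"

def selenium_is_up(url=STATUS_URL, timeout=3.0):
    """Return True if the status endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False

if __name__ == "__main__":
    print("Selenium server reachable:", selenium_is_up())
```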
There's still a lot of work to be done, and I welcome any help or suggestions.
Feel free to create issues and recommend features.