
Scraper is a toy tool for extracting web data using only the mouse.

All CSS styles and layouts come from https://dexi.io.
All images, including the logo, come from https://dexi.io.
All functions are modeled on https://dexi.io but are self-implemented.

Dexi.io provides such an amazing tool that I wanted to make a small copy of it.

Here is a GIF showing how the little toy works.

Open a web page in Scraper.
Move the mouse over the page to choose which part to extract.
Alternatively, select a DOM element in the Elements tab, the same way you select an element in Chrome's developer tools.
All of these operations are written to a robot definition file.
Run executes the operations defined in the robot JSON file.
The Result tab shows the extracted raw text.

example 1:

example 2:

example 3:

Features

No coding is needed, but you do need to know web concepts, including HTML and XPath.
Uses XPath selectors, whereas dexi.io uses CSS selectors.
Uses the PhantomJS engine at the backend.
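
To make the XPath point concrete, here is a minimal sketch of selecting text with XPath expressions. It uses Ruby's REXML library purely for illustration; it is not taken from Scraper's code, and the page fragment and the `extract_text` helper are hypothetical.

```ruby
require "rexml/document"

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = <<~MARKUP
  <html>
    <body>
      <div id="products">
        <p class="name">Widget</p>
        <p class="price">9.99</p>
        <p class="name">Gadget</p>
        <p class="price">19.50</p>
      </div>
    </body>
  </html>
MARKUP

# Return the stripped text of every element matched by an XPath expression.
def extract_text(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |node| node.text.strip }
end

names  = extract_text(PAGE, "//div[@id='products']/p[@class='name']")
prices = extract_text(PAGE, "//div[@id='products']/p[@class='price']")

puts names.inspect
puts prices.inspect
```

For comparison, a CSS selector that targets the same name elements would be `div#products > p.name`.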

Installation

To install Scraper, install Rails and download PhantomJS.

After installing Rails

On CentOS, Mac, and Windows, update the scraper project:

cd <scraper root>
bundle update

Then all dependencies will be installed.

On Mac, installing the nokogiri gem may run into problems, so be patient.

If you cannot connect to http://rubygems.org or https://rubygems.org, use a proxy for the update.

Set path for phantomjs

Edit ./config/scraper.yml and modify the path.

The path must look like this:

development:
  phantomjs_full_path : /home/ym/ym/phantomjs/bin/phantomjs

On Windows, the path looks like this:

development:
  phantomjs_full_path : d:/work/bin/phantomjs.exe
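
Before starting the server, it can help to confirm that the configured path really points at the PhantomJS binary. The following is a minimal sketch that parses YAML shaped like the snippets above and checks the path. How Scraper itself reads `./config/scraper.yml` is not shown here, and the helper names `phantomjs_path` and `phantomjs_usable?` are hypothetical.

```ruby
require "yaml"

# Look up phantomjs_full_path for one environment in the given YAML text.
def phantomjs_path(yaml_text, env = "development")
  config = YAML.safe_load(yaml_text) || {}
  (config[env] || {})["phantomjs_full_path"]
end

# Report whether the path points at an existing, executable file.
def phantomjs_usable?(path)
  !path.nil? && File.file?(path) && File.executable?(path)
end

# Inline sample mirroring the Linux example above; to check a real
# setup, pass File.read("config/scraper.yml") instead.
sample = <<~YML
  development:
    phantomjs_full_path : /home/ym/ym/phantomjs/bin/phantomjs
YML

path = phantomjs_path(sample)
puts "configured path: #{path}"
puts "usable: #{phantomjs_usable?(path)}"
```

If the second line prints `false`, the path is missing, is not a regular file, or is not executable by the current user.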

Start scraper

Enter the scraper root directory and start the server.

cd <scraper root>
rails server

On Windows, you can run cmd.exe to open a command-line window.

The server will then start, with output like the following.

[ym@centos7 scraper]$ rails server
=> Booting WEBrick
=> Rails 4.2.6 application starting in development on http://localhost:3000
=> Run `rails server -h` for more startup options
=> Ctrl-C to shutdown server
[2016-11-03 17:59:46] INFO  WEBrick 1.3.1
[2016-11-03 17:59:46] INFO  ruby 2.2.1 (2015-02-26) [x86_64-linux]
[2016-11-03 17:59:46] INFO  WEBrick::HTTPServer#start: pid=4052 port=3000

Then open the Chrome browser (IE, Firefox, and Safari are not supported) and go to http://127.0.0.1:3000.
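
If the page does not load, a quick way to tell whether the server is listening at all is a small reachability check. This is a minimal sketch using Ruby's standard `net/http` library; the `server_up?` helper is hypothetical and only confirms that something answers HTTP at the address, not that Scraper is working correctly.

```ruby
require "net/http"
require "uri"

# Return true if an HTTP server answers at the given URL, false if the
# connection is refused, times out, or the host cannot be resolved.
def server_up?(url, timeout: 2)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.head("/").is_a?(Net::HTTPResponse)
  end
rescue SystemCallError, Timeout::Error, SocketError
  false
end

puts "server reachable: #{server_up?('http://127.0.0.1:3000')}"
```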

Only supports Chrome browser

Scraper is a toy, so I developed it using only my favorite browser.

Documentation and Help

If you don't know how to use this little toy, you can look up the documentation for dexi.io.

Contributing

This project is just a little toy, so it has many bugs. If you find any, please let me know.

yanggeorge/scraper

Folders and files

Latest commit

History

Repository files navigation

Features

Installation

After install rails.

Set path for phantomjs

Start scraper

Only support Chrome browser

Documentation and Help

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages