Version 0.3.6 (#38)
* fix edit sitemap metadata

* Fixed images scraping (#21)

* rels #20. fix image selector

* rels #20. restored base64 asset

* rels #20. download each file once

* rels #20. base64 to promises

* rels #20. SelectorImage._getData to promises

* rels #20. Scraper.saveFile to promises

* rels #20. fix

* rels #20. create unique download paths

* rels #20. new version!
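
The "download each file once" step above can be sketched as a promise cache keyed by URL, so concurrent requests for the same image share one download. This is an illustrative sketch; `downloadOnce` and `fetchFile` are hypothetical names, not the extension's actual API.

```javascript
// Cache the in-flight promise per URL so each file is downloaded only once,
// even when several selectors request the same image concurrently.
const downloadCache = new Map();

function downloadOnce(url, fetchFile) {
    if (!downloadCache.has(url)) {
        // Store the promise itself, so concurrent callers share one request.
        downloadCache.set(url, Promise.resolve().then(() => fetchFile(url)));
    }
    return downloadCache.get(url);
}
```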

* Sitemap export (#19)

* Added downloading sitemap as Json file

* Now we have a download button for the sitemap JSON, but the download does not work yet (cause unknown).

* Fix for download json.

* Added import file field

* Added import sitemap from file with some bugs

* Fixed bug with Russian text in imported sitemaps

* Fixed some bugs of import sitemap from file

* Hide double import sitemap

* Fixed style of import sitemap panel

* Buttons in import sitemap panel now have equal style

* Fixed version of export and import sitemap panel

* Fixed style problems

* Import and export sitemap style fixed

* Fixed bug with validation of the _id field in import panel

* Fixed submission import sitemap v2.0

* fixed validation import field

* Fixed bug import panel validation v3.0

Co-authored-by: roman <[email protected]>
Co-authored-by: alexander <[email protected]>
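
The sitemap export/import work above boils down to a JSON round trip with `_id` validation on the way back in. A minimal sketch, with hypothetical helper names and an illustrative CouchDB-style id rule (the extension's actual validation may differ); reading the file as UTF-8 is what avoids the Russian-text bug mentioned above.

```javascript
// Serialize a sitemap for download as a JSON file.
function exportSitemap(sitemap) {
    return JSON.stringify(sitemap, null, 2);
}

// Parse an imported file and validate the _id field (illustrative rule:
// must start with a lowercase letter, CouchDB-friendly characters only).
function importSitemap(json) {
    const sitemap = JSON.parse(json); // throws on malformed input
    if (typeof sitemap._id !== 'string' || !/^[a-z][a-z0-9_$()+\-/]*$/.test(sitemap._id)) {
        throw new Error('Invalid sitemap _id: ' + sitemap._id);
    }
    return sitemap;
}
```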

* fixed initSelectorValidation()

* Fixed bug of empty id field

* fixed initSelectorValidation()

* Added horizontal tables with complicated header (#15)

* added table to tables.html,
added new function for indexing columns,
fixed verticalTables bug,
still need to fix getTableHeaderColumnsFromHTML to get complex column names

* some changes for column table

* Fixed selectorTable for complicated horizontal tables. Need to fix bug for vertical tables

* Added tables with complicated header and vertical tables.

* fixup! Added tables with complicated header and vertical tables.

* Added table in tables.html with complicated rows

* Added rowspan support in table

* Fixed columnsMaker and simplified calling this function

* Added new version of columnsMaker

* Prettier version of columnMaker

* Added table in tables.html with complicated rows

* Refactored playground/tables.html.

* SelectorTable columnsMaker refactoring.

* Refactoring horizontal cells

* More refactoring on SelectorTable.

* Refactored missing columns. And automatic detection of attributes after select. Extracted common part of getHeaders into getHeaderColumnIndices.

* Moved refresh button to Table columns.

* Version upgrade.

* Refactored SelectorTable after review.

Co-authored-by: roman <[email protected]>
Co-authored-by: Alexander <[email protected]>
Co-authored-by: Yatskov <[email protected]>
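
The `getHeaderColumnIndices` extraction mentioned above can be sketched as expanding `colspan` in a complex header row into one name per real column, so data cells can later be matched to columns by index. Illustrative input shape and names; the extension's SelectorTable internals may differ.

```javascript
// Flatten a header row with colspan into a map of column name -> index.
// `headerCells` mimics parsed <th> cells: { name, colspan? }.
function getHeaderColumnIndices(headerCells) {
    const columns = {};
    let index = 0;
    for (const cell of headerCells) {
        const span = cell.colspan || 1;
        for (let i = 0; i < span; i++) {
            // Spanned columns get a numeric suffix to keep names unique.
            const name = span > 1 ? `${cell.name} ${i + 1}` : cell.name;
            columns[name] = index++;
        }
    }
    return columns;
}
```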

* Json viewer (#24)

* First implementation of JSON Viewer.

* Refactored BrowseData and DataPreview. Now we see a list of pretty-printed JSON.

* Fix for bootstrap version. Upgrade of bootstrap caused some defects in removeHTML and trim text checkboxes.
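
At its core, the JSON Viewer above pretty-prints each scraped item in the browse list; the essential step is `JSON.stringify` with an indent. A trivial sketch with a hypothetical helper name:

```javascript
// Pretty-print one scraped item for display in the browse list.
function prettyItem(item) {
    return JSON.stringify(item, null, 2);
}
```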

* Model refactoring (#33)

* Refactored Model, Sitemap and Stores.

* Updated storages.

* Moved RestApi store to promises. Load Templates on promises.

* After Select fixes wip.

* Completed after select refactoring. And refactored model class.

* Updated yarn.lock after merge.

* Fixes after review.

* Fixes after review.

Co-authored-by: Yatskov Alexander <[email protected]>

* Issue#30 (#34)

* Edited documentation for Selector Table

* Added extractMissingColumns in documentation of table selector

* Edited Documentation of selector Table.

* Added renaming columns picture and some docs fixes.

Co-authored-by: Yatskov <[email protected]>

* Translations (#32)

* Template for adding locales. Translated Viewport and SitemapList. Also example for error message on sitemap id validation.

* Added translation for some views

* Added editMetadata translate

* Fixed bug with the Russian version

* Added placeholder for SelectorEdit and SitemapScrapeConfig

* Added selector type and fixed some bugs

* Added placeholders translation.

* Added translation for popup.html and options.html (with a bug)

* Some code

* Translation for validation in controller

* Added 90% of the translations, but still need to activate i18n in options and popup. Some bugs remain.

* Added examples for translating content script, popup and options.

* Fix for export sitemap and export data.

* Added translation

* Revert "Added translation"

This reverts commit 112e846

* Added toolbar translation

* Fix dataPreview translation

* Fix textmanipulation bug

* Fixed incorrect translation of 'list'

* Created class Translate

* Fixed problem with deleting input field for textmanipulation

* Beautified code for translator

* Some fixes

* Fixed translation elements without key

* Fixed translation

* Fix for return.

* Moved to browser i18n.

* Fixes to messages translations.

* Fixed language-change bug. And made CouchDB possible in Russian after switching language.

* Fix formatting in background.js.

* Fixed formatting in editmetadata.

* Fixed formatting in edit sitemap / create sitemap.

* Moved AttachedToolbar to specific html and deduplicated model and url patterns hints.

Co-authored-by: GooDRomka <[email protected]>
Co-authored-by: Yatskov <[email protected]>
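
The "Moved to browser i18n" step above means messages live in `_locales/<lang>/messages.json` and are resolved by key at runtime. The lookup logic can be sketched as follows, falling back to the key itself when a translation is missing so untranslated UI stays readable; in the real extension this wraps `browser.i18n.getMessage`, and the message key below is illustrative.

```javascript
// Sketch of a messages.json-style catalog (normally loaded per locale).
const messages = {
    sitemap_invalid_id: { message: 'The sitemap id is invalid' },
};

// Resolve a translation key, falling back to the key when missing.
function t(key) {
    const entry = messages[key];
    return entry && entry.message ? entry.message : key;
}
```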

* Fixed translation and added translation for search in data.

* Refactored data extraction format, added export in JSON lines (#36)

* rels #27. minimal working example

* rels #27. refactored SelectorDocument and SelectorImage

* rels #27. async _getData in text, html, attr and style selectors

* rels #27. refactored element selectors

* rels #27. refactored link selectors

* rels #27. refactored constant value and input value

* rels #27. group and table

* rels #27. import jquery everywhere

* rels #27. get url attribute with jquery for links

* rels #27. Export data in JSON Lines

* rels #27. flatten data objects for CSV export

* rels #27. fixes

* rels #27. _attachments

* rels #27. include attachments in csv export properly

* rels #27. remove useless data normalization in pouch store

* rels #27. extract filename from Content-Disposition header

* rels #27. md5 checksums for attachments

* rels #27. add meta attributes (url, ts) for items

* rels #27. trim generated paths instead of original filenames

* rels #27. XXX fix to prevent crashes on sitemap creation

* rels #27. fix SelectorGroup

* rels #27. fix item urls in data preview

* rels #27. fixes to export data form

Delay the annoying alert, translate the delimiter error, change the download button color for consistency.

* rels #27. version++

* rels #27. refactored data export

* rels #27. fix

* rels #27. made some buttons blue

* Fixed translations bug for selectors types.

Co-authored-by: Yatskov <[email protected]>
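
The "flatten data objects for CSV export" step above can be sketched as converting each nested JSON item into a flat record with dotted column keys. Illustrative names and conventions (arrays serialized as JSON strings); the extension's own exporter may differ in detail.

```javascript
// Flatten a nested scraped item into one level of dotted CSV column keys,
// e.g. { image: { url } } -> { 'image.url': ... }.
function flattenItem(obj, prefix = '') {
    const flat = {};
    for (const [key, value] of Object.entries(obj)) {
        const column = prefix ? `${prefix}.${key}` : key;
        if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
            Object.assign(flat, flattenItem(value, column));
        } else {
            // Arrays become JSON strings so they fit in a single CSV cell.
            flat[column] = Array.isArray(value) ? JSON.stringify(value) : value;
        }
    }
    return flat;
}
```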

* Updated documentation (#37)

* Updated changelog and installation

* Updated installation and chrome install pic.

* Fixed installation text.

* Added Rest Api storage to docs.

* Updated changelog.

* Updated reference to install guide.

* Fixed install pic positioning.

* Added constant and docs selector into Selectors.md list.

* Forgotten style and input value.

* Fixed some broken refs.

Co-authored-by: Maksim Varlamov <[email protected]>
Co-authored-by: Max Varlamov <[email protected]>
Co-authored-by: GooDRomka <[email protected]>
Co-authored-by: roman <[email protected]>
Co-authored-by: Yatskov <[email protected]>
6 people authored Apr 28, 2020
1 parent 9f350cd commit f14e2c6
Showing 98 changed files with 5,222 additions and 7,594 deletions.
3 changes: 2 additions & 1 deletion .babelrc
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
{
"plugins": [
"@babel/plugin-proposal-optional-chaining"
"@babel/plugin-proposal-optional-chaining",
"@babel/plugin-transform-runtime"
],
"presets": [
["@babel/preset-env", {
Expand Down
2 changes: 1 addition & 1 deletion .prettierrc
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"arrowParens": "avoid",
"bracketSpacing": true,
"printWidth": 180,
"printWidth": 100,
"semi": true,
"singleQuote": true,
"tabWidth": 4,
Expand Down
112 changes: 66 additions & 46 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,66 +1,86 @@
# Web Scraper
Web Scraper is a chrome browser extension built for data extraction from web
pages. Using this extension you can create a plan (sitemap) how a web site
should be traversed and what should be extracted. Using these sitemaps the
Web Scraper will navigate the site accordingly and extract all data. Scraped
data later can be exported as CSV.

Web Scraper is a chrome browser extension built for data extraction from web
pages. Using this extension you can create a plan (sitemap) how a web site
should be traversed and what should be extracted. Using these sitemaps the
Web Scraper will navigate the site accordingly and extract all data. Scraped
data later can be exported as CSV or JSON Lines.

#### Latest Version
To run the latest version you need to [download the project][latest-releases] to your system and [follow the description on Google][get-started-chrome] (select the `extension` folder).


Read about installation process on [installation page](./docs/Installation.md).

## Changelog

### v0.3.6

- Updated table support (improved vertical tables; added complex headers and data rows)
- Added sitemap export and import from file
- Added Russian translations and i18n support, making it possible to add a translation for any language
- Added Rest Api CRUD storage for sitemaps
- Moved to webpack bundler
- Added id hints from predefined model
- Added selectors for Constants and Documents
- Refactored data preview and added search in scraped data
- Refactored returned items model to JSON
- Added saving in JSON Lines

### v0.3
* Enabled pasting of multible start URLs (by [@jwillmer](https://github.com/jwillmer))
* Added scraping of dynamic table columns (by [@jwillmer](https://github.com/jwillmer))
* Added style extraction type (by [@jwillmer](https://github.com/jwillmer))
* Added text manipulation (trim, replace, prefix, suffix, remove HTML) (by [@jwillmer](https://github.com/jwillmer))
* Added image improvements to find images in div background (by [@jwillmer](https://github.com/jwillmer))
* Added support for vertical tables (by [@jwillmer](https://github.com/jwillmer))
* Added random delay function between requests (by [@Euphorbium](https://github.com/Euphorbium))
* Start URL can now also be a local URL (by [@3flex](https://github.com/3flex))
* Added CSV export options (by [@mohamnag](https://github.com/mohamnag))
* Added Regex group for select (by [@RuneHL](https://github.com/RuneHL))
* JSON export/import of settings (by [@haisi](https://github.com/haisi))
* Added date and number pattern in URL (by [@codoff](https://github.com/codoff))
* Added pagination selector limit (by [@codoff](https://github.com/codoff))
* Improved CSV export (by [@haisi](https://github.com/haisi))
* Added click limit option (by [@panna-ahmed](https://github.com/panna-ahmed))

- Enabled pasting of multiple start URLs (by [@jwillmer](https://github.com/jwillmer))
- Added scraping of dynamic table columns (by [@jwillmer](https://github.com/jwillmer))
- Added style extraction type (by [@jwillmer](https://github.com/jwillmer))
- Added text manipulation (trim, replace, prefix, suffix, remove HTML) (by [@jwillmer](https://github.com/jwillmer))
- Added image improvements to find images in div background (by [@jwillmer](https://github.com/jwillmer))
- Added support for vertical tables (by [@jwillmer](https://github.com/jwillmer))
- Added random delay function between requests (by [@Euphorbium](https://github.com/Euphorbium))
- Start URL can now also be a local URL (by [@3flex](https://github.com/3flex))
- Added CSV export options (by [@mohamnag](https://github.com/mohamnag))
- Added Regex group for select (by [@RuneHL](https://github.com/RuneHL))
- JSON export/import of settings (by [@haisi](https://github.com/haisi))
- Added date and number pattern in URL (by [@codoff](https://github.com/codoff))
- Added pagination selector limit (by [@codoff](https://github.com/codoff))
- Improved CSV export (by [@haisi](https://github.com/haisi))
- Added click limit option (by [@panna-ahmed](https://github.com/panna-ahmed))

### v0.2
* Added Element click selector
* Added Element scroll down selector
* Added Link popup selector
* Improved table selector to work with any html markup
* Added Image download
* Added keyboard shortcuts when selecting elements
* Added configurable delay before using selector
* Added configurable delay between page visiting
* Added multiple start url configuration
* Added form field validation
* Fixed a lot of bugs

- Added Element click selector
- Added Element scroll down selector
- Added Link popup selector
- Improved table selector to work with any html markup
- Added Image download
- Added keyboard shortcuts when selecting elements
- Added configurable delay before using selector
- Added configurable delay between page visiting
- Added multiple start url configuration
- Added form field validation
- Fixed a lot of bugs

### v0.1.3
* Added Table selector
* Added HTML selector
* Added HTML attribute selector
* Added data preview
* Added ranged start urls
* Fixed bug which made selector tree not to show on some operating systems

- Added Table selector
- Added HTML selector
- Added HTML attribute selector
- Added data preview
- Added ranged start urls
- Fixed bug which made selector tree not to show on some operating systems

#### Bugs

When submitting a bug please attach an exported sitemap if possible.

#### Development

Read the [Development Instructions](/docs/Development.md) before you start.

## License

LGPLv3

[chrome-store]: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn
[webscraper.io]: http://webscraper.io/
[google-groups]: https://groups.google.com/forum/#!forum/web-scraper
[github-issues]: https://github.com/martinsbalodis/web-scraper-chrome-extension/issues
[get-started-chrome]: https://developer.chrome.com/extensions/getstarted#unpacked
[issue-14]: https://github.com/jwillmer/web-scraper-chrome-extension/issues/14
[latest-releases]: https://github.com/jwillmer/web-scraper-chrome-extension/releases
[chrome-store]: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn
[webscraper.io]: http://webscraper.io/
[google-groups]: https://groups.google.com/forum/#!forum/web-scraper
[github-issues]: https://github.com/martinsbalodis/web-scraper-chrome-extension/issues
[get-started-chrome]: https://developer.chrome.com/extensions/getstarted#unpacked
[latest-releases]: https://github.com/ispras/web-scraper-chrome-extension/releases
46 changes: 39 additions & 7 deletions docs/Installation.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,44 @@
# Installation

You can install the extension from [Chrome store] [1]. After installing it you
should restart chrome to make sure the extension is fully loaded. If you don't
want to restart Chrome then use the extension only in tabs that are
created after installing it.

## Requirements

The extension requires Chrome 31+ . There are no OS limitations.
This extension supports several browsers, at least Chrome and Mozilla Firefox. Opera is also a possible browser to use.

# Install in Chrome

To install the program, you must perform the following actions:

1. Unzip the plugin file `web-scraper-chrome-extension-v<version number>.zip` downloaded from the [release page][latest-releases].
2. Go to the extensions page in the Google Chrome browser - [chrome://extensions/](chrome://extensions/).
3. Enable developer mode using the switch in the upper right corner, if it is not already enabled (you can read more at https://developer.chrome.com/extensions/getstarted#unpacked).
4. You can add the extension to the browser in either of the following ways:
   1. Using drag and drop, move the folder `web-scraper-chrome-extension-v<version number>` obtained from the unzipped file onto the extensions page;
   2. Use "Load unpacked" and select the folder `web-scraper-chrome-extension-v<version number>` obtained from the unzipped file. As a result, a new Web Scraper extension should appear in the list of Google Chrome extensions.
5. Restart Chrome after installation so that the extension works properly.
![Fig. Installing the program in Google Chrome][install-chrome]

# Install in Mozilla Firefox

To install the program, you must perform the following actions:

1. Go to the Mozilla Firefox configuration page: `about:config`.
2. In the search bar, enter the `xpinstall.signatures.required` setting and press Enter.
3. Set the value of this setting to `false` (double-click on the setting's line).
![Fig. Modifying Mozilla Firefox][change-config]
4. Go to the add-ons (extensions) page of the Mozilla Firefox browser: `about:addons`.
5. Open the settings menu by clicking on the corresponding icon.
6. Select "Install Add-on From File…" in the drop-down list.
![Fig. Install Add-on][install-addon]
7. Select the plugin file `web-scraper-chrome-extension-v<version number>.zip` provided with the distribution package of the program.
8. Click on the file selection confirmation button.
![Fig. Selecting a program distribution file][choose-addon-file]
9. Confirm the installation of the extension in the pop-up window.

![Fig. Confirm install extension][confirm-install]

[1]: https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn "Install web scraper from Chrome store"
[install-chrome]: images/installation/chrome_scraper_1.png
[change-config]: images/installation/Firefox_scraper_1.png
[install-addon]: images/installation/Firefox_scraper_2.png
[choose-addon-file]: images/installation/Firefox_scraper_3.png
[confirm-install]: images/installation/Firefox_scraper_4.png
[latest-releases]: https://github.com/ispras/web-scraper-chrome-extension/releases
12 changes: 6 additions & 6 deletions docs/Open Web Scraper.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
# Open Web Scraper

Web Scraper is integrated into chrome Developer tools. Figure 1 shows how you
Web Scraper is integrated into Developer tools. Figure 1 shows how you
can open it. You can also use these shortcuts to open Developer tools. After
opening Developer tools open *Web Scraper* tab.
opening Developer tools open _Web Scraper_ tab.

Shourtcuts:
Shortcuts:

* windows, linux: `Ctrl+Shift+I`, `f12`, open `Tools / Developer tools`
* mac `Cmd+Opt+I`, open `Tools / Developer tools`
- windows, linux: `Ctrl+Shift+I`, `f12`, open `Tools / Developer tools`
- mac `Cmd+Opt+I`, open `Tools / Developer tools`

![Fig. 1: Open Web Scraper][open-web-scraper]

[open-web-scraper]: images/open-web-scraper/open-web-scraper.png?raw=true
[open-web-scraper]: images/open-web-scraper/open-web-scraper.png?raw=true
51 changes: 25 additions & 26 deletions docs/Scraping a site.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ Open the site that you want to scrape.

## Create Sitemap

The first thing you need to do when creating a *sitemap* is specifying the
The first thing you need to do when creating a _sitemap_ is specifying the
start url. This is the url from which the scraping will start. You can also
specify multiple start urls if the scraping should start from multiple places.
For example if you want to scrape multiple search results then you could create
Expand All @@ -13,36 +13,36 @@ a separate start url for each search result.
### Specify multiple urls with ranges

In cases where a site uses numbering in page URLs it is much simpler to create
a range start url than creating *Link selectors* that would navigate the site.
a range start url than creating _Link selectors_ that would navigate the site.
To specify a range url replace the numeric part of start url with a range
definition - `[1-100]`. If the site uses zero padding in urls then add zero
padding to the range definition - `[001-100]`. If you want to skip some urls
then you can also specify an increment like this `[0-100:10]`.

Use range url like this `http://example.com/page/[1-3]` for links like these:

* `http://example.com/page/1`
* `http://example.com/page/2`
* `http://example.com/page/3`
- `http://example.com/page/1`
- `http://example.com/page/2`
- `http://example.com/page/3`

Use range url with zero padding like this `http://example.com/page/[001-100]`
for links like these:

* `http://example.com/page/001`
* `http://example.com/page/002`
* `http://example.com/page/003`
- `http://example.com/page/001`
- `http://example.com/page/002`
- `http://example.com/page/003`

Use range url with increment like this `http://example.com/page/[0-100:10]` for
links like these:

* `http://example.com/page/0`
* `http://example.com/page/10`
* `http://example.com/page/20`
- `http://example.com/page/0`
- `http://example.com/page/10`
- `http://example.com/page/20`
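
The range rules above can be sketched as a small expansion function that turns a range start url into the concrete url list, handling zero padding and increments. This is a hypothetical helper for illustration; the extension's internal parser may differ.

```javascript
// Expand a range start url like http://example.com/page/[001-100] or
// [0-100:10] into the list of concrete urls.
function expandRangeUrl(url) {
    const match = url.match(/\[(\d+)-(\d+)(?::(\d+))?\]/);
    if (!match) return [url]; // no range definition: return the url as-is
    const [token, startStr, endStr, stepStr] = match;
    const step = stepStr ? parseInt(stepStr, 10) : 1;
    // Keep zero padding when the bounds are written with leading zeros.
    const pad = startStr.startsWith('0') ? startStr.length : 0;
    const urls = [];
    for (let n = parseInt(startStr, 10); n <= parseInt(endStr, 10); n += step) {
        const num = pad ? String(n).padStart(pad, '0') : String(n);
        urls.push(url.replace(token, num));
    }
    return urls;
}
```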

## Create selectors

After you have created the *sitemap* you can add selectors to it. In the
*Selectors* panel you can add new selectors, modify them and navigate the
After you have created the _sitemap_ you can add selectors to it. In the
_Selectors_ panel you can add new selectors, modify them and navigate the
selector tree.
The selectors can be added in a tree type structure. The web scraper will
execute the selectors in the order how they are organized in the tree
Expand All @@ -52,10 +52,10 @@ example site.

![Fig. 1: News site][image-news-site]

To scrape this site you can create a *Link selector* which will extract all
To scrape this site you can create a _Link selector_ which will extract all
article links in the first page. Then as a child selector you can add a
*Text selector* that will extract articles from the article pages that the
*Link selector* found links to. Image below illustrates how the *sitemap*
_Text selector_ that will extract articles from the article pages that the
_Link selector_ found links to. Image below illustrates how the _sitemap_
should be built for the news site.

![Fig. 2: News site sitemap][image-news-site-sitemap]
Expand All @@ -66,31 +66,30 @@ to ensure that you have selected the correct elements with the correct data.
More information about selector tree building is available in selector
documentation. You should at least read about these core selectors:

* [Text selector][text-selector]
* [Link selector][link-selector]
* [Element selector][element-selector]
- [Text selector][text-selector]
- [Link selector][link-selector]
- [Element selector][element-selector]

### Inspect selector tree

After you have created selectors for the *sitemap* you can inspect the tree
After you have created selectors for the _sitemap_ you can inspect the tree
structure of selectors in the Selector graph panel. Image below shows an
example selector graph.

![Fig. 3: News site selector graph][image-news-site-selector-graph]

## Scrape the site

After you have created selectors for the *sitemap* you can start scraping. Open
*Scrape* panel and start scraping. A new popup window will open in which the
After you have created selectors for the _sitemap_ you can start scraping. Open
_Scrape_ panel and start scraping. A new popup window will open in which the
scraper will load pages and extract data from them. After the scraping is done
the popup window will close and you will be notified with a popup message. You can view
the scraped data by opening *Browse* panel and export it by opening the
*Export data as CSV* panel.

the scraped data by opening _Browse_ panel and export it by opening the
_Export data_ panel.

[image-news-site]: images/scraping-a-site/news-site.png?raw=true
[image-news-site-sitemap]: images/scraping-a-site/news-site-sitemap.png?raw=true
[image-news-site-selector-graph]: images/scraping-a-site/news-site-selector-graph.png?raw=true
[text-selector]: Selectors/Text%20selector.md
[link-selector]: Selectors/Link%20selector.md
[element-selector]: Selectors/Element%20selector.md
[element-selector]: Selectors/Element%20selector.md