Commit c6f42c1

authored Feb 1, 2019
docs: cleanup and update docs (#238)
1 parent 92de5ce commit c6f42c1

3 files changed: +43 -34 lines changed

CONTRIBUTING.md (+11 -8)
@@ -2,7 +2,7 @@
 
 Thank you for your interest in contributing to Mercury Parser! It's people like you that make Mercury such a useful tool. The below guidelines will help answer any questions you may have about the contribution process. We look forward to receiving contributions from you — our community!
 
-_Please read our [Code of Conduct](./CODE_OF_CONDUCT.md) before participating in our community._
+_Please read our [Code of Conduct](./CODE_OF_CONDUCT.md) before participating._
 
 ## Contents
 
@@ -32,7 +32,7 @@ of contribution and appreciate your help.
 
 Here are a few examples of what we consider a contribution:
 
-- Updates to source code
+- Updates to source code, including bug fixes, improvements, or [creating new custom site extractors](./src/extractors/custom/README.md)
 - Answering questions and chatting with the community in the [Gitter](https://gitter.im/postlight/mercury) room
 - Filing, organizing, and commenting on issues in the [issue tracker](https://github.com/postlight/mercury-parser/issues)
 - Teaching others how to use Mercury
@@ -76,7 +76,7 @@ This section of the document outlines how to build, run, and test Mercury locall
 
 ### Building
 
-To build the required modules for local development, execute the following commands:
+To build the Mercury Parser locally, execute the following commands:
 
 ```bash
 # Clone this repository from GitHub.
@@ -105,7 +105,7 @@ Mercury is a test-driven application; each component has its own test file. Test
 For new code to be accepted, all tests must pass in both environments. To run the required tests for local development, execute the following commands:
 
 ```bash
-# Run the full test suite for both node and the browser
+# Run the full test suite once, for both node and the browser
 yarn test
 
 # Run the tests for node build only
@@ -114,8 +114,12 @@ yarn test:node
 # Run the tests for web build only
 yarn test:web
 
-# Run the tests, then re-run tests on file changes.
-# If an optional <test_file> string is passed, only tests matching that string will be re-run.
+# Run the tests in node, then re-run tests on file changes.
+# If an optional <test_file> string is passed, only tests
+# matching that string will be re-run.
+#
+# E.g., `yarn watch:test nytimes` will run the tests for
+# `./src/extractors/custom/www.nytimes.com/index.test.js`
 yarn watch:test <test_file>
 ```
 
@@ -135,8 +139,7 @@ as you develop is up to you.
 
 In addition to enforcing a JavaScript style guide, we also require that Markdown
 files pass [remarklint](https://github.com/wooorm/remark-lint) with the recommended
-preset. This helps keep our Markdown tidy, consistent, and compatible with a range of
-Markdown parsers used for generating documentation.
+preset. This helps keep our Markdown tidy and consistent.
 
 ### Node.js Version Requirements
 
README.md (+1 -1)
@@ -65,6 +65,6 @@ Licensed under either of the below, at your preference:
 
 ## Contributing
 
-For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see [CONTRIBUTING.md](https://github.com/postlight/mercury-parser/blob/master/CONTRIBUTING.md)
+For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see [CONTRIBUTING.md](./CONTRIBUTING.md)
 
 Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.

src/extractors/custom/README.md (+31 -25)
@@ -1,27 +1,29 @@
 # Custom Parsers
 
-Mercury can extract meaningful content from almost any web site, but custom parsers allow the Mercury parser to find the content more quickly and more accurately than it might otherwise do. Our goal is to include custom parsers for as many sites as we can, and we'd love your help!
+Mercury can extract meaningful content from almost any web site, but custom parsers/extractors allow the Mercury Parser to find the content more quickly and more accurately than it might otherwise do. Our goal is to include custom parsers for as many sites as we can, and we'd love your help!
 
-## The basics of parsing a site with a Mercury custom parser
+## The basics of parsing a site with a custom parser
 
 Custom parsers allow you to write CSS selectors that will find the content you're looking for on the page you're testing against. If you've written any CSS or jQuery, CSS selectors should be very familiar to you.
 
 You can query for every field returned by the Mercury Parser:
 
-- title
-- author
-- content
-- date_published
-- lead_image_url
-- dek
-- next_page_url
-- excerpt
+- `title`
+- `author`
+- `content`
+- `date_published`
+- `lead_image_url`
+- `dek`
+- `next_page_url`
+- `excerpt`
 
 ### Using selectors
 
+CSS selectors allow you to target any content in the HTML document for extraction.
+
 #### Basic selectors
 
-To demonstrate, let's start with something simple: Your selector for the page's title might look something like this:
+To demonstrate, let's start with something simple. A selector for the page's title might look something like this (you can ignore the boilerplate on top and bottom for now and just focus on the `title` key):
 
 ```javascript
 export const ExampleExtractor = {
@@ -37,21 +39,23 @@ export const ExampleExtractor = {
 ...
 ```
 
-As you might guess, the selectors key provides an array of selectors that Mercury will check to find your title text. In our ExampleExtractor, we're saying that the title can be found in the text of an `h1` header with a class name of `hed`.
+As you might guess, the selectors key provides an array of selectors that Mercury will check to find your title text. In our `ExampleExtractor`, we're saying that the title can be found in the text of an `h1` header with a class name of `hed`.
 
 The selector you choose should return one element. If more than one element is returned by your selector, it will fail (and Mercury will fall back to its generic extractor).
 
+Because the `selectors` property returns an array, you can write more than one selector for a property extractor. This is particularly useful for sites that have multiple templates for articles. If you provide an array of selectors, Mercury will try each in order, falling back to the next until it finds a match or exhausts the options (in which case it will fall back to its default generic extractor).
+
 #### Selecting an attribute
 
-Sometimes the information you want to return lives in an element's attribute rather than its text — e.g., sometimes a more exact ISO-formatted date/time will be stored in an attribute of an element.
+Sometimes the information you want to return lives in an element's attribute rather than its text — e.g., often a more exact ISO-formatted date/time will be stored in an attribute of an element.
 
-So your element looks like this:
+Say your element looks like this:
 
 ```html
 <time class="article-timestamp" datetime="2016-09-02T07:30:01-04:00"></time>
 ```
 
-The text you want isn't the text inside a matching element, but rather, inside the datetime attribute. To write a selector that returns an attribute, you provide your custom parser with a two-element array. The first element is your selector; the second element is the attribute you'd like to return.
+The text you want isn't the text inside a matching element, but rather, inside the `datetime` attribute. To write a selector that returns an attribute, you provide your custom parser with a two-element array. The first element is your selector; the second element is the attribute you'd like to return.
 
 ```javascript
 export const ExampleExtractor = {
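
Reviewer sketch (not part of the commit): the hunk above describes two behaviours, trying each selector in order until one matches a single element, and returning an attribute when an entry is a two-element array. The snippet below is a minimal model of that logic under stated assumptions. It does not use Mercury's real internals: `page`, `query`, and `extractField` are hypothetical names, and the toy matcher only understands `tag`, `.class`, and `tag.class` selectors.

```javascript
// Hypothetical stand-in for a parsed page: a flat list of elements.
const page = [
  { tag: 'h1', classes: ['hed'], attrs: {}, text: 'Example headline' },
  {
    tag: 'time',
    classes: ['article-timestamp'],
    attrs: { datetime: '2016-09-02T07:30:01-04:00' },
    text: '',
  },
];

// Toy matcher (assumption): supports only `tag`, `.class`, and `tag.class`.
function query(elements, selector) {
  const [tag, ...classes] = selector.split('.');
  return elements.filter(
    el =>
      (tag === '' || el.tag === tag) &&
      classes.every(c => el.classes.includes(c))
  );
}

// Try each entry in order. An entry is either a selector string (return the
// element's text) or a [selector, attribute] pair (return that attribute).
// A selector only counts when it matches exactly one element.
function extractField(elements, selectors) {
  for (const entry of selectors) {
    const [selector, attr] = Array.isArray(entry) ? entry : [entry];
    const matches = query(elements, selector);
    if (matches.length === 1) {
      return attr ? matches[0].attrs[attr] : matches[0].text;
    }
  }
  return null; // nothing matched: the caller would use the generic extractor
}

// 'h1.headline' matches nothing here, so the second selector is used.
const title = extractField(page, ['h1.headline', 'h1.hed']);
const datePublished = extractField(page, [
  ['time.article-timestamp', 'datetime'],
]);
```

Returning `null` stands in for "fall back to the generic extractor"; how Mercury signals that internally is not shown in this diff.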
@@ -69,11 +73,11 @@ export const ExampleExtractor = {
 
 This is all you'll need to know to handle most of the fields Mercury parses (titles, authors, date published, etc.). Article content is the exception.
 
-### Cleaning content
+### Cleaning content from an article
 
 An article's content can be more complex than the other fields, meaning you sometimes need to do more than just provide the selector(s) in order to return clean content.
 
-For example, sometimes an article's content will contain related content that doesn't translate or render well when you just want to see the article's content. The clean key allows you to provide an array of selectors identifying elements that should be removed from the content.
+For example, sometimes an article's content will contain related content (e.g., _Read also_) that doesn't translate or render well when you just want to see the article. The `clean` key allows you to provide an array of selectors identifying elements that should be removed from the content.
 
 Here's an example:
 
@@ -98,11 +102,13 @@ export const ExampleExtractor = {
 }
 ```
 
+The above example will first select the content based on either of the two `content` selectors, then it will clean any nodes from the selected content that match the selectors defined by `clean`.
+
 ### Using transforms
 
 Occasionally, in order to mold the article content to a form that's readable outside the page, you need to transform a few elements inside the content you've chosen. That's where `transforms` come in.
 
-This example demonstrates a simple transform that converts h1 headers to h2 headers, along with a more complex transform that transforms lazy-loaded images to images that will render as you would expect outside the context of the site you're extracting from.
+This example demonstrates a simple transform that converts `h1` headers to `h2` headers, along with a more complex transform that transforms lazy-loaded images to images that will render as you would expect outside the context of the site you're extracting from.
 
 ```javascript
 export const ExampleExtractor = {
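
Reviewer sketch (not part of the commit): the sentence added at new line 105 describes a two-step order, select first and clean second. A minimal model of that order follows, assuming a simplified page representation. `containers`, `extractContent`, and every selector string are hypothetical, and each node simply lists the selectors it would match instead of being parsed from HTML.

```javascript
// Hypothetical extractor fragment; the selectors are illustrative only.
const contentConfig = {
  selectors: ['div.article-body', 'div.story'],
  clean: ['.related', 'aside'],
};

// Hypothetical page model: container selector -> child nodes. Each node
// records which selectors it would match (assumption made for brevity).
const containers = {
  'div.story': [
    { html: '<p>First paragraph.</p>', matches: ['p'] },
    { html: '<div class="related">Read also</div>', matches: ['div', '.related'] },
    { html: '<p>Second paragraph.</p>', matches: ['p'] },
  ],
};

// Step 1: pick the first content selector that exists on the page.
// Step 2: drop every node inside it that matches a `clean` selector.
function extractContent(pageContainers, { selectors, clean = [] }) {
  const selected = selectors.find(sel => pageContainers[sel]);
  if (!selected) return null;
  return pageContainers[selected]
    .filter(node => !clean.some(sel => node.matches.includes(sel)))
    .map(node => node.html)
    .join('');
}

const content = extractContent(containers, contentConfig);
```

Cleaning runs only on the selected container, which is why a `clean` selector never needs to be scoped to the article body.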
@@ -126,7 +132,7 @@ export const ExampleExtractor = {
       // the transformation.
 
       // Convert lazy-loaded noscript images to figures
-      noscript: ($node) => {
+      noscript: $node => {
         const $children = $node.children();
         if ($children.length === 1 && $children.get(0).tagName === 'img') {
           return 'figure';
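
Reviewer sketch (not part of the commit): the two transform styles in this part of the README, a string value and a function value, can be modelled as below. This is an assumption-laden illustration and not Mercury's implementation: nodes are plain objects instead of cheerio nodes, `applyTransforms` is a hypothetical helper, and the rule that a function returning nothing leaves the node untouched is inferred from the truncated comment in the hunk above.

```javascript
// Transform map in the style of the README example.
const transforms = {
  // String value: rename every h1 to h2.
  h1: 'h2',

  // Function value: return a new tag name, or null to leave the node alone.
  noscript: node => {
    if (node.children.length === 1 && node.children[0].tag === 'img') {
      return 'figure';
    }
    return null;
  },
};

// Hypothetical helper: apply the map to a flat list of plain-object nodes.
function applyTransforms(nodes, transformMap) {
  return nodes.map(node => {
    const transform = transformMap[node.tag];
    if (!transform) return node;
    const newTag =
      typeof transform === 'string' ? transform : transform(node);
    return newTag ? { ...node, tag: newTag } : node;
  });
}

const transformed = applyTransforms(
  [
    { tag: 'h1', children: [] },
    { tag: 'noscript', children: [{ tag: 'img' }] },
    { tag: 'noscript', children: [] },
    { tag: 'p', children: [] },
  ],
  transforms
);
const resultTags = transformed.map(node => node.tag);
```

Keeping transforms to tag renames where possible matches the README's advice that direct DOM manipulation inside a transform is discouraged unless absolutely necessary.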
@@ -138,11 +144,11 @@ export const ExampleExtractor = {
 },
 ```
 
-For much more complex transforms, you can perform DOM manipulation within the transform function, but this is discouraged unless absolutely necessary. See, for example, the lazy-loaded image transform in [the NYTimesExtractor](www.nytimes.com/index.js#L25), which transforms the src attribute on the lazy-loaded image.
+For much more complex transforms, you can perform DOM manipulation within the transform function, but this is discouraged unless absolutely necessary. See, for example, the lazy-loaded image transform in [the NYTimesExtractor](www.nytimes.com/index.js#L25), which transforms the `src` attribute on the lazy-loaded image.
 
 ## How to generate a custom parser
 
-Now that you know the basics of how custom extractors work, let's walk through the workflow for how to write and submit one. For our example, we're going to use [The New Yorker](http://www.newyorker.com/). (You can find the results of this tutorial [in the NewYorkerExtractor source](www.newyorker.com).)
+Now that you know the basics of how custom extractors work, let's walk through the workflow for how to write and submit one. For our example, we're going to create a custom parser for [The New Yorker](http://www.newyorker.com/). (You can find the results of this tutorial [in the NewYorkerExtractor source](www.newyorker.com).)
 
 ### Step 0: Installation
 
@@ -162,14 +168,14 @@ If you don't have already have watchman installed, you'll also need to install t
 brew install watchman
 ```
 
-You should also create a new git branch for your custom extractor:
+Take a look at the existing custom parsers in [`src/extractors/custom`](/src/extractors/custom) for examples and to check if the site you want to write a parser for already exists.
+
+If not, go ahead and create a new git branch for your custom extractor:
 
 ```bash
 git checkout -b feat-new-yorker-extractor
 ```
 
-Now that you're ready to go, take a look at the live custom parsers in [`src/extractors/custom`](/src/extractors/custom) for examples and to check if the site you want to write a parser for already exists.
-
 ### Step 1: Generate your custom parser
 
 If we don't already have a parser for the site you want to contribute, you're ready to generate a new custom parser. To do so, run:
@@ -188,7 +194,7 @@ When the generator script completes, you'll be prompted to run:
 yarn watch:test www.newyorker.com
 ```
 
-This will run the tests for the parser you just generated, which should fail (which makes sense — you haven't written it yet!). Your goal now is to follow the instructions in the generated `www.newyorker.com/index.test.js` and `www.newyorker.com/index.js` files until they pass!
+This will run the tests for the parser you just generated, which should fail (which makes sense — you haven't written any selectors yet!). Your goal now is to follow the instructions in the generated `www.newyorker.com/index.test.js` and `www.newyorker.com/index.js` files until they pass!
 
 ### Step 2: Passing your first test: Title extraction
 