Commit 4f9c9ed

Merge pull request #26 from mislav/tool-updates
Tool updates
2 parents 7c60d68 + 716c91a commit 4f9c9ed

52 files changed: +3355 -1123 lines changed

.gitignore

Lines changed: 1 addition & 2 deletions
@@ -1,4 +1,3 @@
 .mypy_cache/
 __pycache__/
-extractors/go_readability/go_readability_cli
-extractors/go_domdistiller/go_domdistiller_cli
+venv

Makefile

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+.PHONY: setup
+setup:
+	pip install -r requirements.txt
+	# for news_please:
+	python -m nltk.downloader --dir "${VIRTUAL_ENV}"/nltk_data punkt_tab
+	cd extractors/readability_js && npm install --no-audit --no-fund
+	make -C extractors/go_domdistiller
+	make -C extractors/go_readability
+	make -C extractors/go_readability_readeck
+	make -C extractors/go_trafilatura
+
+.PHONY: run-all
+run-all: run-go run-python
+	python extractors/run_readability_js.py
+
+.PHONY: run-go
+run-go: setup
+	python extractors/go_domdistiller.py
+	python extractors/run_go_readability_readeck.py
+	python extractors/run_go_readability.py
+	python extractors/run_go_trafilatura.py
+
+.PHONY: run-python
+run-python: setup
+	python extractors/run_beautifulsoup.py
+	# PYTHON2 extractors/run_boilerpipe.py
+	# ERRORED extractors/run_dragnet.py
+	python extractors/run_goose3.py
+	python extractors/run_html_text.py
+	python extractors/run_html2text.py
+	python extractors/run_inscriptis.py
+	python extractors/run_justext.py
+	python extractors/run_readability.py
+	python extractors/run_trafilatura.py
+	python extractors/run_xpath_text.py
+
+.PHONY: run-slow
+# These libraries use machine learning inference, so they can be extremely slow
+run-slow: setup
+	python extractors/run_news_please.py
+	python extractors/run_newspaper.py

README.rst

Lines changed: 31 additions & 16 deletions
@@ -6,13 +6,15 @@ extraction for commercial services
 `Zyte Automatic Extraction (ours) <https://www.zyte.com/data-types/news-scraping-api/>`_,
 `Diffbot <https://www.diffbot.com/>`_
 and open-source libraries
-`newspaper3k <https://newspaper.readthedocs.io/en/latest/>`_,
+`newspaper4k <https://github.com/AndyTheFactory/newspaper4k>`_,
 `readability-lxml <https://github.com/buriy/python-readability>`_,
 `dragnet <https://github.com/dragnet-org/dragnet>`_,
 `boilerpipe <https://github.com/misja/python-boilerpipe>`_,
 `html-text <https://github.com/TeamHG-Memex/html-text>`_,
 `trafilatura <https://github.com/adbar/trafilatura>`_,
+`go-trafilatura <https://github.com/markusmobius/go-trafilatura>`_,
 `go-readability <https://github.com/go-shiori/go-readability>`_,
+`readeck/go-readability <https://codeberg.org/readeck/go-readability>`_,
 `Readability.js <https://github.com/mozilla/readability>`_,
 `Go-DomDistiller <https://github.com/markusmobius/go-domdistiller>`_.
 `news-please <https://github.com/fhamborg/news-please>`_.
@@ -47,7 +49,25 @@ Results of the initial evaluation, done in November 2019::
     readability-lxml    0.7.1      0.922 ± 0.014   0.913 ± 0.014   0.931 ± 0.016   0.315 ± 0.035
     xpath-text          4.4.2      0.394 ± 0.020   0.246 ± 0.016   0.992 ± 0.001   0.000 ± 0.000

-Result of packages added after original evaluation::
+Results of the latest evaluation with open source libraries added::
+
+                        version    F1              precision       recall          accuracy
+    go-trafilatura      ae7ea06    0.960 ± 0.007   0.940 ± 0.009   0.980 ± 0.006   0.287 ± 0.033
+    trafilatura         2.0.0      0.958 ± 0.006   0.938 ± 0.009   0.978 ± 0.006   0.293 ± 0.033
+    newspaper4k         0.9.3.1    0.949 ± 0.008   0.964 ± 0.008   0.934 ± 0.011   0.326 ± 0.033
+    news_please         1.6.16     0.948 ± 0.008   0.964 ± 0.008   0.933 ± 0.011   0.326 ± 0.034
+    readability_js      0.6.0      0.947 ± 0.005   0.914 ± 0.008   0.982 ± 0.003   0.166 ± 0.028
+    go_readability_fork fb0fbc5    0.947 ± 0.005   0.914 ± 0.008   0.982 ± 0.003   0.166 ± 0.028
+    go_readability      9f5bf5c    0.934 ± 0.009   0.900 ± 0.011   0.971 ± 0.009   0.188 ± 0.029
+    go_domdistiller     25b8d04    0.927 ± 0.007   0.901 ± 0.010   0.956 ± 0.009   0.061 ± 0.017
+    readability-lxml    0.8.4.1    0.922 ± 0.013   0.913 ± 0.014   0.931 ± 0.015   0.315 ± 0.034
+    goose3              3.1.20     0.896 ± 0.015   0.940 ± 0.013   0.856 ± 0.020   0.232 ± 0.031
+    beautifulsoup       4.13.5     0.860 ± 0.016   0.850 ± 0.016   0.870 ± 0.020   0.006 ± 0.006
+    justext             3.0.2      0.804 ± 0.018   0.858 ± 0.016   0.756 ± 0.027   0.088 ± 0.021
+    inscriptis          2.6.0      0.679 ± 0.015   0.517 ± 0.018   0.992 ± 0.001   0.000 ± 0.000
+    html2text           2025.4.15  0.662 ± 0.015   0.499 ± 0.017   0.983 ± 0.002   0.000 ± 0.000
+
+Earlier results from April 2021::

                         version    F1              precision       recall          accuracy
     trafilatura         0.5.1      0.945 ± 0.009   0.925 ± 0.011   0.966 ± 0.009   0.221 ± 0.031
@@ -108,15 +128,17 @@ In addition to benchmarking AutoExtract and Diffbot services, we also benchmark
 open-source libraries that work directly on HTML files without a need for rendering
 or external resources:

-- newspaper3k: https://github.com/codelucas/newspaper
+- newspaper4k: https://github.com/AndyTheFactory/newspaper4k
 - readability-lxml: https://github.com/buriy/python-readability
 - dragnet: https://github.com/dragnet-org/dragnet
 - boilerpipe: https://github.com/misja/python-boilerpipe
 - html-text: https://github.com/TeamHG-Memex/html-text -
   this is a baseline which extracts the full text of HTML page
 - trafilatura: https://github.com/adbar/trafilatura contributed by the author
   at https://github.com/scrapinghub/article-extraction-benchmark/pull/4
+- go-trafilatura: https://github.com/markusmobius/go-trafilatura
 - go-readability: https://github.com/go-shiori/go-readability
+- readeck/go-readability: https://codeberg.org/readeck/go-readability/src/branch/main/FORK.md
 - Readability.js: https://github.com/mozilla/readability
 - Go-DomDistiller: https://github.com/markusmobius/go-domdistiller
 - news-please: https://github.com/fhamborg/news-please
@@ -133,21 +155,14 @@ or external resources:
 Output from these libraries is already present in the repo in ``output/*.json`` files.
 They were generated with ``extractors/run_*.py`` files.

-All dependencies are in ``requirements.txt``.
-Note that dragnet may fail to install at first try, as
-you need to have ``numpy`` and ``Cython`` installed, and have ``libxml2`` headers
-(``libxml2-dev`` on Ubuntu).
+You can re-generate output JSON files with:

-boilerpipe requires a custom installation: use python2, you also need Java
-(e.g. install ``default-jre`` in Ubuntu), install it with
-``pip install -e git+https://github.com/misja/python-boilerpipe.git@ab3694d7bf695b73f0684a028e70aa816d63e6cb#egg=boilerpipe``
+    python3 -m venv ./venv
+    source ./venv/bin/activate
+    make run-all

-go-readability requires a custom installation: see README in ``extractors/go_readability``.
-
-Readability.js require a custom installation: install nodejs and install cli tool:
-``npm install -g [email protected]``
-
-Go-DomDistiller requires a custom installation: see README in ``extractors/go_domdistiller``.
+This will install Python dependencies from ``requirements.txt`` into a
+`virtual environment <https://docs.python.org/3/library/venv.html>`_

 Evaluation
 ----------

evaluate.py

Lines changed: 6 additions & 4 deletions
@@ -22,14 +22,16 @@ def main():
     args = parser.parse_args()
     ground_truth = load_json(Path('ground-truth.json'))
     metrics_by_name = {}
+
+    print(f'{'':<20} {'F1':<13} {'precision':<13} {'recall':<13} {'accuracy':<13}')
     for path in sorted(Path('output').glob('*.json')):
         name = path.stem
         metrics = evaluate(ground_truth, load_json(path), args.n_bootstrap)
         print('{name:<20} '
-              'precision={precision:.3f} ± {precision_std:.3f} '
-              'recall={recall:.3f} ± {recall_std:.3f} '
-              'F1={f1:.3f} ± {f1_std:.3f} '
-              'accuracy={accuracy:.3f} ± {accuracy_std:.3f} '
+              '{f1:.3f} ± {f1_std:.3f} '
+              '{precision:.3f} ± {precision_std:.3f} '
+              '{recall:.3f} ± {recall_std:.3f} '
+              '{accuracy:.3f} ± {accuracy_std:.3f}'
               .format(name=name, **metrics))
         metrics_by_name[name] = metrics

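The new header pads each metric column to 13 characters, which is the width of one `0.960 ± 0.007` cell, so header and rows line up. Below is a minimal sketch of that alignment. The `format_header` and `format_row` helpers and the sample values are hypothetical and are not part of `evaluate.py`; only the `f1`/`f1_std`-style key names follow the format strings in the diff. The outer f-string uses double quotes so that it also parses on Python versions before 3.12, where reusing the enclosing quote character inside an f-string is a syntax error.

```python
# Illustrative only: the helpers and the sample metrics below are
# hypothetical, not taken from evaluate.py or from the benchmark results.
COLUMNS = ("f1", "precision", "recall", "accuracy")


def format_header() -> str:
    # 20 characters for the name column, 13 for each metric column.
    return f"{'':<20} {'F1':<13} {'precision':<13} {'recall':<13} {'accuracy':<13}"


def format_row(name: str, metrics: dict) -> str:
    # Each cell is "x.xxx ± x.xxx": 5 + 3 + 5 = 13 characters.
    cells = [f"{metrics[c]:.3f} ± {metrics[c + '_std']:.3f}" for c in COLUMNS]
    return f"{name:<20} " + " ".join(cells)


sample = {
    "f1": 0.9, "f1_std": 0.01,
    "precision": 0.8, "precision_std": 0.02,
    "recall": 0.7, "recall_std": 0.03,
    "accuracy": 0.6, "accuracy_std": 0.04,
}

print(format_header())
print(format_row("example", sample))
```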

extractors/go_domdistiller.py

Lines changed: 10 additions & 10 deletions
@@ -3,6 +3,7 @@
 import json
 import os
 import subprocess
+import sys
 from pathlib import Path
 from tempfile import mkstemp

@@ -11,25 +12,24 @@
 CLI_PATH = Path('extractors/go_domdistiller/go_domdistiller_cli')


+def normalize(s: str) -> str:
+    # remove all U+00AD (SOFT HYPHEN)
+    return s.replace('\u00ad', '')
+
+
 def main():
     output = {}
     for path in Path('html').glob('*.html.gz'):
         with gzip.open(path, 'rt', encoding='utf8') as f:
             html = f.read()
         item_id = path.stem.split('.')[0]

-        # save html to temp file
-        temp_filepath = mkstemp()[1]
-        with open(temp_filepath, 'wt') as fw:
-            fw.write(html)
-
         # get extracted content from go-domdistiller
-        result = subprocess.run([CLI_PATH, temp_filepath], stdout=subprocess.PIPE)
-
-        # destroy temp file
-        os.remove(temp_filepath)
+        result = subprocess.run(CLI_PATH, input=html, text=True, stdout=subprocess.PIPE)
+        if result.returncode != 0:
+            print("failed: ",path,file=sys.stderr)

-        output[item_id] = {'articleBody': result.stdout.decode('utf-8')}
+        output[item_id] = {'articleBody': normalize(result.stdout)}
     (Path('output') / 'go_domdistiller.json').write_text(
         json.dumps(output, sort_keys=True, ensure_ascii=False, indent=4),
         encoding='utf8')
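The updated runner passes the HTML to the CLI on standard input instead of writing a temporary file, then strips soft hyphens from the result. Below is a minimal sketch of that pattern. It assumes a stand-in command (a Python one-liner that copies stdin to stdout) in place of the real `go_domdistiller_cli` binary. `run_cli` and `ECHO_CMD` are hypothetical names, and the explicit `encoding="utf8"` is an addition that the commit does not make (it uses `text=True`).

```python
import subprocess
import sys

# Stand-in for the extractor binary (an assumption for this sketch):
# a Python one-liner that copies stdin to stdout byte for byte.
ECHO_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]


def normalize(s: str) -> str:
    # Remove all U+00AD (SOFT HYPHEN) characters, as in the commit.
    return s.replace("\u00ad", "")


def run_cli(cmd, html: str) -> str:
    # Passing encoding= enables text mode, so input and stdout are str.
    result = subprocess.run(
        cmd, input=html, encoding="utf8", stdout=subprocess.PIPE
    )
    if result.returncode != 0:
        # Report the failure but keep going, like the runner in the diff.
        print("failed:", cmd, file=sys.stderr)
    return normalize(result.stdout)


print(run_cli(ECHO_CMD, "<p>ex\u00adample</p>"))
```

Feeding the document through `input=` avoids the create/write/delete cycle of the old temporary-file approach. It does require the CLI to read from stdin, which the `cli.go` change in this commit provides.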
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+vendor
+go_domdistiller_cli
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+go_domdistiller_cli: cli.go go.mod go.sum
+	go build -o $@ ./cli.go

extractors/go_domdistiller/README.rst

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ To use the library I'm wrote a simple cli-module that reads the contents of the
 Installation
 ------------

-1. Install golang (I'm used version ``1.15.8``)
+1. Install golang 1.23+
 2. Go to the folder containing this file
 3. Build an executable file:


extractors/go_domdistiller/cli.go

Lines changed: 7 additions & 12 deletions
@@ -5,26 +5,21 @@ import (
 	"os"

 	distiller "github.com/markusmobius/go-domdistiller"
+	"golang.org/x/net/html"
 )

 func main() {
-	if len(os.Args) < 2 {
-		panic("Input file not provided in args")
-	}
-	if len(os.Args) > 2 {
-		panic("Args accept only one argument")
+	doc, err := html.Parse(os.Stdin)
+	if err != nil {
+		panic(err)
 	}
-	input := os.Args[1]

-	opts := &distiller.Options{
-		ExtractTextOnly: true,
+	article, err := distiller.Apply(doc, &distiller.Options{
 		SkipPagination: true,
-	}
-
-	article, err := distiller.ApplyForFile(input, opts)
+	})
 	if err != nil {
 		panic(err)
 	}

-	fmt.Print(article.HTML)
+	fmt.Print(article.Text)
 }

extractors/go_domdistiller/go.mod

Lines changed: 14 additions & 2 deletions
@@ -1,5 +1,17 @@
 module cli

-go 1.15
+go 1.23

-require github.com/markusmobius/go-domdistiller v0.0.0-20201222130639-1c90a88d11c2
+require github.com/markusmobius/go-domdistiller v0.0.0-20240926050704-25b8d046ffb4
+
+require (
+	github.com/andybalholm/cascadia v1.3.2 // indirect
+	github.com/go-shiori/dom v0.0.0-20230515143342-73569d674e1c // indirect
+	github.com/gogs/chardet v0.0.0-20211120154057-b7413eaefb8f // indirect
+	github.com/mattn/go-colorable v0.1.13 // indirect
+	github.com/mattn/go-isatty v0.0.20 // indirect
+	github.com/rs/zerolog v1.33.0 // indirect
+	golang.org/x/net v0.29.0 // indirect
+	golang.org/x/sys v0.25.0 // indirect
+	golang.org/x/text v0.18.0 // indirect
+)
