@@ -6,13 +6,15 @@ extraction for commercial services
 `Zyte Automatic Extraction (ours) <https://www.zyte.com/data-types/news-scraping-api/>`_,
 `Diffbot <https://www.diffbot.com/>`_
 and open-source libraries
-`newspaper3k <https://newspaper.readthedocs.io/en/latest/>`_,
+`newspaper4k <https://github.com/AndyTheFactory/newspaper4k>`_,
 `readability-lxml <https://github.com/buriy/python-readability>`_,
 `dragnet <https://github.com/dragnet-org/dragnet>`_,
 `boilerpipe <https://github.com/misja/python-boilerpipe>`_,
 `html-text <https://github.com/TeamHG-Memex/html-text>`_,
 `trafilatura <https://github.com/adbar/trafilatura>`_,
+`go-trafilatura <https://github.com/markusmobius/go-trafilatura>`_,
 `go-readability <https://github.com/go-shiori/go-readability>`_,
+`readeck/go-readability <https://codeberg.org/readeck/go-readability>`_,
 `Readability.js <https://github.com/mozilla/readability>`_,
 `Go-DomDistiller <https://github.com/markusmobius/go-domdistiller>`_.
 `news-please <https://github.com/fhamborg/news-please>`_.
@@ -47,7 +49,25 @@ Results of the initial evaluation, done in November 2019::
     readability-lxml      0.7.1       0.922 ± 0.014   0.913 ± 0.014   0.931 ± 0.016   0.315 ± 0.035
     xpath-text            4.4.2       0.394 ± 0.020   0.246 ± 0.016   0.992 ± 0.001   0.000 ± 0.000
 
-Result of packages added after original evaluation::
+Results of the latest evaluation with open source libraries added::
+
+                          version     F1              precision       recall          accuracy
+    go-trafilatura        ae7ea06     0.960 ± 0.007   0.940 ± 0.009   0.980 ± 0.006   0.287 ± 0.033
+    trafilatura           2.0.0       0.958 ± 0.006   0.938 ± 0.009   0.978 ± 0.006   0.293 ± 0.033
+    newspaper4k           0.9.3.1     0.949 ± 0.008   0.964 ± 0.008   0.934 ± 0.011   0.326 ± 0.033
+    news_please           1.6.16      0.948 ± 0.008   0.964 ± 0.008   0.933 ± 0.011   0.326 ± 0.034
+    readability_js        0.6.0       0.947 ± 0.005   0.914 ± 0.008   0.982 ± 0.003   0.166 ± 0.028
+    go_readability_fork   fb0fbc5     0.947 ± 0.005   0.914 ± 0.008   0.982 ± 0.003   0.166 ± 0.028
+    go_readability        9f5bf5c     0.934 ± 0.009   0.900 ± 0.011   0.971 ± 0.009   0.188 ± 0.029
+    go_domdistiller       25b8d04     0.927 ± 0.007   0.901 ± 0.010   0.956 ± 0.009   0.061 ± 0.017
+    readability-lxml      0.8.4.1     0.922 ± 0.013   0.913 ± 0.014   0.931 ± 0.015   0.315 ± 0.034
+    goose3                3.1.20      0.896 ± 0.015   0.940 ± 0.013   0.856 ± 0.020   0.232 ± 0.031
+    beautifulsoup         4.13.5      0.860 ± 0.016   0.850 ± 0.016   0.870 ± 0.020   0.006 ± 0.006
+    justext               3.0.2       0.804 ± 0.018   0.858 ± 0.016   0.756 ± 0.027   0.088 ± 0.021
+    inscriptis            2.6.0       0.679 ± 0.015   0.517 ± 0.018   0.992 ± 0.001   0.000 ± 0.000
+    html2text             2025.4.15   0.662 ± 0.015   0.499 ± 0.017   0.983 ± 0.002   0.000 ± 0.000
+
+Earlier results from April 2021::
 
                           version     F1              precision       recall          accuracy
     trafilatura           0.5.1       0.945 ± 0.009   0.925 ± 0.011   0.966 ± 0.009   0.221 ± 0.031
@@ -108,15 +128,17 @@ In addition to benchmarking AutoExtract and Diffbot services, we also benchmark
 open-source libraries that work directly on HTML files without a need for rendering
 or external resources:
 
-- newspaper3k: https://github.com/codelucas/newspaper
+- newspaper4k: https://github.com/AndyTheFactory/newspaper4k
 - readability-lxml: https://github.com/buriy/python-readability
 - dragnet: https://github.com/dragnet-org/dragnet
 - boilerpipe: https://github.com/misja/python-boilerpipe
 - html-text: https://github.com/TeamHG-Memex/html-text -
   this is a baseline which extracts the full text of HTML page
 - trafilatura: https://github.com/adbar/trafilatura contributed by the author
   at https://github.com/scrapinghub/article-extraction-benchmark/pull/4
+- go-trafilatura: https://github.com/markusmobius/go-trafilatura
 - go-readability: https://github.com/go-shiori/go-readability
+- readeck/go-readability: https://codeberg.org/readeck/go-readability/src/branch/main/FORK.md
 - Readability.js: https://github.com/mozilla/readability
 - Go-DomDistiller: https://github.com/markusmobius/go-domdistiller
 - news-please: https://github.com/fhamborg/news-please
@@ -133,21 +155,14 @@ or external resources:
 Output from these libraries is already present in the repo in ``output/*.json`` files.
 They were generated with ``extractors/run_*.py`` files.
 
-All dependencies are in ``requirements.txt``.
-Note that dragnet may fail to install at first try, as
-you need to have ``numpy`` and ``Cython`` installed, and have ``libxml2`` headers
-(``libxml2-dev`` on Ubuntu).
+You can re-generate output JSON files with:
 
-boilerpipe requires a custom installation: use python2, you also need Java
-(e.g. install ``default-jre`` in Ubuntu), install it with
-``pip install -e git+https://github.com/misja/python-boilerpipe.git@ab3694d7bf695b73f0684a028e70aa816d63e6cb#egg=boilerpipe``
+    python3 -m venv ./venv
+    source ./venv/bin/activate
+    make run-all
 
-go-readability requires a custom installation: see README in ``extractors/go_readability``.
-
-Readability.js require a custom installation: install nodejs and install cli tool:
-``npm install -g [email protected]``
-
-Go-DomDistiller requires a custom installation: see README in ``extractors/go_domdistiller``.
+This will install Python dependencies from ``requirements.txt`` into a
+`virtual environment <https://docs.python.org/3/library/venv.html>`_
 
 Evaluation
 ----------
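
After re-generating the ``output/*.json`` files as described in the last hunk, it can help to confirm that every extractor produced a file of the expected size. The following is a minimal sketch, not part of the repository: the helper name ``summarize_outputs`` is hypothetical, and it assumes each output file holds a single top-level JSON object or array with one entry per page (the README text shown here does not state the schema).

```python
import json
from pathlib import Path


def summarize_outputs(output_dir="output"):
    """Return {file stem: number of top-level entries} for each *.json file.

    Hypothetical helper: assumes every file contains one JSON object or
    array whose length equals the number of extracted pages.
    """
    summary = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # len() works for both a JSON object (dict) and a JSON array (list).
        summary[path.stem] = len(data)
    return summary
```

Calling ``summarize_outputs()`` from the repository root and comparing the counts across extractors would flag a run that stopped early or wrote an empty file.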