Improved scraping: Automatic full article text retrieval #20

digitaldutch · 2024-09-30T15:59:07Z

Some media websites expose the full article text through the JSON-LD articleBody property. Most do not.
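For the sites that do publish it, reading articleBody could look roughly like the sketch below. It is written in Python with only the standard library; the class and function names (JsonLdCollector, find_article_body, extract_article_body) are made up for illustration, and it assumes the HTML has already been downloaded.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw contents of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # a list while inside a JSON-LD script, else None

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def find_article_body(node):
    """Depth-first search for the first non-empty string "articleBody" value."""
    if isinstance(node, dict):
        body = node.get("articleBody")
        if isinstance(body, str) and body.strip():
            return body
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_article_body(child)
        if found:
            return found
    return None


def extract_article_body(html):
    """Return the article text from JSON-LD metadata, or None if absent."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        body = find_article_body(data)
        if body:
            return body
    return None
```

A None result means the page needs the DOM-based extraction described further down.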

Current situation
Web scraping is now performed using headless Chromium to download the DOM. That DOM is parsed in PHP for the relevant tags.

What we want
Currently the user has to copy and paste the full article text manually. We want to automate this. It could be done using Playwright, Puppeteer, Selenium or another web scraping tool. After a quick scan, Playwright looks the easiest to install and use, and has the most modern API. Each website will need its own extraction rules, worked out by studying its DOM.
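A minimal sketch of what the Playwright route might look like in Python, assuming the playwright package and its Chromium build are installed. The hostnames and CSS selectors in SITE_SELECTORS are placeholders, not rules for real sites; each entry would come from studying that site's DOM.

```python
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> CSS selector of the article container.
SITE_SELECTORS = {
    "news.example.com": "article .article-body",
    "example.org": "div#story-text",
}
DEFAULT_SELECTOR = "article"  # assumed fallback when a site has no rule yet


def selector_for(url):
    """Look up the extraction selector for a URL, ignoring a leading "www."."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return SITE_SELECTORS.get(host, DEFAULT_SELECTOR)


def fetch_article_text(url, timeout_ms=30_000):
    """Load the page in headless Chromium and return the article's visible text."""
    # Imported lazily so selector_for() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            # .first avoids an error when the selector matches several elements.
            return page.locator(selector_for(url)).first.inner_text(timeout=timeout_ms)
        finally:
            browser.close()
```

Keeping the selectors in one table means adding a site is a one-line change rather than new code.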

Using the current headless Chromium to download the page, and then DOMDocument in PHP or Beautiful Soup in Python to extract the text from the HTML, might work too. This has the advantage that no extra tools need to be installed on the server.
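This alternative could be sketched as follows. To stay free of extra installs, the example uses Python's built-in html.parser instead of DOMDocument or Beautiful Soup, so it illustrates the approach rather than either library. The binary name "chromium" and the choice of an article container are assumptions and will differ per server and per site.

```python
import subprocess
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of <p> elements that sit inside a container tag."""

    def __init__(self, container="article"):
        super().__init__()
        self.container = container
        self.paragraphs = []
        self._depth = 0       # nesting depth of open container tags
        self._current = None  # text chunks of the <p> being read, else None

    def _flush(self):
        if self._current is not None:
            text = " ".join("".join(self._current).split())  # collapse whitespace
            if text:
                self.paragraphs.append(text)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == self.container:
            self._depth += 1
        elif tag == "p" and self._depth > 0:
            self._flush()  # tolerate a previous <p> that was never closed
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag == self.container and self._depth > 0:
            self._flush()
            self._depth -= 1

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def extract_paragraphs(html, container="article"):
    """Return the paragraphs found inside the given container element."""
    parser = ParagraphCollector(container)
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def dump_dom(url, chromium="chromium", timeout_s=60):
    """Return the rendered DOM as HTML using Chromium's --dump-dom flag."""
    result = subprocess.run(
        [chromium, "--headless", "--dump-dom", url],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return result.stdout
```

Typical use would be extract_paragraphs(dump_dom(url)), with the container argument adjusted per site.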

Additional task
Sometimes logging in is required. For sites where we have a subscription, we need automatic login using one of the tools above.
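With Playwright, one possible approach is to log in once, save the session with storage_state, and reuse it on later runs. The sketch below assumes a plain username and password form; the selector keys, the SCRAPER_USERNAME and SCRAPER_PASSWORD environment variables, and the helper names are all hypothetical. Sites with captchas or two-factor login would need more than this.

```python
import os
from pathlib import Path


def context_options(state_path):
    """Reuse a saved session when one exists, otherwise start clean."""
    path = Path(state_path)
    return {"storage_state": str(path)} if path.is_file() else {}


def login_and_save_state(login_url, selectors, state_path):
    """Fill in the login form and store cookies and local storage on disk."""
    # Imported lazily so context_options() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context()
            page = context.new_page()
            page.goto(login_url)
            # Credentials come from the environment, never from source code.
            page.fill(selectors["username"], os.environ["SCRAPER_USERNAME"])
            page.fill(selectors["password"], os.environ["SCRAPER_PASSWORD"])
            page.click(selectors["submit"])
            page.wait_for_load_state("networkidle")
            context.storage_state(path=state_path)
        finally:
            browser.close()
```

A later scraping run could then open its context with browser.new_context(**context_options("state.json")) and only call login_and_save_state again when the saved session has expired.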

digitaldutch changed the title ~~Full article text retrieval~~ Automatic full article text retrieval Sep 30, 2024

digitaldutch changed the title ~~Automatic full article text retrieval~~ Improved scraping: Automatic full article text retrieval Sep 30, 2024
