Some media websites provide the full article text in the JSON-LD articleBody property. Most do not.
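For the sites that do expose articleBody, a minimal sketch of reading it could look like the following. It uses only the Python standard library; the class and function names (`JsonLdCollector`, `find_article_body`, `extract_article_body`) are made up for illustration and are not part of the existing PHP code.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def find_article_body(node):
    """Depth-first search for the first non-empty string 'articleBody' value."""
    if isinstance(node, dict):
        body = node.get("articleBody")
        if isinstance(body, str) and body.strip():
            return body.strip()
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_article_body(child)
        if found:
            return found
    return None


def extract_article_body(html):
    """Return the article text from JSON-LD, or None if the page has none."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        body = find_article_body(data)
        if body:
            return body
    return None
```

The recursive search is there because articleBody may sit inside a nested structure such as an `@graph` array rather than at the top level. A `None` result signals that the page needs one of the fallback approaches discussed in this issue.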
Current situation
Web scraping is currently performed using headless Chromium to download the DOM. That DOM is then parsed in PHP for the relevant tags.
What we want
Currently the user has to copy and paste the full article text manually. We want to automate this task. This could be done using Playwright, Puppeteer, Selenium or another web scraping tool. After a quick scan, Playwright looks the easiest to install and use and has the most modern API. The extraction has to be tuned manually for every website by studying its DOM.
Using the current headless Chromium to download the page and DOMDocument in PHP or Beautiful Soup in Python to extract the text from the HTML might work too. This has the advantage that no extra tools need to be installed on the server.
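A rough sketch of this approach, assuming the HTML has already been downloaded by the existing headless Chromium step. To keep it dependency-free it uses Python's built-in `html.parser` instead of Beautiful Soup. The `SITE_RULES` table, the host name and the class name in it are hypothetical placeholders for the per-site rules that would come from studying each site's DOM.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: host -> (container tag, container class).
SITE_RULES = {
    "news.example.com": ("div", "article-content"),
}


class ContainerTextParser(HTMLParser):
    """Collects <p> text inside the first element matching tag + class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # nesting depth of `tag`, counted from the container
        self.in_paragraph = False
        self.done = False
        self.paragraphs = []
        self._current = []

    def _flush(self):
        # Collapse whitespace and store the paragraph if it is not empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.tag and self.css_class in classes:
                self.depth = 1
        elif tag == self.tag:
            self.depth += 1
        elif tag == "p":
            self._flush()  # also covers a previous <p> that was never closed
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "p":
            self._flush()
            self.in_paragraph = False
        elif tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self._flush()
                self.in_paragraph = False
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and self.in_paragraph:
            self._current.append(data)


def extract_text(html, host):
    """Return the article paragraphs for a known host, or None."""
    rule = SITE_RULES.get(host)
    if rule is None:
        return None
    parser = ContainerTextParser(*rule)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs) or None
```

With Beautiful Soup the parser class would shrink to a single selector lookup, at the cost of one extra Python package on the server.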
Additional task
Some sites require logging in. For sites where we have a subscription, the login should be automated using one of the tools above.
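If Playwright is chosen, an automated login could be sketched roughly as below, using Playwright's synchronous Python API. The `LoginRule` fields and the function name are hypothetical. The selectors and login URL would have to be found per site by studying its login form, and credentials should come from configuration outside the repository. Sites with multi-step logins or CAPTCHAs are not covered by this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginRule:
    """Hypothetical per-site login description."""
    login_url: str
    user_selector: str
    password_selector: str
    submit_selector: str


def fetch_html_logged_in(rule, username, password, article_url):
    """Log in with a simple username/password form, then return the article HTML."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(rule.login_url)
            page.fill(rule.user_selector, username)
            page.fill(rule.password_selector, password)
            page.click(rule.submit_selector)
            # Assumption: the site settles after login; some sites need a
            # more specific wait, e.g. for an element that only appears
            # once the user is logged in.
            page.wait_for_load_state("networkidle")
            page.goto(article_url)
            return page.content()
        finally:
            browser.close()
```

The returned HTML can then be passed to whatever parsing step is used for sites that do not require a login.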
digitaldutch changed the title from "Full article text retrieval" to "Automatic full article text retrieval" on Sep 30, 2024
digitaldutch changed the title from "Automatic full article text retrieval" to "Improved scraping: Automatic full article text retrieval" on Sep 30, 2024