Improved scraping: Automatic full article text retrieval #20

Open
digitaldutch opened this issue Sep 30, 2024 · 0 comments

Some media websites provide the full article text in the JSON-LD articleBody property. Most do not.
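For sites that do expose it, reading articleBody needs no site-specific tweaking. The following is a minimal sketch, assuming the page embeds its JSON-LD in a `<script type="application/ld+json">` block and that the HTML has already been downloaded. The names `JsonLdCollector`, `find_article_body` and `extract_article_body` are made up for illustration and are not part of the project.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_article_body(node):
    """Depth-first search for the first non-empty articleBody string."""
    if isinstance(node, dict):
        body = node.get("articleBody")
        if isinstance(body, str) and body.strip():
            return body.strip()
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_article_body(child)
        if found:
            return found
    return None


def extract_article_body(html):
    """Return the articleBody text from the page's JSON-LD, or None if absent."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        body = find_article_body(data)
        if body:
            return body
    return None
```

The recursive search is there because sites nest the article object differently, for example inside an `@graph` array. A `None` result is the signal to fall back to DOM-based extraction.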

Current situation
Web scraping is now performed using headless Chromium to download the DOM. That DOM is parsed in PHP for the relevant tags.

What we want
When a site does not provide articleBody, the user currently has to copy and paste the full article text manually. We want to automate this. It could be done with Playwright, Puppeteer, Selenium or another web scraping tool. After a quick scan, Playwright looks the easiest to install and use, and has the most modern API. Whichever tool is used, extraction has to be tweaked manually for every website by studying its DOM.
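As a rough illustration of the Playwright route, the sketch below keeps the per-site tweaking in a table of CSS selectors and uses Playwright's synchronous Python API to read the matching element. The hostnames, the selectors and the function names `selector_for` and `fetch_article_text` are all hypothetical. It assumes Playwright and its Chromium build are installed on the server.

```python
from urllib.parse import urlparse

# Hypothetical per-site CSS selectors, found by inspecting each site's DOM.
ARTICLE_SELECTORS = {
    "news.example.com": "div.article-body",
    "paper.example.org": "article .story-content",
}
FALLBACK_SELECTOR = "article"


def selector_for(url):
    """Return the CSS selector configured for the URL's host."""
    return ARTICLE_SELECTORS.get(urlparse(url).hostname, FALLBACK_SELECTOR)


def fetch_article_text(url, timeout_ms=30_000):
    """Load the page in headless Chromium and return the article text, or None."""
    # Imported here so the selector lookup works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            container = page.locator(selector_for(url)).first
            if container.count() == 0:
                return None  # selector did not match; the site needs tweaking
            return container.inner_text().strip()
        finally:
            browser.close()
```

Keeping the selectors in one table means adding a site is a one-line change once its DOM has been studied.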

Alternatively, keeping the current headless Chromium to download the page and using DOMDocument in PHP or Beautiful Soup in Python to extract the text from the HTML might work too. This has the advantage that no extra browser automation tools need to be installed on the server.
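A minimal sketch of this parse-only approach follows. To stay dependency-free it uses Python's standard-library `html.parser` rather than Beautiful Soup, so it is a stand-in for the idea and not the Beautiful Soup API. The rule table, the hostname and the names `ContainerTextParser` and `extract_text` are hypothetical. The HTML is assumed to come from the existing headless Chromium download.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> (container tag, container class).
SITE_RULES = {"news.example.com": ("div", "article-body")}
DEFAULT_RULE = ("article", None)
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote"}


class ContainerTextParser(HTMLParser):
    """Collects paragraph text found inside one container element."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # nesting depth of `tag` within the container; 0 = outside
        self.skip = 0   # > 0 while inside <script> or <style>
        self.current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if not self.depth:
            if tag == self.tag:
                classes = (dict(attrs).get("class") or "").split()
                if self.css_class is None or self.css_class in classes:
                    self.depth = 1
            return
        if tag == self.tag:
            self.depth += 1
        elif tag in ("script", "style"):
            self.skip += 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        elif tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self._flush()
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.current.append(data)

    def _flush(self):
        # Collapse whitespace and store the buffered text as one paragraph.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []


def extract_text(html, url):
    """Return the article text for `url`, using that site's container rule."""
    tag, css_class = SITE_RULES.get(urlparse(url).hostname, DEFAULT_RULE)
    parser = ContainerTextParser(tag, css_class)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Text outside the container (navigation, footers) and script contents inside it are ignored. With Beautiful Soup the same rule table could drive a CSS selector lookup instead.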

Additional task
Some sites require logging in to read the full article. For sites where we have a subscription, the login needs to be automated using one of the tools above.
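One possible shape for the login step, again sketched with Playwright's synchronous Python API: log in once, save the browser context's storage state (cookies and local storage) to a file, and reuse it on later runs. The environment variable names, the form selectors, the `state` directory and the function names are assumptions for illustration; every site's login form differs, and some sites add CAPTCHAs or two-factor steps that this does not handle.

```python
import os
from pathlib import Path


def load_credentials(site_key, env=os.environ):
    """Read credentials from <SITE>_USERNAME / <SITE>_PASSWORD (hypothetical names)."""
    prefix = site_key.upper()
    username = env.get(f"{prefix}_USERNAME")
    password = env.get(f"{prefix}_PASSWORD")
    if not username or not password:
        raise KeyError(f"missing {prefix}_USERNAME or {prefix}_PASSWORD")
    return username, password


def open_logged_in_context(browser, site_key, login_url, state_dir="state"):
    """Return a Playwright browser context that is logged in to the site."""
    state_file = Path(state_dir) / f"{site_key}.json"
    if state_file.exists():
        # Reuse cookies and local storage saved by an earlier login.
        return browser.new_context(storage_state=str(state_file))

    username, password = load_credentials(site_key)
    context = browser.new_context()
    page = context.new_page()
    page.goto(login_url)
    # Hypothetical selectors; adjust per site after studying its login form.
    page.fill("input[name='username']", username)
    page.fill("input[name='password']", password)
    page.click("button[type='submit']")
    page.wait_for_load_state("networkidle")
    state_file.parent.mkdir(parents=True, exist_ok=True)
    context.storage_state(path=str(state_file))
    page.close()
    return context
```

Here `browser` is expected to be a Playwright `Browser` object, for example from `p.chromium.launch()`. Reading credentials from the environment keeps them out of the repository, and the saved state file should be treated as a secret too.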

@digitaldutch digitaldutch changed the title Full article text retrieval Automatic full article text retrieval Sep 30, 2024
@digitaldutch digitaldutch changed the title Automatic full article text retrieval Improved scraping: Automatic full article text retrieval Sep 30, 2024