Srcset images not being archived #243

Open
JubilantJerry opened this issue Dec 11, 2024 · 0 comments

grab-site does not seem to archive the image URLs listed in the srcset attributes of <picture>/<source> elements on a Substack blog that I tried the tool on. I believe this may be an issue in wpull.
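
To make concrete which URLs are meant, here is a minimal sketch of collecting srcset candidates from a page using only the Python standard library. This is not wpull's code: the names parse_srcset and collect_srcset_urls, and the example.com markup, are hypothetical and only for illustration. The parser is a simplified version of the HTML srcset rules (it ignores parentheses in descriptors), and it deliberately does not split on every comma, since URLs themselves may contain commas.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def parse_srcset(value):
    """Return the URLs in a srcset value (simplified HTML rules)."""
    urls = []
    pos, n = 0, len(value)
    while pos < n:
        # Skip whitespace and commas between candidates.
        while pos < n and (value[pos].isspace() or value[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # The URL is the next run of non-whitespace characters.
        start = pos
        while pos < n and not value[pos].isspace():
            pos += 1
        url = value[start:pos]
        if url.endswith(","):
            # No descriptor: trailing commas terminate the candidate.
            url = url.rstrip(",")
        else:
            # Skip the descriptor (e.g. "424w") up to the next comma.
            while pos < n and value[pos] != ",":
                pos += 1
        if url:
            urls.append(url)
    return urls


class _SrcsetCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # srcset appears on <img> and on <source> inside <picture>.
        if tag not in ("img", "source"):
            return
        for name, value in attrs:
            if name == "srcset" and value:
                for url in parse_srcset(value):
                    self.urls.append(urljoin(self.base_url, url))


def collect_srcset_urls(html, base_url):
    """Return absolute URLs of all srcset candidates in the markup."""
    collector = _SrcsetCollector(base_url)
    collector.feed(html)
    return collector.urls
```

Running collect_srcset_urls over the saved HTML of the post and comparing the result with the URLs present in the WARC would show which srcset candidates were never fetched.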

Step-by-step reproduction instructions

First I run:

grab-site --level=2 --concurrency=20 --page-requisites-level=2 --import-ignores=$(pwd)/ignores 'https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting' 'https://substackcdn.com/bundle/assets/store.modern-3dec36e9.js' 'https://substack-post-media.s3.amazonaws.com/public/images/4206cf36-9fcc-4b06-95e1-d751f9f4c3b7_388x388.jpeg'

I include the other two URLs so that their domains are not treated as "offsite".

The contents of the ignores file are:

platform.openai.com
reddit.com
discord.com
discordapp.com
^https?://[^p][^.]+.substack.com
shopify.com
^https://static.airtable.com/esbuild/by_sha
https://promptingweekly.substack.com/account\?utm_medium=web&utm_source=subscribe-widget
https://promptingweekly.substack.com/p/[^?/]+\?utm_source=substack&utm_medium=email&utm_content=share&action=share&token=
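
To rule out the ignores file as the cause, the patterns can be checked against a URL with a short script. This is a sketch under the assumption that each line is applied as a Python regular expression with search semantics (a match anywhere in the URL); I have not confirmed that this is exactly how grab-site evaluates them. The helper name is_ignored is hypothetical.

```python
import re

# Patterns copied verbatim from the ignores file above.
IGNORE_PATTERNS = [
    r"platform.openai.com",
    r"reddit.com",
    r"discord.com",
    r"discordapp.com",
    r"^https?://[^p][^.]+.substack.com",
    r"shopify.com",
    r"^https://static.airtable.com/esbuild/by_sha",
    r"https://promptingweekly.substack.com/account\?utm_medium=web&utm_source=subscribe-widget",
    r"https://promptingweekly.substack.com/p/[^?/]+\?utm_source=substack&utm_medium=email&utm_content=share&action=share&token=",
]


def is_ignored(url):
    """Return True if any ignore pattern matches the URL (assumed re.search semantics)."""
    return any(re.search(pattern, url) for pattern in IGNORE_PATTERNS)


if __name__ == "__main__":
    seeds = [
        "https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting",
        "https://substackcdn.com/bundle/assets/store.modern-3dec36e9.js",
        "https://substack-post-media.s3.amazonaws.com/public/images/4206cf36-9fcc-4b06-95e1-d751f9f4c3b7_388x388.jpeg",
    ]
    for seed in seeds:
        print(seed, "ignored" if is_ignored(seed) else "allowed")
```

Under that assumption, none of the three URLs passed on the command line match any pattern, so the ignores file should not be what excludes images served from those hosts. The same check can be applied to any individual missing image URL.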

Then I open the archive using ReplayWeb.page-2.2.4.AppImage, and navigate to the page: https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting

You can download the WARC here: https://drive.google.com/file/d/1fJuWwgSTVfh9IdD47RC2lw67tWSryG4S/view?usp=sharing

Appearance of replayed page

Several images on the page display immediately when opening the live site. However, after archiving the page with grab-site and replaying it with ReplayWeb.page, those images do not load and instead appear as broken images or blank spaces.

Archived:
image

Live site:
image

Archived:
image

Live site:
image

The same issues are observed with pywb.

In addition, some scripts don't work properly. When navigating to the previous or next blog page, ReplayWeb.page will first display a page saying "Post not found". Refreshing the page will make it load properly (but still with the missing images).

My belief is that both the missing images and the script errors are caused by missing files in the crawl.

Additional details

I run Ubuntu 20.04 LTS.
