Global deduplication for specific URLs #443

JustAnotherArchivist · 2020-05-20T02:36:04Z

While global deduplication for everything in ArchiveBot is not feasible, we should consider adding something for certain URLs that waste a lot of disk space, probably shouldn't be ignored entirely, but are regrabbed needlessly and repeatedly. Two examples come to mind:

CBC radio recordings/podcasts: ignores ^https?://mp3\.cbc\.ca/ and ^https?://podcast-a\.akamaihd\.net/mp3/ (pending further investigation whether the latter also has non-CBC content)
Fast Company videos: ignore ^https?://content\.jwplatform\.com/videos/

Currently, these ignores are typically manually added when someone sees it. I know we've grabbed some of those URLs thousands of times, but others were never covered before. Because the contents on these hosts don't change with time, ignoring them if they've ever been grabbed before by some AB job should be fine. However, job starting URLs should not be checked against the dedupe list so that they can be saved again if needed – specifically, this means that URL table entries with level = 0 would always be retrieved.

An implementation would probably keep the dedupe DB and the list of URL patterns to be checked against it on the control node. The latter is pushed to the pipelines (and updated if it changes), then the pipeline queries the DB on encountering a matching URL. TBD is whether the pipeline should be able to directly add entries to the DB or whether they should come from the CDXs in the AB collection. The latter is more trustworthy (and also covers the unfortunate case when archives are lost between retrieval and IA upload) but adds a delay which can still lead to repeated retrieval. Alternatively, pipelines could add a temporary entry which gets dropped after a few days if it isn't confirmed by the CDXs.

The text was updated successfully, but these errors were encountered:

JustAnotherArchivist · 2020-06-06T18:47:35Z

New York Times videos: ^https?://vp\.nyt\.com/ and ^https?://video1\.nytimes\.com/
Videos hosted on JW Player: ^https?://cdn\.jwplayer\.com/videos/ (Note, these are different from the content.jwplatform.com one above. cdn.jwplayer.com serves various things by customers, content.jwplatform.com seems to only have FastCo videos.)

An alternative solution would be to dedupe based on data type or size, but that would require a new download every time and might slow down some crawls massively. If we go down this road, we should write revisit records for those; wpull already has support for that, it would just have to be activated and the remote calls implemented through a custom URLTable.

JustAnotherArchivist · 2020-11-07T02:21:57Z

Washington Post videos: ^https?://d21rhj7n383afu\.cloudfront\.net/washpost-production/ and ^https?://videos\.posttv\.com/washpost-production/
USA Today: ^https?://videos\.usatoday\.net/Brightcove2/
AnyClip: ^https?://cdn([1-9]|1\d|20)\.anyclip\.com/.*\.mp4$
NYT, rarer than the above: ^https?://int\.nyt\.com/data/videotape/finished/.*\.mp4$ (e.g. job 2s9go7u96uf6eyhtlpemdpouu)
Wall Street Journal: ^https?://m\.wsj\.net/video/.*\.mp4$
ESPN: ^https?://media\.video-cdn\.espn\.com/.*\.mp4$ and ^https?://media\.video-origin\.espn\.com/.*\.mp4$
IGN: ^https?://assets\d+\.ign\.com/videos/zencoder/.*\.mp4$ and ^https?://s3\.amazonaws\.com/o\.videoarchive\.ign\.com/.*\.mp4$
MLB: ^https?://(cuts\.diamond|mlb-cuts-diamond)\.mlb\.com/FORGE/.*\.mp4$

JustAnotherArchivist · 2020-11-12T21:58:50Z

Alternative for a proper global dedupe (which would likely require changes in wpull because the URLTable methods aren't async): special ignores that send the URL to a logger. Then we regularly dedupe what the logger receives and run those URLs separately in !ao < jobs, e.g. weekly or monthly (automated).

This igset is only intended as a temporary workaround until ArchiveTeam#443 is implemented properly. Does not include the JW Player customer videos as those are not as frequent as the FastCo ones.

JustAnotherArchivist added enhancement backend pipeline labels May 20, 2020

JustAnotherArchivist mentioned this issue Nov 14, 2020

Add (temporary) badvideos igset #479

Merged

This was referenced Jan 24, 2021

Add IGN and MLB to badvideos #494

Merged

Add AnyClip to badvideos #480

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Global deduplication for specific URLs #443

Global deduplication for specific URLs #443

JustAnotherArchivist commented May 20, 2020

JustAnotherArchivist commented Jun 6, 2020 •

edited

Loading

JustAnotherArchivist commented Nov 7, 2020 •

edited

Loading

JustAnotherArchivist commented Nov 12, 2020

Global deduplication for specific URLs #443

Global deduplication for specific URLs #443

Comments

JustAnotherArchivist commented May 20, 2020

JustAnotherArchivist commented Jun 6, 2020 • edited Loading

JustAnotherArchivist commented Nov 7, 2020 • edited Loading

JustAnotherArchivist commented Nov 12, 2020

JustAnotherArchivist commented Jun 6, 2020 •

edited

Loading

JustAnotherArchivist commented Nov 7, 2020 •

edited

Loading