[Feature] Fetch metadata as GoogleBot #4621

kroese · 2024-04-13T09:26:47Z

Requirements

Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support
Did you check to see if this issue already exists?
Is this only a feature request? Do not put multiple feature requests in one issue.
Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.
Do you agree to follow the rules in our Code of Conduct?

Is your proposal related to a problem?

A lot of newspaper websites show a cookie-wall, which prevents the OpenGraph metadata from being fetched in the link previews.

After I contacted some of them, they told me to just use the GoogleBot user-agent when fetching metadata from forum software. (Strange advice from a large company, but this was their official solution.)

So after modifying the Lemmy source to identify as Google for scraping metadata, things worked fine for those newspapers.

But I noticed that it stopped working for a couple of other websites. For example, links to lemmy.world show a Cloudflare error when using GoogleBot, probably because of some bot protection mechanism. Other websites stopped working because they present a different page to Google, without any metadata.

So unfortunately, switching to GoogleBot only fixes the issue on some domains and creates an issue on others.

Describe the solution you'd like.

It would be really nice if, when fetching the metadata with the Lemmy user-agent fails, Lemmy retried one more time using GoogleBot.

Describe alternatives you've considered.

There is no alternative

Additional context

No response

Nutomic · 2024-04-15T10:34:37Z

This sounds like a very specific use case which would be rather complicated to implement. Maybe best to do it via an extension.

kroese · 2024-04-16T12:02:49Z

The actual change is just a single line of code in my fork. Making it configurable is the part that is difficult to implement.

So maybe it's better to just have a fixed list in the code of domains that don't work without GoogleBot, as that would be much simpler.

A good side effect could be that when people find domains that don't work with Lemmy, they have to open a pull request to extend the global list instead of just adding them to their local list. This way other instances benefit from it too.

dessalines · 2024-04-17T14:33:48Z

We could just add an optional custom_metadata_fetcher_user_agent to the config hjson. We could go as complicated as per domain, but I doubt that's necessary, as long as we limit it to metadata fetching only.

kroese · 2024-04-17T14:58:44Z

@dessalines As described earlier, that won't work. Some domains need GoogleBot, otherwise you are redirected to their cookie-wall, and other domains refuse requests from GoogleBot (for example, the Cloudflare protection on lemmy.world denies the request).

So a single user-agent for metadata fetching will not work. That's why we need a list somewhere, and whether that list is hard-coded or configurable is not really important to me.

dessalines · 2024-04-17T15:16:58Z

In that case you could add a config to crates/utils/src/settings/structs.rs that looks something like:

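use url::Url;

// Maps one domain to the user agent to send when fetching its metadata.
struct DomainAndUserAgent {
  domain: Url,
  user_agent: String,
}

// Per-domain user-agent overrides for the metadata fetcher.
struct MetadataFetcherUserAgent {
  domain_and_user_agents: Vec<DomainAndUserAgent>,
}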
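A minimal sketch of how such a list could be consulted at fetch time. This is only an illustration, not Lemmy's actual code: the function name `user_agent_for` is hypothetical, the domain is stored as a plain `String` host name instead of `Url` so the example needs no external crate, and matching subdomains as well as exact hosts is an assumption.

```rust
/// One override entry. `domain` is a bare host name here (assumption),
/// not a full `Url` as in the config sketch above.
struct DomainAndUserAgent {
    domain: String,
    user_agent: String,
}

/// Returns the override user agent if `host` equals a listed domain or is a
/// subdomain of it; otherwise returns `default_ua`. (Hypothetical helper.)
fn user_agent_for<'a>(
    host: &str,
    overrides: &'a [DomainAndUserAgent],
    default_ua: &'a str,
) -> &'a str {
    // Host names are case-insensitive, so compare in lowercase.
    let host = host.to_ascii_lowercase();
    overrides
        .iter()
        .find(|o| {
            let domain = o.domain.to_ascii_lowercase();
            // The leading dot prevents "fakenews.example" matching "news.example".
            host == domain || host.ends_with(&format!(".{}", domain))
        })
        .map(|o| o.user_agent.as_str())
        .unwrap_or(default_ua)
}
```

With this shape, a domain that is not in the list (lemmy.world, for instance) keeps getting the default user agent, so the Cloudflare problem described above would not be triggered.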

dullbananas · 2024-04-20T02:48:57Z

If requests are attempted with both user agents, would it be possible to automatically determine which response to use?

kroese · 2024-04-20T07:36:21Z

@dullbananas Yes, by checking whether the response contains OpenGraph tags.
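A rough sketch of what that check could look like. The helper names `has_open_graph_tags` and `pick_response` are hypothetical, and the substring test is deliberately naive (a real implementation would more likely parse the HTML); preferring the response fetched with the default user agent is also an assumption.

```rust
/// Returns true if the HTML appears to contain at least one OpenGraph meta
/// tag. Naive substring check, not an HTML parser (hypothetical helper).
fn has_open_graph_tags(html: &str) -> bool {
    let lower = html.to_ascii_lowercase();
    lower.contains("property=\"og:") || lower.contains("property='og:")
}

/// Given the bodies fetched with the default user agent and with GoogleBot
/// (`None` if a request failed), returns the first one that carries
/// OpenGraph tags, preferring the default user agent's response.
fn pick_response<'a>(
    default_body: Option<&'a str>,
    googlebot_body: Option<&'a str>,
) -> Option<&'a str> {
    default_body
        .filter(|body| has_open_graph_tags(body))
        .or_else(|| googlebot_body.filter(|body| has_open_graph_tags(body)))
}
```

If neither response contains OpenGraph tags, `pick_response` returns `None` and the link would simply get no preview.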

robrwo · 2024-09-16T09:14:02Z

FWIW, a website that I maintain blocks fake user agents, e.g. things that claim to be Googlebot when they are not coming from Google's networks. (The site shows OpenGraph data to all user agents, though.)

kroese · 2024-09-23T16:26:46Z

I just realized the solution hinted at by @dullbananas would be so much easier.

Instead of keeping a list of which domains need GoogleBot, Lemmy could just automatically try GoogleBot for every domain that fails to return metadata using the Lemmy user-agent.

That way there is no need to keep any lists. I have updated the feature request accordingly.
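A minimal sketch of that retry logic, under stated assumptions: the `LinkMetadata` struct, the `fetch_metadata_with_fallback` function, and the `LEMMY_UA` value are placeholders for illustration, not Lemmy's real types or user-agent string, and the actual HTTP request is abstracted behind a closure so the example stays self-contained.

```rust
// Placeholder for the instance's normal user agent (assumption).
const LEMMY_UA: &str = "Lemmy/0.0.0; +https://example.com";
// User-agent string published by Google for its crawler.
const GOOGLEBOT_UA: &str =
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

/// Minimal stand-in for the metadata extracted from a page (hypothetical).
#[derive(Debug, PartialEq)]
struct LinkMetadata {
    title: Option<String>,
}

impl LinkMetadata {
    fn is_empty(&self) -> bool {
        self.title.is_none()
    }
}

/// Tries the default user agent first. Only if that attempt fails or yields
/// no metadata does it retry exactly once as GoogleBot.
/// `fetch(url, user_agent)` performs the request and returns `None` on error.
fn fetch_metadata_with_fallback<F>(url: &str, mut fetch: F) -> Option<LinkMetadata>
where
    F: FnMut(&str, &str) -> Option<LinkMetadata>,
{
    match fetch(url, LEMMY_UA) {
        Some(meta) if !meta.is_empty() => Some(meta),
        _ => fetch(url, GOOGLEBOT_UA).filter(|meta| !meta.is_empty()),
    }
}
```

Because the GoogleBot request is only sent after the normal one came back empty, sites that reject fake GoogleBot user agents but serve OpenGraph data to everyone would never see the second request.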

kroese added the enhancement New feature or request label Apr 13, 2024

kroese changed the title ~~[Feature] Override user-agent per domain~~ [Feature] Fetch metadata as GoogleBot Sep 23, 2024
