[Feature] Fetch metadata as GoogleBot #4621
Comments
This sounds like a very specific use case which would be rather complicated to implement. Maybe best to do it via an extension.
The actual change is just a single line of code in my fork. Making it configurable is the part that is difficult to implement. So maybe it's better to just have a fixed list in the code of domains that don't work without `GoogleBot`. A good side effect could be that when people find domains that don't work with Lemmy, they are forced to open a pull request to extend the global list, instead of just adding them to their local list. This way other instances will benefit from it too.
We could just add an optional |
@dessalines As described earlier, that won't work. Some domains need `GoogleBot` and others break with it, so a single user-agent for metadata fetching will not work. That's why we need a list somewhere, and whether that list is hard-coded or configurable is not really important to me.
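In that case you could add a config like this:

```rust
use url::Url; // assuming `Url` is the `url` crate's type

/// Pairs a domain with the user agent to send when fetching its metadata.
struct DomainAndUserAgent {
  domain: Url,
  user_agent: String,
}

/// Per-domain user agent overrides for the metadata fetcher.
struct MetadataFetcherUserAgent {
  domain_and_user_agents: Vec<DomainAndUserAgent>,
}
```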
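As a rough sketch of how such a per-domain list might be consulted, the lookup below returns the override for a host if one is listed and falls back to a default otherwise. The function name `user_agent_for`, the use of a plain `String` for the domain (to avoid external crates), and the example hosts are all assumptions made for illustration, not part of Lemmy.

```rust
/// Hypothetical per-domain override. The domain is a plain `String` here
/// (not `Url`) so the sketch compiles with the standard library alone.
struct DomainAndUserAgent {
    domain: String,
    user_agent: String,
}

/// Returns the user agent configured for `host`, or `default_agent`
/// when the host has no entry in the override list.
fn user_agent_for<'a>(
    overrides: &'a [DomainAndUserAgent],
    host: &str,
    default_agent: &'a str,
) -> &'a str {
    overrides
        .iter()
        // Host names are case-insensitive, so compare without regard to case.
        .find(|entry| entry.domain.eq_ignore_ascii_case(host))
        .map(|entry| entry.user_agent.as_str())
        .unwrap_or(default_agent)
}
```

With one entry mapping a newspaper host to `GoogleBot`, every other host would still be fetched with the default `Lemmy` user agent.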
If requests are attempted with both user agents, would it be possible to automatically determine which response to use?
@dullbananas Yes, by checking whether the response contains OpenGraph tags.
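A minimal sketch of that check, assuming both responses are already available as HTML strings. The substring test is deliberately naive (a real implementation would presumably parse the HTML), and the names `contains_open_graph` and `pick_response` are hypothetical.

```rust
/// Naive check: does the HTML contain at least one OpenGraph meta property?
/// This only searches for the attribute text and does not parse the document.
fn contains_open_graph(html: &str) -> bool {
    let lower = html.to_ascii_lowercase();
    lower.contains("property=\"og:") || lower.contains("property='og:")
}

/// Picks the response that carries OpenGraph tags, preferring the one
/// fetched with the default user agent. Returns `None` if neither has any.
fn pick_response<'a>(default_html: &'a str, googlebot_html: &'a str) -> Option<&'a str> {
    if contains_open_graph(default_html) {
        Some(default_html)
    } else if contains_open_graph(googlebot_html) {
        Some(googlebot_html)
    } else {
        None
    }
}
```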
FWIW, a website that I maintain blocks fake user agents, e.g. things that claim to be Googlebot when they are not coming from Google's networks. (The site shows OpenGraph data to all user agents, though.)
I just realized the solution hinted at by @dullbananas would be so much easier. Instead of keeping a list of which domains need `GoogleBot`, Lemmy can simply retry with `GoogleBot` when the first fetch fails. That way there is no need to keep any lists. I have now modified the feature request accordingly.
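The retry idea could look roughly like the sketch below. It is not Lemmy's implementation: the HTTP request is abstracted behind a caller-supplied closure so the example stays self-contained, "failure" is assumed to mean "no response or no OpenGraph tags", and the names `has_og_tags` and `fetch_with_fallback` are made up for illustration.

```rust
/// Naive OpenGraph detection by substring search (assumption: good enough
/// for a sketch; real code would parse the HTML).
fn has_og_tags(html: &str) -> bool {
    html.contains("property=\"og:")
}

/// Tries each user agent in order and returns the first response that
/// contains OpenGraph tags, together with the agent that produced it.
/// `fetch` stands in for the real HTTP call and returns `None` on error.
fn fetch_with_fallback<F>(
    url: &str,
    user_agents: &[&str],
    mut fetch: F,
) -> Option<(String, String)>
where
    F: FnMut(&str, &str) -> Option<String>,
{
    for &agent in user_agents {
        if let Some(html) = fetch(url, agent) {
            if has_og_tags(&html) {
                return Some((agent.to_string(), html));
            }
        }
    }
    // No user agent produced a page with metadata.
    None
}
```

Calling it with `&["Lemmy", "GoogleBot"]` sends the second request only when the first one yields no metadata, so sites that already work with the `Lemmy` user agent never see the `GoogleBot` one.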
Requirements
Is your proposal related to a problem?
A lot of newspaper websites show a cookie wall, which prevents the OpenGraph metadata from being fetched for link previews.
After I contacted some of them, they told me to just use the `GoogleBot` user-agent for fetching metadata in forum software. (Strange advice from a large company, but this was their official solution.)
So after I modified the Lemmy source to identify as Google when scraping metadata, things worked fine for those newspapers.
But I noticed that for a couple of other websites, it stopped working. For example, links to `lemmy.world` show a Cloudflare error when using `GoogleBot`, probably because of some bot protection mechanism. Other websites stopped working because they present a different page to Google without any metadata.
So unfortunately, switching to `GoogleBot` only fixes the issue on some domains and creates an issue on others.
Describe the solution you'd like.
It would be really nice if, when fetching the metadata with the `Lemmy` user-agent fails, the fetch were retried once using `GoogleBot`.
Describe alternatives you've considered.
There is no alternative
Additional context
No response