Store blobs #13

Open
snarfed opened this issue Oct 5, 2023 · 3 comments

snarfed commented Oct 5, 2023

Not a priority for Bridgy Fed or me otherwise personally, but we should probably implement blob storage, uploadBlob/getBlob, etc.

snarfed added a commit that referenced this issue Oct 5, 2023
get_or_create fetches a URL, calculates its CID, and stores it

for #13

snarfed commented Nov 8, 2023

Deprioritizing. Shipped remote blobs w/datastore_storage.AtpRemoteBlob for generating blobs for externally hosted files, which is working well enough for my needs.

snarfed changed the title from "Blobs" to "Store blobs" on Aug 28, 2024

snarfed commented Aug 28, 2024

This came up again recently: Bridgy Fed hit a case where (our best guess is) an image URL originally served one format, image/webp, and then later switched to serving another, image/jpeg. We fetched the first image, saw image/webp, stored that and the URL and image CID in an AtpRemoteBlob, and populated that CID and mime type into a blob in a record. Then the URL switched to serving image/jpeg, the Bluesky team's blob scanning fetched it, saw the type mismatch, and complained.

Not storing/hosting media has been convenient for us, for Reasons etc, but it's technically not ATProto compliant, since we can't guarantee that blobs are immutable, ie the URL we redirect getBlob requests to could serve a different image that doesn't match the CID and type we originally created the blob with.
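
To make the failure mode concrete, here is a minimal sketch of the kind of check that catches it: recompute the CID of freshly fetched bytes and compare it, along with the mime type, against what was stored. The function names `raw_sha256_cid` and `blob_still_matches` are hypothetical and not part of arroba or Bridgy Fed. The CID is built by hand with the standard library (CIDv1, raw codec, sha2-256, base32) so the example has no third-party dependencies.

```python
import base64
import hashlib


def raw_sha256_cid(content: bytes) -> str:
    """Returns the base32 CIDv1 string (raw codec, sha2-256) for some bytes."""
    digest = hashlib.sha256(content).digest()
    # CIDv1 bytes: version 0x01, raw codec 0x55, then the multihash:
    # sha2-256 code 0x12, digest length 0x20 (32 bytes), digest.
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase base32 is lowercase, unpadded, and prefixed with 'b'.
    return 'b' + base64.b32encode(cid_bytes).decode('ascii').lower().rstrip('=')


def blob_still_matches(stored_cid, stored_mime_type, content, content_type):
    """Hypothetical check: does a fresh fetch still match the stored blob?"""
    # Ignore parameters like '; charset=...' when comparing types.
    fetched_type = (content_type or '').split(';')[0].strip().lower()
    return (raw_sha256_cid(content) == stored_cid
            and fetched_type == (stored_mime_type or '').lower())
```

If the URL later serves different bytes, as in the WEBP to JPEG case above, both the CID and the type comparison fail, which is roughly what a blob scanner on the other end would observe.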

cc @ericvolp12


snarfed commented Aug 28, 2024

Specifically, the post that triggered this was:

Maybe the original image here was WEBP, and the JPEG is a downstream transcoding, and the CMS does that transcoding in the background, after the article is published, and serves the original image until the JPEG is ready? Maybe a bit of a stretch, but not too much? I dunno.

Here's our code for this:

def get_or_create(cls, *, url=None, get_fn=requests.get):
    """Returns a new or existing :class:`AtpRemoteBlob` for a given URL.

    If there isn't an existing :class:`AtpRemoteBlob`, fetches the URL over
    the network and creates a new one for it.

    Args:
      url (str)
      get_fn (callable): for making HTTP GET requests

    Returns:
      AtpRemoteBlob: existing or newly created :class:`AtpRemoteBlob`

    Raises:
      requests.RequestException: if the HTTP request to fetch the blob failed
    """
    assert url

    existing = cls.get_by_id(url)
    if existing:
        return existing

    resp = get_fn(url)
    resp.raise_for_status()

    mime_type = resp.headers.get('Content-Type')
    if not mime_type:
        mime_type, _ = mimetypes.guess_type(url)

    digest = multihash.digest(resp.content, 'sha2-256')
    cid = CID('base58btc', 1, 'raw', digest).encode('base32')

    logger.info(f'Creating new AtpRemoteBlob for {url} CID {cid}')
    mime_type_prop = {'mime_type': mime_type} if mime_type else {}
    blob = cls(id=url, cid=cid, size=len(resp.content), **mime_type_prop)
    blob.put()
    return blob
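
For contrast, here is a rough sketch of what storing the bytes themselves might look like, which is what this issue is about. It is an assumption-laden illustration, not arroba's design: `InMemoryBlobStore` and `raw_sha256_cid` are hypothetical names, a dict stands in for a real datastore, and the CID is computed by hand with the standard library instead of the multiformats calls used above. Because the key is derived from the content, whatever is returned for a CID always matches it, no matter what the original URL serves later.

```python
import base64
import hashlib


def raw_sha256_cid(content: bytes) -> str:
    """Base32 CIDv1 string (raw codec, sha2-256), built by hand."""
    digest = hashlib.sha256(content).digest()
    # version 0x01, raw codec 0x55, sha2-256 code 0x12, length 0x20, digest
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    return 'b' + base64.b32encode(cid_bytes).decode('ascii').lower().rstrip('=')


class InMemoryBlobStore:
    """Hypothetical stand-in for real blob storage; not arroba's API."""

    def __init__(self):
        self._blobs = {}  # CID string => (mime type, bytes)

    def put(self, content: bytes, mime_type: str) -> str:
        """Stores bytes under their CID and returns that CID."""
        cid = raw_sha256_cid(content)
        # Keep the first copy: the same CID always means the same bytes.
        self._blobs.setdefault(cid, (mime_type, content))
        return cid

    def get(self, cid: str):
        """Returns (mime type, bytes); raises KeyError for an unknown CID."""
        return self._blobs[cid]
```

The tradeoff is the one described earlier in this thread: hosting the bytes gives immutability, at the cost of actually storing and serving media.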
