This repository has been archived by the owner on Jan 4, 2021. It is now read-only.

Use the hashed contents of a file for media hashes #584

Open
hdodov opened this issue Jul 21, 2020 · 3 comments


hdodov commented Jul 21, 2020

Problem

I guess Kirby builds the media hash from the hashed file ID (with the file root as salt) and the file's modification date for performance reasons. Sure, the modification date and the file ID are good indicators of whether a file has changed, but that's not always the case, as I've described in #583. Only the actual file content can guarantee it.

Solution

On average, hashing the contents of a 10MB+ file takes about 0.03 seconds, according to my tests. While that's way longer than hashing its file ID, it's still not that much if done only once. Perhaps there's a smart way to hash the file contents and "cache" that hash for future usage?
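
To make the timing claim easy to re-check, here is a minimal sketch in Python (Kirby itself is PHP, so this is only an illustration, not Kirby code). The function name `hash_file_contents` and the choice of MD5 are assumptions; the timing you get will depend on your disk and CPU.

```python
import hashlib
import os
import tempfile
import time


def hash_file_contents(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()  # assumption: MD5; any stable digest would do
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Create a throwaway 10 MB file of random bytes to time against.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(10 * 1024 * 1024))
    try:
        start = time.perf_counter()
        print(hash_file_contents(tmp.name))
        print(f"{time.perf_counter() - start:.3f} s")
    finally:
        os.remove(tmp.name)
```

Reading in chunks keeps memory flat even for much larger files, which matters if the hash is ever computed during a request.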

The first thing that comes to mind is the meta TXT files that Kirby creates in the content folder for each file. The hash can be stored there and used by the media component. Since the meta file and the actual file go hand in hand, that shouldn't be so crazy? I mean - you replace an image in the panel, Kirby calculates its hash and stores it in its meta file. Later, when you request that image, Kirby uses that hash for its media logic.

The obvious downside is that if you're not using the panel, you need to fill in those hashes manually in the content files or just add unique random strings. For that use case, the current implementation is great. This is why Kirby should keep the current behavior as the default and only use the content hash if it's present in the file's TXT.

So there could be two options:

1. Generate file hashes

An option that tells Kirby to generate the hash of a file each time it's updated and store it in that file's meta file. Aside from media usage, this could be useful for other things, I guess. You could have plugins that utilize this feature.

2. Use hashes for media

By default, Kirby makes up the media hash as it does now. But if you specify this new option, it looks for a hash field in the TXT meta file and if it finds one, it uses that. I guess it would be easy to implement - in the File constructor, you just check if there's a hash field and store it as a property. Then, the mediaHash() method uses that property.
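
The two options together can be sketched roughly like this. It's Python rather than PHP and is not Kirby's parser or API: `parse_meta_fields`, `media_hash`, the `use_content_hash` flag and the `fallback_hash` callable are all hypothetical names, and the meta file format is assumed to be `Key: value` fields separated by `----` lines.

```python
import re


def parse_meta_fields(text):
    """Parse 'Key: value' fields separated by lines of four dashes.

    Assumption: this mirrors the general shape of a content TXT file,
    not Kirby's actual parser.
    """
    fields = {}
    for block in re.split(r"\n----\s*\n", text):
        key, sep, value = block.strip().partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def media_hash(meta_text, fallback_hash, use_content_hash=True):
    """Prefer a stored 'hash' field; otherwise keep the existing behaviour.

    `fallback_hash` is a hypothetical callable standing in for whatever
    the current media hash logic does.
    """
    if use_content_hash:
        stored = parse_meta_fields(meta_text).get("hash")
        if stored:
            return stored
    return fallback_hash()
```

With the option off, or with no `hash` field in the TXT, nothing changes for existing sites, which is the point of making it opt-in.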


Having a way to use the actual hashed contents of the file would be ideal for cases like #583. Maybe it's overkill for basic sites, but if you have a more advanced setup - that would be a lifesaver.


hdodov commented Jul 22, 2020

I tested this:

> I guess it would be easy to implement - in the File constructor, you just check if there's a hash field and store it as a property. Then, the mediaHash() method uses that property.

...and it took about 3 lines of code. It worked.


Another solution that won't use meta files and will work without the panel is to use the file's modification date only to tell when to calculate the hash. For example, you could have media/pages/foo/bar.png.hash that simply contains the hash of bar.png. When you create media images, you compare the modification date of the .hash file to the modification date of the original file in the content folder. If the original file was modified after the hash file - Kirby has to recalculate the hash and update the hash file. Later, Kirby uses the calculated hash in the URLs:

https://example.com/media/pages/foo/2afd4a36c6/bar.png

This solution wouldn't have the problems of #583. Sure, it still relies on modification dates, but they are not part of the media hash. They are merely used to determine when to calculate the hash. The hash itself will always be the same.
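
A rough sketch of that `.hash` sidecar logic, in Python for illustration only (not Kirby code). The function name `cached_content_hash` and the use of MD5 are assumptions; the comparison rule is the one described above.

```python
import hashlib
from pathlib import Path


def cached_content_hash(source, hash_file):
    """Return the content hash of `source`, recomputing only when it is stale."""
    source, hash_file = Path(source), Path(hash_file)

    # The sidecar is fresh if it was written at or after the last change
    # to the original file; modification dates only gate the recompute.
    if hash_file.exists() and hash_file.stat().st_mtime >= source.stat().st_mtime:
        return hash_file.read_text().strip()

    # Stale or missing: hash the contents and refresh the sidecar.
    digest = hashlib.md5(source.read_bytes()).hexdigest()  # assumption: MD5
    hash_file.parent.mkdir(parents=True, exist_ok=True)
    hash_file.write_text(digest)
    return digest
```

Note that the modification date never ends up in the returned value, so touching a file without changing it costs one re-hash but yields the same media URL.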


hdodov commented Jul 22, 2020

A more optimized solution based on the one above is to have a flat structure of all files in the entire site, since they all have hashes. For example, you still have the site and pages folders to store the .hash files:

media/pages/foo/bar.png.hash

But the file itself is stored like this:

media/files/2afd4a36c69a2e3f5852f6d7d4078609/file.png
media/files/2afd4a36c69a2e3f5852f6d7d4078609/800x.png // variant
media/files/2afd4a36c69a2e3f5852f6d7d4078609/1600x.png // variant

The media URL is like this:

https://example.com/media/2afd4a36c69a2e3f5852f6d7d4078609/pages/foo/bar.png

The file ID pages/foo/bar.png is used just for SEO. What's needed is just the hash and the filename, from which the variant can be deduced. The main benefit of this is that if you have the same file in multiple pages or you rename it in the same page - there won't be any new variants generated. You will always have one set of variants for a given file regardless of the pages in which it exists, what name it has, and when it was last modified. Even if you delete a file and upload it again later - it would still have its variants because the hash will be the same.
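
To illustrate why the variants end up shared, here is a small Python sketch (illustration only, not Kirby code). `variant_path` and `media_url` are hypothetical helpers, and MD5 is an assumption; the path and URL shapes follow the examples above.

```python
import hashlib
from pathlib import PurePosixPath


def variant_path(content, variant_name, media_root="media/files"):
    """Build the storage path of a variant from the file contents alone."""
    digest = hashlib.md5(content).hexdigest()  # assumption: MD5
    return str(PurePosixPath(media_root) / digest / variant_name)


def media_url(content, file_id, base="https://example.com/media"):
    """Build the public URL; the file ID is carried only for readability."""
    digest = hashlib.md5(content).hexdigest()
    return f"{base}/{digest}/{file_id}"


# The same bytes under two different file IDs share one variant folder.
image = b"same image bytes"
assert variant_path(image, "800x.png") == variant_path(image, "800x.png")
assert media_url(image, "pages/foo/bar.png") != media_url(image, "pages/baz/renamed.png")
```

Since neither the page nor the modification date feeds into `variant_path`, renaming, moving or re-uploading the file maps to the same folder.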


hdodov commented Aug 6, 2020

Yet another benefit of this approach shows up when you have a static site.

I have a site that I deploy to Netlify and it gets built on each push to the master branch of my repo. Then, Netlify builds the assets and generates all pages plus media files. The problem is that the modification date of each file in the content folder changes on every build, which makes the media hashes different each time. This invalidates CDN caches for no reason. If media hashes are derived from the file contents, caches will be invalidated only when the actual content has changed.

Even further, if the media folder contains the hash of each file (in a .hash file), then those hashes could be hashed yet again to form a "summary hash" that tells whether there were changes in the media files. If the summary hash is the same between two deploys, that means all .hash files have not changed and therefore all media files have not changed. This could be pretty useful. For instance, you could have some sort of caching strategy for the media folder similar to how you can speed up deploys in GitHub Actions by caching dependencies.
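
A possible shape for that "summary hash", sketched in Python (illustration only). The function name `summary_hash` is hypothetical, and mixing each file's relative path into the digest is my own assumption so that a renamed `.hash` file also counts as a change.

```python
import hashlib
from pathlib import Path


def summary_hash(media_root):
    """Combine every .hash file under `media_root` into one digest."""
    root = Path(media_root)
    digest = hashlib.md5()  # assumption: MD5, matching the per-file hashes
    # Sort by relative path so the result does not depend on traversal order.
    hash_files = sorted(
        root.rglob("*.hash"), key=lambda p: p.relative_to(root).as_posix()
    )
    for hash_file in hash_files:
        digest.update(hash_file.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hash_file.read_text().strip().encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Comparing the value from the previous deploy with the current one would then be enough to decide whether a cached media folder can be reused as is.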
