Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit all fields with varchar of 255 characters to 190 unicode characters. #2151

Closed
wants to merge 3 commits into from

Conversation

Daniel-KM
Copy link
Contributor

With utf8-mb4, it is not possible to index many fields that are 255 characters length, since the index is limited to 767 bits, so 191 characters.

Initially, i had a big performance issue on media-type: i have some bases with more than 500000 files, that are archives scanned page by page, and a simple select distinct media_type from media take more than 5 seconds, but that is instant with an index. So the first commit fixes this point.

The next one fixes other fields with the same limit, but i didn't add indexes for now (but i think i need some of them, like ingester/renderer, extension, etc., and type and lang in the table value should be 190 characters too to be indexed.

The last commit does the same for some labels and names, but they are less important since the biggest table are media and value.

@zerocrates
Copy link
Member

I'm slightly of two minds on this one; obviously we've done it elsewhere and for most/all of this any reasonable values won't be expected to come near the limits.

On the other hand, innodb_large_prefix is on by default in 5.7 and not even a setting at all in 8.0+, and similar in MariaDB. Both minimum supported versions also use the "Barracuda" format by default. So we could pretty reliably rely on support for 3072-byte prefixes (768 utf8mb4 characters) rather than the old 767 (191 characters), and in a certain sense we'd be catering to a pretty much obsolete restriction.

Still, it may be a good idea simply because we don't really need the extra length and doing it avoids problems with corner cases like databases with odd settings or with tables/tablespaces with old file and/or row formats.

@Daniel-KM
Copy link
Contributor Author

Yes, in our cases, it was an old mariadb database that was upgraded (10.0.38) to a recent one (10.6.16), but the indexes were still 787 bytes, so it seems there was an upgrade issue or a setting somewhere that wasn't updated or the indexes in tables were not informed about this new limit. All settings were default ones.

@Daniel-KM
Copy link
Contributor Author

So the issue may occur on old omeka s installations.

@zerocrates
Copy link
Member

Thanks, that's useful information.

@zerocrates
Copy link
Member

The one specifically for media type since it goes with an added index I don't have any problem with, (that is, the first commit here). The others that are shrunken on a more general or "just in case" basis, I'm less sure on.

For some of those like module versions or job status, I might prefer only (or additionally) setting the collation to latin1 or ascii so it's single-byte and just restricting what the possible values are.... but I'd have to think about whether it's worth it to bother with that. In some ways cutting down the length as you're doing is less risky for some/many/all of these. Where we aren't actually indexing and the value is user-provided like the many label columns here, I'd probably just as soon avoid doing anything unless/until necessary.

@zerocrates
Copy link
Member

I've cherry-picked the media type index change. I think we'll hold off on changing the others prospectively unless/until we have a specific need for them.

@zerocrates zerocrates closed this Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants