Limit all fields with varchar of 255 characters to 190 unicode characters. #2151

Daniel-KM · 2024-02-19T10:25:12Z

With utf8mb4, it is not possible to index many fields that are 255 characters long, since the index is limited to 767 bytes, so 191 characters.
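The arithmetic behind that limit can be sketched as follows. This is only an illustration, assuming the legacy InnoDB index prefix limit of 767 bytes and a worst case of 4 bytes per utf8mb4 character; the helper name `max_indexable_chars` is hypothetical.

```python
# Assumptions (not taken from the Omeka S code base):
# - 767 bytes is the legacy InnoDB index key prefix limit.
# - utf8mb4 reserves up to 4 bytes per character.
LEGACY_PREFIX_BYTES = 767
UTF8MB4_BYTES_PER_CHAR = 4


def max_indexable_chars(prefix_bytes: int, bytes_per_char: int) -> int:
    """Return the longest column length, in characters, that fits in the prefix."""
    return prefix_bytes // bytes_per_char


limit = max_indexable_chars(LEGACY_PREFIX_BYTES, UTF8MB4_BYTES_PER_CHAR)
varchar_255_bytes = 255 * UTF8MB4_BYTES_PER_CHAR

print(limit)              # 191
print(varchar_255_bytes)  # 1020, which exceeds the 767-byte limit
```

A length of 190 leaves one character of margin under that computed maximum.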

Initially, I had a big performance issue on media type: I have some databases with more than 500,000 files, which are archives scanned page by page, and a simple select distinct media_type from media takes more than 5 seconds, but is instant with an index. So the first commit fixes this point.
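The effect of the index on that query can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL/MariaDB, so the query planner output differs from what the real database would show; the sample rows and the index name `idx_media_type` are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the real MySQL/MariaDB one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media (id INTEGER PRIMARY KEY, media_type VARCHAR(190))"
)
conn.executemany(
    "INSERT INTO media (media_type) VALUES (?)",
    [("image/jpeg",), ("image/tiff",), ("application/pdf",)] * 1000,
)

query = "SELECT DISTINCT media_type FROM media"


def query_plan(connection, sql):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn, query)
conn.execute("CREATE INDEX idx_media_type ON media (media_type)")
plan_after = query_plan(conn, query)

distinct_types = sorted(row[0] for row in conn.execute(query))

print(plan_before)
print(plan_after)
print(distinct_types)
```

Once the index exists, the planner can read the distinct values from the index instead of scanning every row of the table.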

The next one fixes other fields with the same limit, but I didn't add indexes for now. I think I need some of them, though, like ingester/renderer, extension, etc., and type and lang in the table value should be 190 characters too in order to be indexed.

The last commit does the same for some labels and names, but they are less important since the biggest tables are media and value.

zerocrates · 2024-02-22T02:48:45Z

I'm slightly of two minds on this one; obviously we've done it elsewhere and for most/all of this any reasonable values won't be expected to come near the limits.

On the other hand, innodb_large_prefix is on by default in 5.7 and not even a setting at all in 8.0+, and similar in MariaDB. Both minimum supported versions also use the "Barracuda" format by default. So we could pretty reliably rely on support for 3072-byte prefixes (768 utf8mb4 characters) rather than the old 767 (191 characters), and in a certain sense we'd be catering to a pretty much obsolete restriction.

Still, it may be a good idea simply because we don't really need the extra length and doing it avoids problems with corner cases like databases with odd settings or with tables/tablespaces with old file and/or row formats.

Daniel-KM · 2024-02-22T12:39:25Z

Yes, in our case, it was an old MariaDB database (10.0.38) that was upgraded to a recent version (10.6.16), but the indexes were still limited to 767 bytes, so it seems there was an upgrade issue, or a setting somewhere that wasn't updated, or the indexes in the tables were not informed about the new limit. All settings were the default ones.

Daniel-KM · 2024-02-22T12:40:23Z

So the issue may occur on old Omeka S installations.

zerocrates · 2024-02-22T17:55:05Z

Thanks, that's useful information.

zerocrates · 2024-07-02T21:40:59Z

I don't have any problem with the one specifically for media type (that is, the first commit here), since it goes with an added index. The others, which are shrunk on a more general or "just in case" basis, I'm less sure about.

For some of those, like module versions or job status, I might prefer only (or additionally) setting the character set/collation to latin1 or ascii so it's single-byte, and just restricting what the possible values are... but I'd have to think about whether it's worth it to bother with that. In some ways cutting down the length as you're doing is less risky for some/many/all of these. Where we aren't actually indexing and the value is user-provided, like the many label columns here, I'd probably just as soon avoid doing anything unless/until necessary.
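To make the trade-off between the two options concrete, here is a rough sketch comparing the worst-case index size of a column under each character set. The byte widths are the usual maxima for these MySQL character sets; the 767-byte limit and the helper name `fits_legacy_index` are assumptions for illustration only.

```python
# Assumed legacy InnoDB index prefix limit, in bytes.
LEGACY_PREFIX_BYTES = 767

# Maximum bytes per character for each character set.
MAX_BYTES_PER_CHAR = {"ascii": 1, "latin1": 1, "utf8mb4": 4}


def fits_legacy_index(length_chars: int, charset: str) -> bool:
    """True if a full-length value of this column fits in the legacy prefix."""
    return length_chars * MAX_BYTES_PER_CHAR[charset] <= LEGACY_PREFIX_BYTES


for charset in MAX_BYTES_PER_CHAR:
    print(charset, fits_legacy_index(255, charset), fits_legacy_index(190, charset))
```

With a single-byte character set, the full 255 characters fit under the old limit; with utf8mb4, only the shortened 190-character column does.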

zerocrates · 2024-09-10T20:40:57Z

I've cherry-picked the media type index change. I think we'll hold off on changing the others prospectively unless/until we have a specific need for them.

Daniel Berthereau added 3 commits February 19, 2024 00:00

Added an index on media media type.

89a3864

Limited short fields to 190 unicode characters to simplify indexation.

d6ad71b

Limited short labels to 190 unicode characters maximum to simplify indexation.

3dcde0a

zerocrates closed this Sep 10, 2024
