-
-
Notifications
You must be signed in to change notification settings - Fork 844
ICU-23264 Add validateAndGet function to MeasureUnit for efficient unit validation #3780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
1e11c1a to
3008872
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
3008872 to
c8b4ce7
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
c8b4ce7 to
a3d8ce5
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
…lidation This new method checks if a specific unit is available for a given type and subtype, improving performance by avoiding the need to retrieve all available units. Updated parseMeasureUnitOption to utilize this method for better error handling when validating measure units.
a3d8ce5 to
533d5ba
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
markusicu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for jumping on this so quickly!
- The changes lgtm.
- Please add a unit test that fails without these changes.
- I assume that Java has a similar bug or opportunity?
icu4c/source/i18n/measunit.cpp
Outdated
| } | ||
|
|
||
| // Find the subtype within the type's range using binary search | ||
| int32_t subtypeIdx = binarySearch(gSubTypes, gOffsets[typeIdx], gOffsets[typeIdx + 1], subtype); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does gOffsets[] have an entry at max typeIndex + 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, gOffsets[] includes an entry at max typeIndex + 1.
Evidence:
- gTypes[] contains 24 elements (indices 0–23), so the maximum valid typeIndex is 23.
- gOffsets[] contains 25 elements (indices 0–24), as defined in measunit.cpp (lines 39–65).
- gSubTypes[] has 538 elements (indices 0-537)
- The last element, gOffsets[24] = 538, serves as the end boundary (i.e., total size).
- The existing code safely accesses gOffsets[typeIndex + 1] at line 2683:
int32_t len = gOffsets[typeIndex + 1] - gOffsets[typeIndex]; - An assertion at line 2753 validates this arrangement:
U_ASSERT(gOffsets[UPRV_LENGTHOF(gOffsets) - 1] == UPRV_LENGTHOF(gSubTypes));
However, I think it would be beneficial to:
• Add a comment next to the sentinel value 538 to clarify its purpose (e.g., "end boundary").
• Consider using a helper function to retrieve the range for a specific typeIndex, instead of directly using gOffsets[typeIndex] and gOffsets[typeIndex + 1].
thanks
Sure, I will add one
No, Java does not have the same bug . The Java implementation uses a fundamentally different approach:
Set<MeasureUnit> units = MeasureUnit.getAvailable(type);
for (MeasureUnit unit : units) {
if (subType.equals(unit.getSubtype())) { ... }
}
I think Java version has opportuniy to benefit from a direct lookup method like: let me see what I can do there |
|
There can be an even better optimization. The Types are completely superfluous for all processing except looking up the unit names in CLDR. The most efficient implementation in Java would be to have subtypes be enums (eg UnitSubtype). All internal calculations would just use those enums. Using EnumSet for sets of elements, the lookup is O(1), even better than hash lookups. Same for EnumMaps. Each UnitSubtype would have-a String type and a MeasureUnitImpl measureUnitImpl. A MeasureUnit would have exactly one field, a UnitSubtype. (The only reason for keeping the MessageUnit API is for backwards compatibility.) The only downside to this is that anytime CLDR adds units, the UnitSubtype would need to be extended. It would be straightforward however, to generate the enums from the CLDR units.xml file when integrating. CC @sffc |
|
I will be OOO next two weeks, so I need to add an update @markusicu , I have addressed all of your comments, and still need to adjust some parts of my implementation in the C++ version @macchiati : Thanks a lot for your suggestions, I will consider them once I return back. FYI, in the java version, the search already is O(1). |
Description
validateAndGetfunction checks if a specific unit is available for a given type and subtype, improving performance by avoiding the need to retrieve all available units.parseMeasureUnitOptionto utilize the new function for better error handling when validating measure units.Checklist