Conversation

@younies
Member

@younies younies commented Nov 20, 2025

Description

  1. Added a validateAndGet function that checks whether a specific unit is available for a given type and subtype, improving performance by avoiding the need to retrieve all available units.
  2. Updated parseMeasureUnitOption to utilize the new function for better error handling when validating measure units.

Checklist

  • Required: Issue filed: ICU-23264
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

@younies younies changed the title ICU-23264 Add isAvailable method to MeasureUnit for efficient unit validation ICU-23264 Add validateAndGet method to MeasureUnit for efficient unit validation Nov 20, 2025
@younies younies changed the title ICU-23264 Add validateAndGet method to MeasureUnit for efficient unit validation ICU-23264 Add validateAndGet function to MeasureUnit for efficient unit validation Nov 20, 2025
@younies younies force-pushed the remove-depending-on-limit branch from 1e11c1a to 3008872 Compare November 20, 2025 20:14
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/measunit.cpp is different
  • icu4c/source/i18n/number_skeletons.cpp is different
  • icu4c/source/i18n/unicode/measunit.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/icu that referenced this pull request Nov 20, 2025
@younies younies force-pushed the remove-depending-on-limit branch from 3008872 to c8b4ce7 Compare November 20, 2025 20:14
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/icu that referenced this pull request Nov 20, 2025
@younies younies force-pushed the remove-depending-on-limit branch from c8b4ce7 to a3d8ce5 Compare November 20, 2025 20:15
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

…lidation

This new method checks if a specific unit is available for a given type and subtype, improving performance by avoiding the need to retrieve all available units. Updated parseMeasureUnitOption to utilize this method for better error handling when validating measure units.
@younies younies force-pushed the remove-depending-on-limit branch from a3d8ce5 to 533d5ba Compare November 20, 2025 22:37
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/measunit.cpp is different
  • icu4c/source/i18n/number_skeletons.cpp is different
  • icu4c/source/i18n/unicode/measunit.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu markusicu self-assigned this Nov 21, 2025
@markusicu markusicu requested a review from richgillam November 21, 2025 17:58
Member

@markusicu markusicu left a comment

Thanks for jumping on this so quickly!

  • The changes lgtm.
  • Please add a unit test that fails without these changes.
  • I assume that Java has a similar bug or opportunity?

}

// Find the subtype within the type's range using binary search
int32_t subtypeIdx = binarySearch(gSubTypes, gOffsets[typeIdx], gOffsets[typeIdx + 1], subtype);
Member

Does gOffsets[] have an entry at max typeIndex + 1?

Member Author

@younies younies Nov 24, 2025

Yes, gOffsets[] includes an entry at max typeIndex + 1.

Evidence:

  1. gTypes[] contains 24 elements (indices 0–23), so the maximum valid typeIndex is 23.
  2. gOffsets[] contains 25 elements (indices 0–24), as defined in measunit.cpp (lines 39–65).
  3. gSubTypes[] contains 538 elements (indices 0–537).
  4. The last element, gOffsets[24] = 538, serves as the end boundary (i.e., total size).
  5. The existing code safely accesses gOffsets[typeIndex + 1] at line 2683:
    int32_t len = gOffsets[typeIndex + 1] - gOffsets[typeIndex];
  6. An assertion at line 2753 validates this arrangement:
    U_ASSERT(gOffsets[UPRV_LENGTHOF(gOffsets) - 1] == UPRV_LENGTHOF(gSubTypes));

However, I think it would be beneficial to:
• Add a comment next to the sentinel value 538 to clarify its purpose (e.g., "end boundary").
• Consider using a helper function to retrieve the range for a specific typeIndex, instead of directly using gOffsets[typeIndex] and gOffsets[typeIndex + 1].
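The helper suggested above could look roughly like the following. This is a minimal sketch on toy data, not the code in measunit.cpp: the tables kTypes, kOffsets, and kSubTypes are small hypothetical stand-ins for gTypes, gOffsets, and gSubTypes, and subtypeRange, findInRange, and findSubtype are illustrative names that do not exist in ICU.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy stand-ins for ICU's gTypes, gOffsets, and gSubTypes (hypothetical data).
// Types are sorted, and subtypes are sorted within each type's range.
constexpr const char* kTypes[] = {"duration", "length", "mass"};
constexpr const char* kSubTypes[] = {
    "hour", "minute", "second",  // duration: [0, 3)
    "foot", "meter", "mile",     // length:   [3, 6)
    "gram", "pound",             // mass:     [6, 8)
};
// One entry per type plus a final end boundary equal to the subtype count.
constexpr int32_t kOffsets[] = {0, 3, 6, 8 /* end boundary */};

constexpr int32_t kTypeCount = sizeof(kTypes) / sizeof(kTypes[0]);
constexpr int32_t kSubTypeCount = sizeof(kSubTypes) / sizeof(kSubTypes[0]);

// Same invariants the thread relies on: N types need N + 1 offsets,
// and the last offset is the total number of subtypes.
static_assert(sizeof(kOffsets) / sizeof(kOffsets[0]) == kTypeCount + 1,
              "kOffsets needs one end-boundary entry");
static_assert(kOffsets[kTypeCount] == kSubTypeCount,
              "end boundary must equal the subtype count");

struct SubtypeRange {
    int32_t start;  // inclusive
    int32_t limit;  // exclusive
};

// Hypothetical helper: the only place that reads kOffsets[typeIdx + 1].
inline SubtypeRange subtypeRange(int32_t typeIdx) {
    return {kOffsets[typeIdx], kOffsets[typeIdx + 1]};
}

// Binary search over sorted C strings in [start, limit); returns index or -1.
inline int32_t findInRange(const char* const* array, int32_t start,
                           int32_t limit, const char* key) {
    while (start < limit) {
        int32_t mid = start + (limit - start) / 2;
        int cmp = std::strcmp(array[mid], key);
        if (cmp < 0) {
            start = mid + 1;
        } else if (cmp > 0) {
            limit = mid;
        } else {
            return mid;
        }
    }
    return -1;
}

// Looks up one (type, subtype) pair without enumerating all units of the type.
inline int32_t findSubtype(const char* type, const char* subtype) {
    int32_t typeIdx = findInRange(kTypes, 0, kTypeCount, type);
    if (typeIdx < 0) {
        return -1;
    }
    SubtypeRange range = subtypeRange(typeIdx);
    return findInRange(kSubTypes, range.start, range.limit, subtype);
}
```

With this toy data, findSubtype("mass", "pound") exercises the end-boundary entry, since "mass" is the last type and its limit comes from the final element of kOffsets.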

@younies
Member Author

younies commented Nov 24, 2025

Thanks for jumping on this so quickly!

  • The changes lgtm.

thanks

  • Please add a unit test that fails without these changes.

Sure, I will add one

  • I assume that Java has a similar bug or opportunity?

No, Java does not have the same bug.

The Java implementation uses a fundamentally different approach:

  1. Java's MeasureUnit.getAvailable(type) returns a Set, not a fixed array
  2. The skeleton parsing in NumberSkeletonImpl.java (line 1098) uses:
   Set<MeasureUnit> units = MeasureUnit.getAvailable(type);
   for (MeasureUnit unit : units) {
       if (subType.equals(unit.getSubtype())) { ... }
   }
  3. No CAPACITY constant or buffer overflow risks exist in Java.
  4. The inefficiency is O(n) iteration over the Set.

I think the Java version has an opportunity to benefit from a direct lookup method like:
MeasureUnit.forTypeAndSubtype(type, subtype)

Let me see what I can do there.

@macchiati
Member

There can be an even better optimization. The Types are completely superfluous for all processing except looking up the unit names in CLDR.

The most efficient implementation in Java would be to have subtypes be enums (e.g., UnitSubtype). All internal calculations would just use those enums. With EnumSet for sets of elements, the lookup is O(1), even better than hash lookups. The same goes for EnumMap.

Each UnitSubtype would have a String type and a MeasureUnitImpl measureUnitImpl. A MeasureUnit would have exactly one field, a UnitSubtype. (The only reason for keeping the MeasureUnit API is for backwards compatibility.)

The only downside to this is that anytime CLDR adds units, the UnitSubtype would need to be extended. It would be straightforward however, to generate the enums from the CLDR units.xml file when integrating.

CC @sffc

@younies
Member Author

younies commented Nov 30, 2025

I will be OOO for the next two weeks, so here is an update.

@markusicu, I have addressed all of your comments, but I still need to adjust some parts of my implementation in the C++ version.

@macchiati: Thanks a lot for your suggestions; I will consider them once I return. FYI, in the Java version, the search is already O(1).

Labels

None yet

Projects

None yet

3 participants