ICU-22928 Don't generate data for concentr/perbillion structure #3217

Merged
mihnita merged 3 commits into unicode-org:main
Sep 27, 2024

Conversation

mihnita
Contributor

@mihnita mihnita commented Sep 27, 2024

Checklist

  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22928
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

@mihnita
Contributor Author

mihnita commented Sep 27, 2024

Comments about the binary updates:

The changes to the files in /unit/ are expected.

The rest of the files changed because the ICU version was changed from 76.0.1 to 76.1.0.

I checked a few .cnv files (hex), and the only difference is the version:

00000010  06 02 00 00 4c 00 01 00  20 43 6f 70 79 72 69 67  |....L... Copyrig|
00000010  06 02 00 00 4c 01 00 00  20 43 6f 70 79 72 69 67  |....L... Copyrig|
                      ^^ ^^ ^^
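
As an illustration of the comparison above, here is a minimal Python sketch (not part of the PR) that diffs the two 16-byte rows and prints the marked bytes as a dotted version. The helper names are hypothetical, and reading the three bytes starting at row index 4 as major.minor.patch is an assumption based on the caret markers.

```python
# The two rows at file offset 0x10, copied from the hex dumps above.
OLD_ROW = bytes.fromhex("06 02 00 00 4c 00 01 00 20 43 6f 70 79 72 69 67")
NEW_ROW = bytes.fromhex("06 02 00 00 4c 01 00 00 20 43 6f 70 79 72 69 67")


def differing_offsets(old, new, base=0):
    """Return the file offsets at which two equally long rows differ."""
    return [base + i for i, (a, b) in enumerate(zip(old, new)) if a != b]


def version_string(row, start=4, length=3):
    """Render `length` bytes starting at `start` as a dotted version.

    Assumption: these bytes hold the version, as the carets suggest.
    """
    return ".".join(str(b) for b in row[start:start + length])


if __name__ == "__main__":
    print([hex(o) for o in differing_offsets(OLD_ROW, NEW_ROW, base=0x10)])
    print(version_string(OLD_ROW), "->", version_string(NEW_ROW))
```

Only two bytes actually differ; the first caret marks 0x4c (decimal 76), which is the unchanged major version.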

Finally, the only other file that changed is
main/core/src/main/resources/com/ibm/icu/impl/data/icudata/icuver.res

And the only difference is again the version, in text form:

00000060  6f 6e 00 aa 00 00 00 34  00 36 00 00 00 37 00 36  |on.....4.6...7.6|
00000070  00 2e 00 30 00 2e 00 31  00 2e 00 30 00 00 00 03  |...0...1...0....|
00000070  00 2e 00 31 00 2e 00 30  00 2e 00 30 00 00 00 03  |...1...0...0....|
                   ^^          ^^                               ^   ^
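
The same check can be sketched for the text form. This is a minimal illustration, assuming the string is stored as big-endian UTF-16 (which the `00 xx` byte pairs suggest) and that it starts at byte 12 of the 0x60 row; the function name is hypothetical.

```python
# Rows copied from the icuver.res hex dumps above.
ROW_60 = bytes.fromhex("6f 6e 00 aa 00 00 00 34 00 36 00 00 00 37 00 36")
OLD_ROW_70 = bytes.fromhex("00 2e 00 30 00 2e 00 31 00 2e 00 30 00 00 00 03")
NEW_ROW_70 = bytes.fromhex("00 2e 00 31 00 2e 00 30 00 2e 00 30 00 00 00 03")


def version_text(row_60, row_70):
    """Decode the version string that spans the two rows.

    Assumption: the string begins at byte 12 of the 0x60 row and ends
    just before the 00 00 terminator at byte 12 of the 0x70 row.
    """
    raw = row_60[12:] + row_70[:12]
    return raw.decode("utf-16-be")


if __name__ == "__main__":
    print(version_text(ROW_60, OLD_ROW_70), "->", version_text(ROW_60, NEW_ROW_70))
```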

@mihnita
Contributor Author

mihnita commented Sep 27, 2024

Note about what was used for data generation.

I was unable to generate files identical to what was already in ICU, no matter where I checked out the cldr and cldr-staging repos.
I tried beta1, beta2, and beta1 + tz data update.

That's because the data generation uses code from cldr, but data from cldr-staging.
The cldr-staging repo does not contain the tz update in beta1.
It got it at beta2, but beta2 also contains some other changes that result in differences elsewhere.

In the end this got me identical results:

  1. checkout cldr at tz update
    d07e48b084 2024-09-20 15:19:03 -0700 Christopher Chapman CLDR-17960 updating tzdata to 2024b (#4065)

  2. checkout cldr-staging at release-46-beta1

  3. copied 4 tz files from cldr to cldr-staging ("cherry-pick" the tz update):

    • common/bcp47/timezone.xml
    • common/supplemental/metaZones.xml
    • common/supplemental/supplementalMetadata.xml
    • common/supplemental/windowsZones.xml

The results are now 100% identical, except for the /unit/ files, where the diffs are the ones we expect, and the version updates (see previous note).

So this is what you see in this PR.
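
Step 3 above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used: it assumes both repos are already checked out at the revisions named in steps 1 and 2, and the function name and directory arguments are hypothetical. Only the four relative paths come from the list above.

```python
from pathlib import Path
import shutil

# The four tz-related files listed in step 3.
TZ_FILES = [
    "common/bcp47/timezone.xml",
    "common/supplemental/metaZones.xml",
    "common/supplemental/supplementalMetadata.xml",
    "common/supplemental/windowsZones.xml",
]


def copy_tz_files(cldr_root, staging_root):
    """Copy the tz files from a cldr checkout into a cldr-staging checkout.

    Both arguments are local checkout directories (hypothetical paths).
    Returns the list of destination paths that were written.
    """
    copied = []
    for rel in TZ_FILES:
        src = Path(cldr_root) / rel
        dst = Path(staging_root) / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # overwrites the staging copy
        copied.append(dst)
    return copied
```

A call would look like `copy_tz_files("path/to/cldr", "path/to/cldr-staging")`, with both paths replaced by the local checkout locations.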

Member

@macchiati macchiati left a comment

It was very surprising that you had to touch the charset files! Otherwise looks good.

And I assume that you found you needed to delete the localized data, and assume the res files are just generated.

@DraganBesevic
Contributor

I noticed the change you made in IcuTextWriter which should filter out all per billion entries. So, if I generate ICU data from current CLDR repo in next integration, I should get no reference to 1e9 in any of ICU data files either, right?

Member

@markusicu markusicu left a comment

lgtm tnx!!

@markusicu markusicu changed the title from "ICU-22928: Don't generate data for concentr/perbillion structure" to "ICU-22928 Don't generate data for concentr/perbillion structure" Sep 27, 2024
@markusicu
Member

I added the ALLOW_MANY_COMMITS=true to the PR description so that you need not squash.
I fixed the PR title.
@mihnita you still need to amend your commits and remove the colons from the commit messages -- just please don't squash.

@mihnita
Contributor Author

mihnita commented Sep 27, 2024

> It was very surprising that you had to touch the charset files! Otherwise looks good.

See my comment about the binary data (#3217 (comment))
The difference is the version update.

We changed from 76.0.1 to 76.1.0.

> And I assume that you found you needed to delete the localized data, and assume the res files are just generated.

Everything is generated; I didn't tinker with anything "by hand".

@mihnita
Contributor Author

mihnita commented Sep 27, 2024

> So, if I generate ICU data from current CLDR repo in next integration, I should get no reference to 1e9 in any of ICU data files either, right?

Yes.
You should be able to review that in the .txt files (the second commit).
These are generated with the updated tool.
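
One way to run that review mechanically is sketched below. This is only an illustration: the function name is hypothetical, the `*.txt` pattern and the search string are assumptions, and a plain substring search for `1e9` may also match unrelated text, so any hit needs a manual look.

```python
from pathlib import Path


def files_mentioning(root, needle, pattern="*.txt"):
    """Return the files under `root` matching `pattern` that contain `needle`.

    An empty result means no matching file mentions the search string.
    """
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            hits.append(path)
    return hits
```

Calling `files_mentioning("path/to/generated/data", "1e9")`, with the path replaced by the local data directory, should return an empty list if the filter works as described.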

@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@mihnita
Contributor Author

mihnita commented Sep 27, 2024

> @mihnita you still need to amend your commits and remove the colons from the commit messages -- just please don't squash.

Thank you. Done.

git rebase --interactive is magic :-)

@markusicu
Member

> git rebase --interactive is magic :-)

Yep :-)

And since the file content diffs didn't change, I don't even have to re-approve. Checks all pass.

Member

@macchiati macchiati left a comment

Great work Mihai

@mihnita mihnita merged commit 808b976 into unicode-org:main Sep 27, 2024
98 checks passed
@mihnita mihnita deleted the mihai_patch_datagen2 branch September 27, 2024 05:55
4 participants