Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLDR-11874 Add conversion of subdivisions to json #4255

Merged
merged 1 commit into from
Jan 14, 2025

Conversation

arjanmels
Copy link
Contributor

@arjanmels arjanmels commented Dec 22, 2024

CLDR-11874

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@CLAassistant
Copy link

CLAassistant commented Dec 22, 2024

CLA assistant check
All committers have signed the CLA.

@arjanmels arjanmels changed the title CLDR-11874: add conversion of subdivisions to json CLDR-11874 Add conversion of subdivisions to json Dec 22, 2024
@arjanmels arjanmels force-pushed the convert_subdivisions_to_json branch from 3b9404d to 2be4ca9 Compare December 22, 2024 18:35
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@arjanmels arjanmels force-pushed the convert_subdivisions_to_json branch from 2be4ca9 to 5b10a04 Compare January 11, 2025 13:07
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/json/LdmlConvertRules.java is now changed in the branch
  • tools/cldr-code/src/main/resources/org/unicode/cldr/json/JSON_config_supplemental.txt is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@arjanmels
Copy link
Contributor Author

Also added containment for subdivisions.

@arjanmels arjanmels removed their assignment Jan 11, 2025
@arjanmels arjanmels force-pushed the convert_subdivisions_to_json branch from 5b10a04 to 9afc856 Compare January 12, 2025 15:31
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/json/Ldml2JsonConverter.java is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/json/LdmlConvertRules.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks great need to test it

@srl295 srl295 self-assigned this Jan 13, 2025
@srl295
Copy link
Member

srl295 commented Jan 13, 2025

        "autas": "Tasmania",
        "auvic": "Victoria",
        "auwa": "Western Australia",
        "AW": "Aruba",
        "AX": "Åland Islands",
        "azabs": "Absheron",
        "azaga": "Agstafa",
        "azagc": "Aghjabadi",
        "azagm": "Agdam",

looks like it includes the territories also, not just subdivisions.

@srl295
Copy link
Member

srl295 commented Jan 13, 2025

Files generated

cldr-core/supplemental/subdivisionContainment.json

{
  "supplemental": {
    "version": {
      "_unicodeVersion": "16.0.0",
      "_cldrVersion": "46"
    },
    "subdivisionContainment": {
      "AD": {
        "_contains": [
          "ad02",
          "ad03",
          "ad04",
          "ad05",
          "ad06",
          "ad07",
          "ad08"
        ]
      },

cldr-subdivisions/subdivisions/en.json

{
  "subdivisions": {
    "localeDisplayNames": {
      "subdivisions": {
        "ad02": "Canillo",
        "ad03": "Encamp",
        "ad04": "La Massana",
        "ad05": "Ordino",
        "ad06": "Sant Julià de Lòria",
        "ad07": "Andorra la Vella",
        "ad08": "Escaldes-Engordany",
        "aeaj": "Ajman",
        "aeaz": "Abu Dhabi",
        "aedu": "Dubai",
        "aefu": "Fujairah",
        "aerk": "Ras al-Khaimah",
        "aesh": "Sharjah",
        "aeuq": "Umm al-Quwain",
        "afbal": "Balkh",

Oddly it only generated en.json and ha.json - do you see something different?

I think those are the only two that aren't provisional.

But also it seems like the common/main and common/subdivision trees aren't being merged together properly, because we should see this data from French for example

                <subdivisions>
                        <subdivision type="gbeng">Angleterre</subdivision>
                        <subdivision type="gbsct">Écosse</subdivision>
                        <subdivision type="gbwls">Pays de Galles</subdivision>
                </subdivisions>

@srl295 srl295 requested a review from macchiati January 13, 2025 18:42
@arjanmels
Copy link
Contributor Author

Oddly it only generated en.json and ha.json - do you see something different?
I think those are the only two that aren't provisional.

You're right for my local builds I set: DRAFTSTATUS=provisional in my local-config.sh

@arjanmels
Copy link
Contributor Author

arjanmels commented Jan 13, 2025

looks like it includes the territories also, not just subdivisions.

This is actually already in the source .xml. Not all territories are included (e.g. "NL", "FR", "GB), it seems only territories, which are not countries. Anyway the json is a direct reproduction of the xml.

@srl295
Copy link
Member

srl295 commented Jan 13, 2025

@macchiati indeed it does seem like this is in the source - is this a bug?

https://github.com/unicode-org/cldr/blob/main/common/subdivisions/en.xml#L211-L212

@macchiati
Copy link
Member

        "autas": "Tasmania",
        "auvic": "Victoria",
        "auwa": "Western Australia",
        "AW": "Aruba",
        "AX": "Åland Islands",
        "azabs": "Absheron",
        "azaga": "Agstafa",
        "azagc": "Aghjabadi",
        "azagm": "Agdam",

looks like it includes the territories also, not just subdivisions.

I think what is happening is that some subdivisions are aliases of country codes. We're treating the country codes as predominant (we'll have better data for them). For the JSON version, I think we can skip the country codes; implementers are on notice that they should use the country codes.

@srl295
Copy link
Member

srl295 commented Jan 13, 2025

I think what is happening is that some subdivisions are aliases of country codes. We're treating the country codes as predominant (we'll have better data for them). For the JSON version, I think we can skip the country codes; implementers are on notice that they should use the country codes.

The XML has

			<subdivision type="AW">Aruba</subdivision>
			<subdivision type="AX">Åland Islands</subdivision>
			<!-- AW : Aruba : NO SUBDIVISIONS
				 AX : Åland Islands : NO SUBDIVISIONS
				 AZ : Azerbaijan
			-->

The spec doesn't address it. It looks like it's just included incorrectly. Unless the goal was to define a subdivision that was just "all of aruba"

The spec by the way at https://unicode.org/reports/tr35/dev/tr35-general.html#Contents gives examples with hyphens in the subdivision codes, and doesn't mention the special subdivisions in common/main.

<!-- from the spec -->
<subdivision type="AL-04">Fier County</subdivision>
<subdivision type="AL-FR">Fier</subdivision> <!-- in AL-04 : Fier County -->
<subdivision type="AL-LU">Lushnjë</subdivision> <!-- in AL-04 : Fier County -->
<subdivision type="AL-MK">Mallakastër</subdivision> <!-- in AL-04 : Fier County -->

@macchiati
Copy link
Member

The spec doesn't address it. It looks like it's just included incorrectly.

That's a bug. We'll want to exclude the country codes from the XML data (and via that, the JSON), and also modify the spec to make it clear that implementations need to look at the subdivision aliases, such as

<subdivisionAlias type="usgu" replacement="GU" reason="overlong"/> <!-- Guam => Guam -->

The spec by the way at https://unicode.org/reports/tr35/dev/tr35-general.html#Contents has dashes in the subdivision codes, and doesn't mention the special subdivisions in common/main.

I was confused at first, because the TOC doesn't contain any subdivision codes. But I think you mean https://unicode.org/reports/tr35/dev/tr35-general.html#locale-display-name-fields, notably

<subdivision type="AL-04">Fier County</subdivision>
<subdivision type="AL-FR">Fier</subdivision> <!-- in AL-04 : Fier County -->
<subdivision type="AL-LU">Lushnjë</subdivision> <!-- in AL-04 : Fier County -->
<subdivision type="AL-MK">Mallakastër</subdivision> <!-- in AL-04 : Fier County -->

Is that what you meant?

Anyway, can you file a ticket for these two items? (They are related since we'll want to change the spec to document the relation between the subdivisions and codes.

@srl295
Copy link
Member

srl295 commented Jan 13, 2025

@arjanmels Can you look into:

@srl295
Copy link
Member

srl295 commented Jan 13, 2025

For the data issue: https://unicode-org.atlassian.net/browse/CLDR-18219 - looks like this was fixed before but is a regression

@arjanmels
Copy link
Contributor Author

This was already being properly converted together with the language, territory etc. aliases (before this PR):

       "subdivisionAlias": {
          "cn11": {
            "_reason": "deprecated",
            "_replacement": "cnbj"
          },

@srl295
Copy link
Member

srl295 commented Jan 14, 2025

@arjanmels OK, sorry i missed it.

I'm going to approve this and look at the coverage issue separately

@srl295 srl295 merged commit 7e1b129 into unicode-org:main Jan 14, 2025
12 checks passed
@arjanmels
Copy link
Contributor Author

@srl295 great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants