From 30eeebea3c4cd9b6d228c91021a8e490b876aa62 Mon Sep 17 00:00:00 2001
From: Peter Edberg <42151464+pedberg-icu@users.noreply.github.com>
Date: Wed, 5 Oct 2022 16:47:51 -0700
Subject: [PATCH] CLDR-16066 fix LDML 42 bad links found by W3C checkLink (#2429)

---
 docs/ldml/tr35-collation.md   |  6 +--
 docs/ldml/tr35-general.md     |  4 +-
 docs/ldml/tr35-info.md        |  2 +-
 docs/ldml/tr35-keyboards.md   |  8 +--
 docs/ldml/tr35-personNames.md |  2 +-
 docs/ldml/tr35.md             | 92 +++++++++++++++++------------------
 6 files changed, 57 insertions(+), 57 deletions(-)

diff --git a/docs/ldml/tr35-collation.md b/docs/ldml/tr35-collation.md
index 9b3fac9ba27..70685c52589 100644
--- a/docs/ldml/tr35-collation.md
+++ b/docs/ldml/tr35-collation.md
@@ -287,7 +287,7 @@ In CLDR, so as to maintain the special collation elements, **U+FFFD..U+FFFF** ar

 ### 2.5 Root Collation Data Files

-The CLDR root collation data files are in the CLDR repository and release, under the path [common/uca/](https://github.com/unicode-org/cldr/tree/main/common/uca/).
+The CLDR root collation data files are in the CLDR repository and release, under the path [common/uca/](https://github.com/unicode-org/cldr/blob/main/common/uca/).

 For most data files there are **\_SHORT** versions available. They contain the same data but only minimal comments, to reduce the file sizes.

@@ -547,7 +547,7 @@ A collation type name that starts with "private-", for example, "private-kana",

 > 👉 **Note**: There is an on-line demonstration of collation at [[LocaleExplorer](tr35.md#LocaleExplorer)] that uses the same rule syntax. (Pick the locale and scroll to "Collation Rules", near the end.)

-> 👉 **Note**: In CLDR 23 and before, LDML collation files used an XML format. Starting with CLDR 24, the XML collation syntax is deprecated and no longer used. See the _[CLDR 23 version of this document](https://www.unicode.org/reports/tr35/tr35-31/tr35-collation.md#Collation_Tailorings)_ for details about the XML collation syntax.
+> 👉 **Note**: In CLDR 23 and before, LDML collation files used an XML format. Starting with CLDR 24, the XML collation syntax is deprecated and no longer used. See the _[CLDR 23 version of this document](https://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings)_ for details about the XML collation syntax. #### 3.1.1 Collation Type Fallback @@ -919,7 +919,7 @@ The reason that these are not settings is so that their contents can be arbitrar _Example:_ -The following is a simple example that combines portions of different tailorings for illustration. For more complete examples, see the actual locale data: [Japanese](https://github.com/unicode-org/cldr/tree/main/common/collation/ja.xml), [Chinese](https://github.com/unicode-org/cldr/tree/main/common/collation/zh.xml), [Swedish](https://github.com/unicode-org/cldr/tree/main/common/collation/sv.xml), and [German](https://github.com/unicode-org/cldr/tree/main/common/collation/de.xml) (type="phonebook") are particularly illustrative. +The following is a simple example that combines portions of different tailorings for illustration. For more complete examples, see the actual locale data: [Japanese](https://github.com/unicode-org/cldr/blob/main/common/collation/ja.xml), [Chinese](https://github.com/unicode-org/cldr/blob/main/common/collation/zh.xml), [Swedish](https://github.com/unicode-org/cldr/blob/main/common/collation/sv.xml), and [German](https://github.com/unicode-org/cldr/blob/main/common/collation/de.xml) (type="phonebook") are particularly illustrative. ```xml diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md index 0800bd26377..fcbf70e674a 100644 --- a/docs/ldml/tr35-general.md +++ b/docs/ldml/tr35-general.md @@ -2547,7 +2547,7 @@ Some examples for English data (v30) are given in the following table. 
| 🚴‍♀️ | woman biking | cyclist, woman, bicycle, biking | | 🚴🏿‍♀️ | woman biking: dark skin tone | cyclist, woman, bicycle, biking, dark skin tone | -For more information, see [Unicode Emoji](https://www.unicode.org/reports/tr51). +For more information, see [Unicode Emoji](https://www.unicode.org/reports/tr51/). ### 14.2 Annotations Character Labels @@ -2734,7 +2734,7 @@ Thus it bundles noun class categories such as gender and animacy into a single i | inanimate | In an animate/inanimate gender system, gender that denotes object or inanimate entities .| adapted from: [wikipedia.org/wiki/Grammatical_gender](https://en.wikipedia.org/wiki/Grammatical_gender), [linguistics-ontology.org/gold/2010/InanimateGender](http://linguistics-ontology.org/gold/2010/InanimateGender) | | personal | In an animate/inanimate gender system in some languages, gender that specifies the masculine gender of animate entities. | adapted from: [wikipedia.org/wiki/Grammatical_gender](https://en.wikipedia.org/wiki/Grammatical_gender), [linguistics-ontology.org/gold/2010/HumanGender](http://linguistics-ontology.org/gold/2010/HumanGender) | | common | In a common/neuter gender system, gender that denotes human entities. | adapted from: [wikipedia.org/wiki/Grammatical_gender](https://en.wikipedia.org/wiki/Grammatical_gender) | -| feminine | In a masculine/feminine or in a masculine/feminine/neuter gender system, gender that denotes specifically female persons (or animals) or that is assigned arbitrarily to object. | adapted from: https://wikipedia.org/wiki/Grammatical_gender, [linguistics-ontology.org/gold/2010/FeminineGender](http://linguistics-ontology.org/gold/2010/FeminineGender) | +| feminine | In a masculine/feminine or in a masculine/feminine/neuter gender system, gender that denotes specifically female persons (or animals) or that is assigned arbitrarily to object. 
| adapted from: https://en.wikipedia.org/wiki/Grammatical_gender, [linguistics-ontology.org/gold/2010/FeminineGender](http://linguistics-ontology.org/gold/2010/FeminineGender) | | masculine | In a masculine/feminine or in a masculine/feminine/neuter gender system, gender that denotes specifically male persons (or animals) or that is assigned arbitrarily to object. | adapted from: [wikipedia.org/wiki/Grammatical_gender](https://en.wikipedia.org/wiki/Grammatical_gender), [linguistics-ontology.org/gold/2010/MasculineGender](http://linguistics-ontology.org/gold/2010/MasculineGender) | | neuter | In a masculine/feminine/neuter or common/neuter gender system, gender that generally denotes an object. | adapted from: [wikipedia.org/wiki/Grammatical_gender](https://en.wikipedia.org/wiki/Grammatical_gender), [linguistics-ontology.org/gold/2010/NeuterGender](http://linguistics-ontology.org/gold/2010/NeuterGender) | diff --git a/docs/ldml/tr35-info.md b/docs/ldml/tr35-info.md index ba18088bf20..8ffefc9d880 100644 --- a/docs/ldml/tr35-info.md +++ b/docs/ldml/tr35-info.md @@ -388,7 +388,7 @@ The alphabetic codes are only provided where different from the type. For exampl Where there is no corresponding code, sometimes private use codes are used, such as the numeric code for XK. -The currencyCodes are mappings from three letter currency codes to numeric values (ISO 4217 [Current currency & funds code list](https://www.currency-iso.org/en/home/tables/table-a1.html)). The mapping currently covers only current codes and does not include historic currencies. For example: +The currencyCodes are mappings from three letter currency codes to numeric values (ISO 4217, see [Current currency & funds code list](https://www.six-group.com/en/products-services/financial-information/data-standards.html#scrollTo=maintenance-agency)). The mapping currently covers only current codes and does not include historic currencies. 
For example: ```xml diff --git a/docs/ldml/tr35-keyboards.md b/docs/ldml/tr35-keyboards.md index efe96cc85d7..124b62cd76c 100644 --- a/docs/ldml/tr35-keyboards.md +++ b/docs/ldml/tr35-keyboards.md @@ -1900,8 +1900,8 @@ Here is a list of the data sources used to generate the initial key map layouts: | Platform | Source | Notes | |----------|--------|-------| -| Android | Android 4.0 - Ice Cream Sandwich ([https://source.android.com/source/downloading.html](https://source.android.com/source/downloading.html)) | Parsed layout files located in packages/inputmethods/LatinIME/java/res | -| ChromeOS | XKB ([https://www.x.org/wiki/XKB](https://www.x.org/wiki/XKB)) | The ChromeOS represents a very small subset of the keyboards available from XKB. +| Android | Android 4.0 - Ice Cream Sandwich ([https://source.android.com/docs/setup/download/downloading](https://source.android.com/docs/setup/download/downloading)) | Parsed layout files located in packages/inputmethods/LatinIME/java/res | +| ChromeOS | XKB ([https://www.x.org/wiki/XKB/](https://www.x.org/wiki/XKB/)) | The ChromeOS represents a very small subset of the keyboards available from XKB. | Mac OSX | Ukelele bundled System Keyboards ([https://software.sil.org/ukelele/](https://software.sil.org/ukelele/)) | These layouts date from Mac OSX 10.4 and are therefore a bit outdated | | Windows | Generated .klc files from the [Microsoft Keyboard Layout Creator](https://www.microsoft.com/en-us/download/details.aspx?id=102134) | @@ -1919,10 +1919,10 @@ The following are the design principles for the ids. 1. Eg, `en-t-k0-extended`. 2. Use the minimal language id based on `likelySubtag`s. 1. Eg, instead of `en-US-t-k0-xxx`, use `en-t-k0-xxx`. Because there is ``, en-US → en. - 2. The data is in + 2. The data is in 3. The platform goes first, if it exists. If a keyboard on the platform changes over time, both are dated, eg `bg-t-k0-chromeos-2011`. When selecting, if there is no date, it means the latest one. 4. 
Keyboards are only tagged that differ from the "standard for each platform". That is, for each language on a platform, there will be a keyboard with no subtags other than the platform. Subtags with a common semantics across platforms are used, such as `-extended`, `-phonetic`, `-qwerty`, `-qwertz`, `-azerty`, … -5. In order to get to 8 letters, abbreviations are reused that are already in [bcp47](https://github.com/unicode-org/cldr/tree/main/common/bcp47/) -u/-t extensions and in [language-subtag-registry](https://www.iana.org/assignments/language-subtag-registry) variants, eg for Traditional use `-trad` or `-traditio` (both exist in [bcp47](https://github.com/unicode-org/cldr/tree/main/common/bcp47/)). +5. In order to get to 8 letters, abbreviations are reused that are already in [bcp47](https://github.com/unicode-org/cldr/blob/main/common/bcp47/) -u/-t extensions and in [language-subtag-registry](https://www.iana.org/assignments/language-subtag-registry) variants, eg for Traditional use `-trad` or `-traditio` (both exist in [bcp47](https://github.com/unicode-org/cldr/blob/main/common/bcp47/)). 6. Multiple languages cannot be indicated, so the predominant target is used. 1. For Finnish + Sami, use `fi-t-k0-smi` or `extended-smi` 7. In some cases, there are multiple subtags, like `en-US-t-k0-chromeos-intl-altgr.xml` diff --git a/docs/ldml/tr35-personNames.md b/docs/ldml/tr35-personNames.md index 8cdb835bfdc..c272e40472d 100644 --- a/docs/ldml/tr35-personNames.md +++ b/docs/ldml/tr35-personNames.md @@ -134,7 +134,7 @@ A Tech Preview API for formatting personal names is included in ICU. 
The impleme Logically, the model used for applying the CLDR data is the following: -![diagram showing relationship of components involved in person name formatting](images/personNamesFormattingModel.png) +![diagram showing relationship of components involved in person name formatting](images/personNamesFormatModel.png) Conceptually, CLDR person name formatting depends on data supplied by a PersonName Data Interface. That could be a very thin interface that simply accesses a database record, or it could be a more sophisticated interface that can modify the raw data before presenting it to be formatted. For example, based on the formatting locale a PersonName data interface could transliterate names that are in another script, or supply equivalent titles in different languages. diff --git a/docs/ldml/tr35.md b/docs/ldml/tr35.md index 060bcbf1343..6a6d1146916 100644 --- a/docs/ldml/tr35.md +++ b/docs/ldml/tr35.md @@ -6,7 +6,7 @@ - + @@ -191,8 +191,8 @@ The LDML specification is divided into the following parts: * [5. Canonicalizing Syntax](#5.-canonicalizing-syntax) * [Preprocessing](#preprocessing) * [Processing LanguageIds](#processing-languageids) -* [Processing LocaleIds](#processing-localeids) -* [Optimizations](#optimizations) + * [Processing LocaleIds](#processing-localeids) + * [Optimizations](#optimizations) * [References](#References) * [Acknowledgments](#Acknowledgments) * [Modifications](#Modifications) @@ -364,9 +364,9 @@ A [`unicode_locale_id`](#unicode_locale_id) has _canonical syntax_ when: For example, the canonical form of "en-u-foo-bar-nu-thai-ca-buddhist-kk-true" is "en-u-bar-foo-ca-buddhist-kk-nu-thai". The attributes `"foo"` and `"bar"` in this example are provided only for illustration; no attribute subtags are defined by the current CLDR specification. -NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in [Section 4.1](https://tools.ietf.org/search/bcp47#section-4.1) of BCP 47. 
Here are the considerations that lead to that decision: +NOTE: Some people may wonder why CLDR uses alphabetical order for variants, rather than the ordering in [Section 4.1](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1) of BCP 47. Here are the considerations that lead to that decision: * The ordering in Section 4.1 is recommended, but not required for conformance. In particular, use of and ordering by Prefix is recommended but not required. - * Moreover, [Section 4.5](https://tools.ietf.org/search/bcp47#section-4.5) states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.” + * Moreover, [Section 4.5](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.5) states that “If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation.” * The best practices for internationalization have moved well beyond some of the guidelines and recommendations in BCP 47, especially for language matching and language fallback. * Robust implementations will accept the variants in any order, just as they accept extensions in any order. * A canonical order allows for determination of identity of identifiers via string comparison. @@ -399,12 +399,12 @@ The equivalence relationship may change over time, such as when subtags are depr Unicode language and locale identifiers inherit the design and the repertoire of subtags from [[BCP47](#BCP47)] Language Tags. 
There are some extensions and restrictions made for the use of the Unicode locale identifier in CLDR: * It does not allow for the full syntax of [[BCP47](#BCP47)]: - * No extlang subtags are allowed (as in the BCP 47 canonical form, see BCP 47 [Section 4.5](https://tools.ietf.org/search/bcp47#section-4.5) and [Section 3.1.7](https://tools.ietf.org/search/bcp47#section-3.1.7)) + * No extlang subtags are allowed (as in the BCP 47 canonical form, see BCP 47 [Section 4.5](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.5) and [Section 3.1.7](https://www.rfc-editor.org/rfc/rfc5646.html#section-3.1.7)) * No irregular BCP 47 legacy language tags (marked as “Type: grandfathered” in BCP 47) are allowed (these are all deprecated in BCP 47) * A tag must not start with the subtag "x": thus a _privateuse_ (eg x-abc) can only be after a language subtag, like "und" * It allows for certain semantic additions and constraints: * Certain codes that are private-use in BCP 47 and ISO are given semantics by LDML - * Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see [Section 4.1.2](https://tools.ietf.org/search/bcp47#section-4.1.2)) + * Each macrolanguage has an identified primary encompassed language, which is treated as an alias for the macrolanguage, and thus is replaced when canonicalizing (as allowed by BCP 47, see [Section 4.1.2](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.1.2)) * It allows certain syntax for backwards compatibility (not BCP 47-compatible): * The "\_" character for field separator characters, as well as the "-" used in [[BCP47](#BCP47)] (however, the canonical form is with "-") * The subtag "root" to indicate the generic locale used as the parent of all languages in the CLDR data model ("und" can be used instead) @@ -474,7 +474,7 @@ _Examples:_ ##### Truncation -BCP 47 requires that implementations allow for 
language tags of at least 35 characters, in [Section 4.1.1](https://tools.ietf.org/search/bcp47#section-4.4.1). +BCP 47 requires that implementations allow for language tags of at least 35 characters, in [Section 4.1.1](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.4.1). To allow for use of extensions, CLDR extends that minimum to 255 for Unicode locale identifiers. Theoretically, a language tag could be far longer, due to the possibility of a large number of variants and extensions. In practice, the typical size of a locale or language identifier will be much smaller, so implementations can optimize for smaller sizes, as long as there is an escape mechanism allowing for up to 255. @@ -709,7 +709,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other
 Version 42 (draft)
 Editors Mark Davis (markdavis@google.com) and other CLDR committee members
-Date 2022-09-27
+Date 2022-10-05
 This Version https://www.unicode.org/reports/tr35/tr35-67/tr35.html
 Previous Version https://www.unicode.org/reports/tr35/tr35-66/tr35.html
 Latest Version https://www.unicode.org/reports/tr35/
- + @@ -729,13 +729,13 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + - + @@ -754,20 +754,20 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + - + - + @@ -776,7 +776,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -787,7 +787,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -799,7 +799,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -809,7 +809,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - + @@ -821,7 +821,7 @@ The BCP 47 form for keys and types is the canonical form, and recommended. Other - @@ -834,7 +834,7 @@ The determination of preferred units depends on the locale identifer: the keys m - @@ -845,7 +845,7 @@ The determination of preferred units depends on the locale identifer: the keys m - + @@ -878,7 +878,7 @@ The determination of preferred units depends on the locale identifer: the keys m - + @@ -886,7 +886,7 @@ The determination of preferred units depends on the locale identifer: the keys m - + @@ -894,7 +894,7 @@ The determination of preferred units depends on the locale identifer: the keys m

For more information, see Section 3.6.3 Time Zone Identifiers.

CLDR provides data for normalizing timezone codes.

- + @@ -908,9 +908,9 @@ Additional keys or types might be added in future versions. Implementations of L #### 3.6.2 Numbering System Data -LDML supports multiple numbering systems. The identifiers for those numbering systems are defined in the file **bcp47/number.xml**. For example, for the latest version of the data see [bcp47/number.xml](https://github.com/unicode-org/cldr/tree/main/common/bcp47/number.xml). +LDML supports multiple numbering systems. The identifiers for those numbering systems are defined in the file **bcp47/number.xml**. For example, for the latest version of the data see [bcp47/number.xml](https://github.com/unicode-org/cldr/blob/main/common/bcp47/number.xml). -Details about those numbering systems are defined in **supplemental/numberingSystems.xml**. For example, for the latest version of the data see [supplemental/numberingSystems.xml](https://github.com/unicode-org/cldr/tree/main/common/supplemental/numberingSystems.xml). +Details about those numbering systems are defined in **supplemental/numberingSystems.xml**. For example, for the latest version of the data see [supplemental/numberingSystems.xml](https://github.com/unicode-org/cldr/blob/main/common/supplemental/numberingSystems.xml). 
LDML makes certain stability guarantees on this data: @@ -1301,7 +1301,7 @@ Even though localization should be done as close to the end-user as possible, th #### 3.9.1 Message Formatting and Exceptions -Windows ([FormatMessage](https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-formatmessage), [String.Format](https://docs.microsoft.com/en-us/dotnet/api/system.string.format)), Java ([MessageFormat](https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html)) and ICU ([MessageFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classMessageFormat.html), [umsg](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/umsg_8h.html)) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues. +Windows ([FormatMessage](https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-formatmessage), [String.Format](https://learn.microsoft.com/en-us/dotnet/api/system.string.format?view=net-6.0)), Java ([MessageFormat](https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html)) and ICU ([MessageFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classMessageFormat.html), [umsg](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/umsg_8h.html)) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? 
It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues. There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the localization is finally at some point, the full information can be presented to the end-user. This is the best case for localization. @@ -1340,7 +1340,7 @@ Note that the language of locale data may differ from the language of localized #### 3.10.2 Hybrid Locale Identifiers -Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. These are commonly referred to with portmanteau words such as _Franglais, [​Spanglish](https://en.oxforddictionaries.com/definition/spanglish)_ or _Denglish_. Hybrid locales do not _not_ reference text simply containing two languages: a book of parallel text containing English and French, such as the following, is not Franglais: +Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. These are commonly referred to with portmanteau words such as _Franglais, [​Spanglish](https://en.wikipedia.org/wiki/Spanglish)_ or _Denglish_. Hybrid locales do not _not_ reference text simply containing two languages: a book of parallel text containing English and French, such as the following, is not Franglais:
key
(old key name)
key description
example type
(old type name)
type description
A Unicode Calendar Identifier defines a type of calendar. The valid values are those name attribute values in the type elements of key name="ca" in bcp47/calendar.xml.
A Unicode Calendar Identifier defines a type of calendar. The valid values are those name attribute values in the type elements of key name="ca" in bcp47/calendar.xml.
"ca"
(calendar)
Calendar algorithm

(For information on the calendar algorithms associated with the data used with these, see [Calendars].)
"buddhist"
…
Note: Some calendar types are represented by two subtags. In such cases, the first subtag specifies a generic calendar type and the second subtag specifies a calendar algorithm variant. The CLDR uses generic calendar types (single subtag types) for tagging data when calendar algorithm variations within a generic calendar type are irrelevant. For example, type "islamic" is used for specifying Islamic calendar formatting data for all Islamic calendar types, including "islamic-civil" and "islamic-umalqura".
A Unicode Currency Format Identifier defines a style for currency formatting. The valid values are those name attribute values in the type elements of key name="cf" in bcp47/currency.xml.
A Unicode Currency Format Identifier defines a style for currency formatting. The valid values are those name attribute values in the type elements of key name="cf" in bcp47/currency.xml.
"cf" Currency Format style "standard" Negative numbers use the minusSign symbol (the default).
"account" Negative numbers use parentheses or equivalent.
A Unicode Collation Identifier defines a type of collation (sort order). The valid values are those name attribute values in the type elements of bcp47/collation.xml.
A Unicode Collation Identifier defines a type of collation (sort order). The valid values are those name attribute values in the type elements of bcp47/collation.xml.
For information on each collation setting parameter, from ka to vt, see Setting Options
"co"
(collation)
Collation type
Special collation type for a modified string search in which a pattern consisting of a sequence of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This search collator is best used at SECONDARY strength with an "asymmetric" search as described in the [UCA] section Asymmetric Search and obtained, for example, using ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text (instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern will match any Hangul syllable in the searched text with the same initial consonant.
…
A Unicode Currency Identifier defines a type of currency. The valid values are those name attribute values in the type elements of key name="cu" in bcp47/currency.xml.
A Unicode Currency Identifier defines a type of currency. The valid values are those name attribute values in the type elements of key name="cu" in bcp47/currency.xml.
"cu"
(currency)
Currency type ISO 4217 code,

plus others in common use

Codes consisting of 3 ASCII letters that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. The list of countries and time periods associated with each currency value is available in Supplemental Currency Data, plus the default number of decimals.

The XXX code is given a broader interpretation as Unknown or Invalid Currency.

A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml.
A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml.
"dx" Dictionary break script exclusions unicode_script_subtag values

One or more items of type SCRIPT_CODE, which are valid unicode_script_subtag values.

The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified.

A Unicode Emoji Presentation Style Identifier specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <html lang="sr-Latn-u-em-emoji">. The valid values are those name attribute values in the type elements of key name="em" in bcp47/variant.xml.
A Unicode Emoji Presentation Style Identifier specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <html lang="sr-Latn-u-em-emoji">. The valid values are those name attribute values in the type elements of key name="em" in bcp47/variant.xml.
"em" Emoji presentation style "emoji"
Use a text presentation for emoji characters if possible.
"default" Use the default presentation for emoji characters as specified in UTR #51 Section 4, Presentation Style.
A Unicode First Day Identifier defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental week data (see Part 4 Dates, section 4.3 Week Data). The valid values are those name attribute values in the type elements of key name="fw" in bcp47/calendar.xml.
A Unicode First Day Identifier defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental week data (see Part 4 Dates, section 4.3 Week Data). The valid values are those name attribute values in the type elements of key name="fw" in bcp47/calendar.xml.
"fw" First day of week "sun"
"sat" Saturday
A Unicode Hour Cycle Identifier defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data (see Part 4 Dates, section 4.4 Time Data). The valid values are those name attribute values in the type elements of key name="hc" in bcp47/calendar.xml.
A Unicode Hour Cycle Identifier defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data (see Part 4 Dates, section 4.4 Time Data). The valid values are those name attribute values in the type elements of key name="hc" in bcp47/calendar.xml.
"hc" Hour cycle "h12"
"h24" Hour system using 1–24; corresponds to 'k' in pattern
A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml.
A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale’s default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml.
"lb" Line break style "strict"
"loose" CSS lev 3 line-break=loose
A Unicode Line Break Word Identifier defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option. The valid values are those name attribute values in the type elements of key name="lw" in bcp47/segmentation.xml.
"lw" Line break word handling "normal"
"phrase" Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline
A Unicode Measurement System Identifier defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data (see Part 2 General, section 5 Measurement System Data). The valid values are those name attribute values in the type elements of key name="ms" in bcp47/measure.xml. The determination of preferred units depends on the locale identifier: the keys ms, mu, rg, the base locale (language, script, region) and the user preferences. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences.
"uksystem" UK System of measurement: feet, pints, etc.; pints are 20oz
A Measurement Unit Preference Override defines an override for measurement unit preference. The valid values are those name attribute values in the type elements of key name="mu" in bcp47/measure.xml. For information about preferred units and unit conversion, see Unit Conversion and Unit Preferences.
"mu" Measurement unit override
"fahrenhe" Fahrenheit as temperature unit
A Unicode Number System Identifier defines a type of number system. The valid values are those name attribute values in the type elements of bcp47/number.xml.
"nu"
(numbers)
Numbering system Unicode script subtag
A unicode_subdivision_id, which is a unicode_region_subtag concatenated with a unicode_subdivision_suffix.
For example, gbsct is “gb”+“sct” (where sct represents the subdivision code for Scotland). Thus “en-GB-u-sd-gbsct” represents the language variant “English as used in Scotland”. And both “en-u-sd-usca” and “en-US-u-sd-usca” represent “English as used in California”. See 3.6.5 Subdivision Codes.
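As a rough illustration of the identifiers described above, the following Python sketch extracts the `-u-` extension keywords from a locale identifier and splits an `sd` value into its region and suffix parts. The function names are hypothetical, and the subtag shapes it assumes (two-character keys, a two-letter or three-digit region followed by a suffix of one to four alphanumerics) are simplifications that should be checked against the formal syntax. It does not validate values against the bcp47 data files.

```python
import re

# Assumed shape of a unicode_subdivision_id: a region subtag (2 letters or
# 3 digits) followed by a suffix of 1 to 4 alphanumerics.
_SUBDIVISION_ID = re.compile(r"^([a-z]{2}|[0-9]{3})([a-z0-9]{1,4})$")


def parse_u_extension(locale_id: str) -> dict[str, str]:
    """Return the -u- extension keywords of a locale identifier (hypothetical helper)."""
    subtags = locale_id.replace("_", "-").lower().split("-")
    keywords: dict[str, str] = {}
    if "u" not in subtags[1:]:
        return keywords
    start = subtags.index("u", 1) + 1
    key = None
    values: list[str] = []
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            # Another singleton ends the -u- extension.
            break
        if len(subtag) == 2:
            # Two-character subtags are treated as keys.
            if key is not None:
                keywords[key] = "-".join(values)
            key, values = subtag, []
        elif key is not None:
            values.append(subtag)
        # Longer subtags before the first key are attributes; ignored here.
    if key is not None:
        keywords[key] = "-".join(values)
    return keywords


def split_subdivision_id(value: str) -> tuple[str, str]:
    """Split a subdivision id such as 'gbsct' into (region, suffix)."""
    match = _SUBDIVISION_ID.match(value.lower())
    if match is None:
        raise ValueError(f"not a subdivision id: {value!r}")
    return match.group(1), match.group(2)


keywords = parse_u_extension("en-GB-u-sd-gbsct")    # {'sd': 'gbsct'}
region, suffix = split_subdivision_id(keywords["sd"])  # ('gb', 'sct')
```

A key that appears without a value is stored with an empty string in this sketch; how such keys should be interpreted is left to the specification.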
…
A Unicode Sentence Break Suppressions Identifier defines a set of data to be used for suppressing certain sentence breaks that would otherwise be found by UAX #29 rules. The valid values are those name attribute values in the type elements of key name="ss" in bcp47/segmentation.xml.
"ss" Sentence break suppressions "none"
"standard" Use sentence break suppressions data of type "standard"
A Unicode Timezone Identifier defines a timezone. The valid values are those name attribute values in the type elements of bcp47/timezone.xml.
"tz"
(timezone)
Time zone Unicode short time zone IDs
A Unicode Variant Identifier defines a special variant used for locales. The valid values are those name attribute values in the type elements of bcp47/variant.xml.
"va" Common variant type "posix"
@@ -1395,7 +1395,7 @@ Should there ever be strong need for hybrids of more than two languages or for o ```
-The directory [common/validity](https://github.com/unicode-org/cldr/tree/main/common/validity/) contains machine-readable data for validating the language, region, script, and variant subtags, as well as currency, subdivisions and measure units. Each file contains a number of subtags with the following **idStatus** values:
+The directory [common/validity](https://github.com/unicode-org/cldr/blob/main/common/validity/) contains machine-readable data for validating the language, region, script, and variant subtags, as well as currency, subdivisions and measure units. Each file contains a number of subtags with the following **idStatus** values:
 * **regular** — the standard codes used for the specific type of subtag
 * **special** — certain exceptional language codes like 'mul' _(languages only)_
@@ -3435,7 +3435,7 @@ The `languageAlias`, `scriptAlias`, `territoryAlias`, and `variantAlias` element
 ### LocaleId Definitions
-#### 1. Multimap interpretation
+#### 1. Multimap interpretation
 Interpret each languageId as a multimap from a _fieldId_ (language, script, region, variants) to a **sorted set** of field values.
@@ -3451,7 +3451,7 @@ _Examples:_
 * “und” is a special language code that is treated as an empty set.
 * Of course, only the Variants can contain more than one item: the others are either empty or contain exactly 1 item.
-#### 2. Alias elements
+#### 2. Alias elements
 For the `languageAlias` elements, the _type_ and _replacements_ are languageIds.
@@ -3469,7 +3469,7 @@ is interpreted as:
 Note that for the case of territoryAlias, there may be multiple replacement values separated by spaces in the text (such as replacement="und-CW und-SX und-BQ"); other rules only ever have a single replacement value.
-#### 3. Matches
+#### 3. Matches
 A rule matches a source if and only if, for all fields, each _source_ field ⊇ _type_ field.
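The multimap interpretation and the matching rule quoted in the hunks above can be sketched in Python roughly as follows. This is only an illustration under simplifying assumptions: the names `to_multimap` and `rule_matches` are hypothetical, subtags are classified purely by shape (four letters as script, two letters or three digits as region, anything else as a variant), and extensions are not handled.

```python
FIELDS = ("language", "script", "region", "variants")


def to_multimap(language_id: str) -> dict[str, set[str]]:
    """Interpret a languageId as a map from field name to a set of values."""
    subtags = language_id.replace("_", "-").lower().split("-")
    fields: dict[str, set[str]] = {name: set() for name in FIELDS}
    # "und" is treated as an empty language set.
    if subtags[0] != "und":
        fields["language"].add(subtags[0])
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            fields["script"].add(subtag)
        elif (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        ):
            fields["region"].add(subtag)
        else:
            fields["variants"].add(subtag)
    return fields


def rule_matches(source: str, rule_type: str) -> bool:
    """A rule matches when every source field is a superset of the type field."""
    src = to_multimap(source)
    typ = to_multimap(rule_type)
    return all(src[name] >= typ[name] for name in FIELDS)


# Illustrative inputs, not taken from the CLDR data files.
matches = rule_matches("ja-Latn-fonipa-hepburn", "und-hepburn")   # True
no_match = rule_matches("ja-Latn-fonipa", "und-hepburn")          # False
```

Because the fields are sets, the order of variants in the source has no effect on whether a rule matches.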
@@ -3493,7 +3493,7 @@ so the rule matches the source. (Note that order of variants is immaterial to ma
 so the rule does not match the source.
-#### 4. Replacement
+#### 4. Replacement
 A matching rule can be used to transform the source fields as follows
@@ -3512,11 +3512,11 @@ _Example:_
 >
 > result="ja-Latn-alalc97-fonipa" // note that CLDR canonical order of variants is alphabetical
-##### Territory Exception
+##### Territory Exception
 If the field = territory, and the replacement.field has more than one value, then look up the most likely territory for the base language code (and script, if there is one). If that likely territory is in the list of replacements, use it. Otherwise, use the first territory in the list.
-#### 5. Canonicalizing Syntax
+#### 5. Canonicalizing Syntax
 To canonicalize the syntax of _source_:
@@ -3536,7 +3536,7 @@ To canonicalize the syntax of _source_:
 * Separator
   * Replace '\_' by '-'
-### Preprocessing
+### Preprocessing
 The data from supplementalMetadata is (logically) preprocessed as follows.
@@ -3573,14 +3573,14 @@ So using the examples above, we get the following order:
 | {R={CA}} | 1 | n/a | |
-### Processing LanguageIds
+### Processing LanguageIds
 To canonicalize a given _source_:
 1. Canonicalize the syntax of _source_ as per _Definition 5. Canonicalizing Syntax_.
 2. Where the _source_ could be an arbitrary BCP 47 language tag, first process as follows:
    1. If the source is identical to one of the types in the BCP47 LegacyRules, replace the entire source by the replacement value.
-   2. Else if there is an extlang subtag, then apply Step 3 of BCP 47 [Section 4.5](https://tools.ietf.org/search/bcp47#section-4.5) to remove the extlang subtag (possibly adjusting the language subtag).
+   2. Else if there is an extlang subtag, then apply Step 3 of BCP 47 [Section 4.5](https://www.rfc-editor.org/rfc/rfc5646.html#section-4.5) to remove the extlang subtag (possibly adjusting the language subtag).
       1. Don’t apply any of the other canonicalization steps in that section, however.
    3. Else if the first subtag is "x", prefix by "und-".
    4. **Note:** there are currently no valid 4-letter primary language subtags. While it is extremely unlikely that BCP 47 would ever register them, if so then _languageAlias_ mappings will be supplied for them, mapping to defined CLDR language subtags (from the `idStatus="reserved"` set).
@@ -3589,7 +3589,7 @@ To canonicalize a given _source_:
 4. Transform _source_ according to that rule
 5. loop (goto #3)
-## Processing LocaleIds
+### Processing LocaleIds
 The canonicalization of localeIds is done by first canonicalizing the languageId portion, then handling extensions in the following way:
@@ -3605,7 +3605,7 @@ The canonicalization of localeIds is done by first canonicalizing the languageId
 2. We get the following transformation: `en-u-rg-fi01 ⇒ en-u-rg-axzzzz`
-## Optimizations
+### Optimizations
 The above algorithm is a logical statement of the process, but would obviously not be directly suited to production code. Production-level code can use many optimizations for efficiency while achieving the same result. For example, the Alias Rules can be further preprocessed to avoid indefinite looping, instead doing a rule lookup once per subtag. As another example, the small number of **Territory Exceptions** can be preprocessed to avoid the likely subtags processing.
@@ -3631,9 +3631,9 @@ The above algorithm is a logical statement of the process, but would obviously n
 | [BCP47] | [https://www.rfc-editor.org/rfc/bcp/bcp47.txt](https://www.rfc-editor.org/rfc/bcp/bcp47.txt)<br/>The Registry<br/>[https://www.iana.org/assignments/language-subtag-registry](https://www.iana.org/assignments/language-subtag-registry) |
 | [ISO639] | ISO Language Codes<br/>[https://www.loc.gov/standards/iso639-2/](https://www.loc.gov/standards/iso639-2/)<br/>Actual List<br/>[https://www.loc.gov/standards/iso639-2/langcodes.html](https://www.loc.gov/standards/iso639-2/langcodes.html) |
 | [ISO1000] | ISO 1000: SI units and recommendations for the use of their multiples and of certain other units, International Organization for Standardization, 1992.<br/>[https://www.iso.org/iso/catalogue_detail?csnumber=5448](https://www.iso.org/iso/catalogue_detail?csnumber=5448) |
-| [ISO3166] | ISO Region Codes<br/>[https://www.iso.org/iso/country_codes](https://www.iso.org/iso/country_codes)<br/>Actual List<br/>[https://www.iso.org/obp/ui/#search](https://www.iso.org/obp/ui/#search) |
-| [ISO4217] | ISO Currency Codes<br/>[https://www.iso.org/iso/home/standards/currency_codes.htm](https://www.iso.org/iso/home/standards/currency_codes.htm)<br/>_(Note that as of this point, there are significant problems with this list. The supplemental data file contains the best compendium of currency information available.)_ |
-| [ISO8601] | ISO Date and Time Format<br/>[https://www.iso.org/iso/iso8601](https://www.iso.org/iso/iso8601) |
+| [ISO3166] | ISO Region Codes<br/>[https://www.iso.org/iso-3166-country-codes.html](https://www.iso.org/iso-3166-country-codes.html)<br/>Actual List<br/>[https://www.iso.org/obp/ui/#search](https://www.iso.org/obp/ui/#search) |
+| [ISO4217] | ISO Currency Codes<br/>[https://www.iso.org/iso-4217-currency-codes.html](https://www.iso.org/iso-4217-currency-codes.html)<br/>_(Note that as of this point, there are significant problems with this list. The supplemental data file contains the best compendium of currency information available.)_ |
+| [ISO8601] | ISO Date and Time Format<br/>[https://www.iso.org/iso-8601-date-and-time-format.html](https://www.iso.org/iso-8601-date-and-time-format.html) |
 | [ISO15924] | ISO Script Codes<br/>[https://www.unicode.org/iso15924/index.html](https://www.unicode.org/iso15924/index.html)<br/>Actual List<br/>[https://www.unicode.org/iso15924/codelists.html](https://www.unicode.org/iso15924/codelists.html) |
 | [LOCODE] | United Nations Code for Trade and Transport Locations, commonly known as "UN/LOCODE"<br/>[https://unece.org/trade/uncefact/unlocode](https://unece.org/trade/uncefact/unlocode)<br/>Download at: [https://unece.org/trade/cefact/UNLOCODE-Download](https://unece.org/trade/cefact/UNLOCODE-Download) |
 | [RFC6067] | BCP 47 Extension U<br/>[https://www.ietf.org/rfc/rfc6067.txt](https://www.ietf.org/rfc/rfc6067.txt) |
@@ -3655,10 +3655,10 @@ The above algorithm is a logical statement of the process, but would obviously n
 | [LocaleProject] | Common Locale Data Repository Project<br/>[https://cldr.unicode.org](https://cldr.unicode.org) |
 | [NamingGuideline] | OpenI18N Locale Naming Guideline<br/>formerly at https://www.openi18n.org/docs/text/LocNameGuide-V10.txt |
 | [RBNF] | Rule-Based Number Format<br/>[https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1RuleBasedNumberFormat.html](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1RuleBasedNumberFormat.html) |
-| [RBBI] | Rule-Based Break Iterator<br/>[https://unicode-org.github.io/icu/userguide/boundaryanalysis](https://unicode-org.github.io/icu/userguide/boundaryanalysis) |
+| [RBBI] | Rule-Based Break Iterator<br/>[https://unicode-org.github.io/icu/userguide/boundaryanalysis/](https://unicode-org.github.io/icu/userguide/boundaryanalysis/) |
 | [UCAChart] | Collation Chart<br/>[https://www.unicode.org/charts/collation/](https://www.unicode.org/charts/collation/) |
-| [UTCInfo] | NIST Time and Frequency Division Home Page<br/>[https://tf.nist.gov/](https://tf.nist.gov/)<br/>U.S. Naval Observatory: What is Universal Time? |
-| [WindowsCulture] | Windows Culture Info (with mappings from [[BCP47](#BCP47)]-style codes to LCIDs)<br/>[https://docs.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo](https://docs.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo) |
+| [UTCInfo] | NIST Time and Frequency Division Home Page<br/>[https://www.nist.gov/pml/time-and-frequency-division](https://www.nist.gov/pml/time-and-frequency-division)<br/>U.S. Naval Observatory: What is Universal Time? |
+| [WindowsCulture] | Windows Culture Info (with mappings from [[BCP47](#BCP47)]-style codes to LCIDs)<br/>[https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo?view=net-6.0](https://learn.microsoft.com/en-us/dotnet/api/system.globalization.cultureinfo?view=net-6.0) |
 ## Acknowledgments