CLDR-18187 Add complex segmentation to scriptMetadata.txt #4262
base: main
Structure looks good. The file is actually created using a tool as described on https://cldr.unicode.org/development/updating-codes/updating-script-metadata.
I'll add you to the writers on the sheet, and the PR will also need to modify GenerateScriptMetadata.java*
*When we do that, we need to add to the header of the .txt file that it is generated according to https://cldr.unicode.org/development/updating-codes/updating-script-metadata
100% on the value of the data. Is it time to move this to an XML document, though, perhaps in supplemental? We could still output it as a .txt for release. |
OK, I added it to the Java file and the Google Sheets. While doing this, I wondered: is this data meaningfully different from the "LB letters" column? CC @markusicu |
We could potentially add a third value to the enumeration in the LB Letters column to distinguish scripts like Thai, which need a dictionary for word and line segmentation, from Han, which needs a dictionary for only word segmentation. |
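A minimal sketch of what a three-valued column could look like to a consumer, assuming a hypothetical enum `DictionaryNeed` and a hand-written sample map. None of these names exist in CLDR; the Thai and Hani values follow the comment above, and the Latn row is only an illustrative "no dictionary" case.

```java
import java.util.Map;

class LbLettersEnumSketch {
    /** Hypothetical three-valued replacement for a yes/no column. */
    enum DictionaryNeed {
        NONE,          // no dictionary needed for segmentation
        WORD_ONLY,     // dictionary needed for word segmentation only
        WORD_AND_LINE  // dictionary needed for word and line segmentation
    }

    // Illustrative sample data, not taken from scriptMetadata.txt.
    static final Map<String, DictionaryNeed> SAMPLE = Map.of(
            "Thai", DictionaryNeed.WORD_AND_LINE,
            "Hani", DictionaryNeed.WORD_ONLY,
            "Latn", DictionaryNeed.NONE);

    /** True only when line segmentation needs a dictionary. */
    static boolean needsLineDictionary(String script) {
        return SAMPLE.get(script) == DictionaryNeed.WORD_AND_LINE;
    }

    /** True when word segmentation needs a dictionary. */
    static boolean needsWordDictionary(String script) {
        DictionaryNeed need = SAMPLE.get(script);
        return need == DictionaryNeed.WORD_ONLY || need == DictionaryNeed.WORD_AND_LINE;
    }

    public static void main(String[] args) {
        for (String script : new String[] {"Thai", "Hani", "Latn"}) {
            System.out.println(script
                    + " line=" + needsLineDictionary(script)
                    + " word=" + needsWordDictionary(script));
        }
    }
}
```

With this shape, any caller that previously treated the column as a boolean has to decide how to map the new middle value, which is the compatibility cost of changing the enumeration.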
Idea: Consider changing LBLetters(Hani) to "No" but adding WBLetters and making that "Yes" for Hani. |
Good idea! |
I think Shane's idea is a bit simpler. The question is whether we know of
any APIs that reflect the value as a boolean; when they read the data they
would need to make a code change.
On Wed, Jan 8, 2025 at 4:10 PM Markus Scherer wrote:
> Idea: Consider changing LBLetters(Hani) to "No" but adding WBLetters and
> making that "Yes" for Hani.
|
I kind-of like a new column because (1) it doesn't break users of the old column and (2) it would potentially allow for scripts that need special rules for line break but not for word break (say, line break allowed on syllable boundaries). |
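The two points above can be sketched with a row type that keeps the existing boolean and adds an independent second one. This is only an illustration: the `Row` class, the field names, and the sample values are hypothetical, the Hani values follow the earlier suggestion in this thread, and `Qaaa` is an ISO 15924 private-use code standing in for a script that would need special line-break rules but not word-break rules.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentationColumnsSketch {
    /** Hypothetical row: the existing column plus a new, independent column. */
    static final class Row {
        final String script;
        final boolean lbLetters; // existing column, still a boolean
        final boolean wbLetters; // hypothetical new column

        Row(String script, boolean lbLetters, boolean wbLetters) {
            this.script = script;
            this.lbLetters = lbLetters;
            this.wbLetters = wbLetters;
        }
    }

    // Illustrative sample data, not taken from scriptMetadata.txt.
    static final List<Row> SAMPLE = List.of(
            new Row("Thai", true, true),
            new Row("Hani", false, true),
            new Row("Qaaa", true, false)); // placeholder for a line-only case

    /** A consumer written against the old column alone needs no code change. */
    static List<String> legacyLbScripts(List<Row> rows) {
        List<String> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.lbLetters) {
                result.add(row.script);
            }
        }
        return result;
    }

    /** The combination a single three-valued column could not express. */
    static List<String> lineOnlyScripts(List<Row> rows) {
        List<String> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.lbLetters && !row.wbLetters) {
                result.add(row.script);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("legacy LB view: " + legacyLbScripts(SAMPLE));
        System.out.println("line-only:      " + lineOnlyScripts(SAMPLE));
    }
}
```

Two independent booleans cover all four combinations, whereas a three-valued enumeration has to leave one of them out.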
How is that derivation actually done? Depending on how you interpret "between letters", the values in this file look wrong (or at the very least inconsistent) for all but one of the scripts that use the Brahmic style of line breaking (see https://www.unicode.org/reports/tr14/#BreakOpportunities).
|
Bali, Java, Hatr, and Elym have comments in the spreadsheet saying that they might be wrong. But, if we go by that description of the column, I would expect Thai to be "NO" because Thai should have line-breaks at word boundaries. I've seen bugs before where the break engine found breaks in the middle of words and it was wrong. |
Shane: The description of LB letters doesn't reference *word breaks* at
all. It is just a question of whether you can get line breaks between two
characters XY, where X and Y are letters of that script.
Robin: The spreadsheet data for that column isn't derived, and probably
predates https://www.unicode.org/reports/tr14/#LB28a. Ideally the data
would be maintained in the UCD, but the UTC didn't want to have script
metadata when the subject was raised (ages ago). If it were, we could have
invariant tests for that.
On Wed, Jan 8, 2025 at 4:46 PM Shane F. Carr wrote:
> Bali, Java, Hatr, and Elym have comments in the spreadsheet
> <https://docs.google.com/spreadsheets/d/1Y90M0Ie3MUJ6UVCRDOypOtijlMDLNNyyLk36T6iMu0o/edit?gid=0#gid=0>
> saying that they might be wrong.
> But, if we go by that description of the column, I would expect Thai to be
> "NO" because Thai should have line-breaks at word boundaries. I've seen
> bugs before where the break engine found breaks in the middle of words and
> it was wrong.
|
CLDR-18187
CC @eggrobin @makotokato @Manishearth
See ticket for details. The issue was discussed in multiple CLDR Design WG meetings, but this specific solution was not.
ALLOW_MANY_COMMITS=true