@Jules-Bertholet
Contributor

UAX 14:

As originally defined, the line break class AI contained all characters with East_Asian_Width value A (ambiguous width) that would otherwise be AL in this classification. For more information on East_Asian_Width and how to resolve it, see Unicode Standard Annex #11, East Asian Width [UAX11].

The original definition included many Latin, Greek, and Cyrillic characters. These characters are now classified by default as AL because use of the AL line breaking class better corresponds to modern practice. Where strict compatibility with older legacy implementations is desired, some of these characters need to be treated as ID in certain contexts. This can be done by always tailoring them to ID or by continuing to classify them as AI and resolving them to ID where required.
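The two tailoring options in the paragraph above can be sketched in Rust, the language of the unicode-rs crates. This is only an illustration, not the crate's API: the `LineBreakClass` subset, the `AmbiguousResolution` switch, and `resolve_ambiguous` are hypothetical names. The one fact taken from UAX #14 is that rule LB1 resolves AI to AL by default.

```rust
/// A small subset of the UAX #14 line break classes, enough for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LineBreakClass {
    AI, // ambiguous (alphabetic or ideographic)
    AL, // alphabetic
    ID, // ideographic
}

/// Hypothetical tailoring switch (an assumption, not a real crate type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AmbiguousResolution {
    /// Default behaviour of rule LB1: AI is treated as AL.
    Alphabetic,
    /// Legacy East Asian behaviour: AI is treated as ID.
    Ideographic,
}

/// Resolve an AI class according to the chosen tailoring.
/// Every other class is passed through unchanged.
fn resolve_ambiguous(class: LineBreakClass, mode: AmbiguousResolution) -> LineBreakClass {
    match (class, mode) {
        (LineBreakClass::AI, AmbiguousResolution::Alphabetic) => LineBreakClass::AL,
        (LineBreakClass::AI, AmbiguousResolution::Ideographic) => LineBreakClass::ID,
        (other, _) => other,
    }
}
```

Keeping the characters classified as AI and resolving late, as here, lets one data table serve both behaviours. Tailoring them to ID up front would bake the choice into the table itself.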

As part of the same revision, the set of ambiguous characters has been extended to completely encompass the enclosed alphanumeric characters used for numbering of bullets.

As updated, the AI line breaking class includes all characters with East Asian Width A that are outside the range U+0000..U+1FFF, plus the following characters:

24EA CIRCLED DIGIT ZERO
2780..2793 DINGBAT CIRCLED SANS-SERIF DIGIT ONE..DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN
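Read literally, the updated definition above is a simple predicate. The Rust sketch below is only an illustration: `is_ai_updated` is a hypothetical name, and the East_Asian_Width lookup is assumed to be done by the caller and passed in as a flag, because no width table is included here. The Unicode data files remain authoritative for actual class assignments, since other line break classes may take precedence for individual characters.

```rust
/// Returns true if `c` would be in class AI under a literal reading of the
/// updated definition quoted above.
///
/// `ea_width_ambiguous` is supplied by the caller and should be true when the
/// character has East_Asian_Width=A (assumption: this lookup lives elsewhere).
fn is_ai_updated(c: char, ea_width_ambiguous: bool) -> bool {
    let cp = c as u32;

    // Characters listed explicitly in the definition.
    let explicitly_listed = cp == 0x24EA || (0x2780..=0x2793).contains(&cp);

    // "Outside the range U+0000..U+1FFF" means any code point above U+1FFF.
    let outside_low_range = cp > 0x1FFF;

    explicitly_listed || (ea_width_ambiguous && outside_low_range)
}
```

For example, `is_ai_updated('\u{24EA}', false)` is true because U+24EA is listed explicitly. A character below U+2000, such as a Greek letter, yields false even when the caller reports its width as ambiguous.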

@Manishearth Manishearth merged commit e77b292 into unicode-rs:master Nov 1, 2024
2 checks passed
@Jules-Bertholet Jules-Bertholet deleted the ambiguous-line-break branch November 1, 2024 16:27