
Corner cases arising from Big5 encoder not excluding HKSCS codes with lead bytes 0xFA–FE #252

Open
harjitmoe opened this issue Feb 15, 2021 · 6 comments
Labels
i18n-clreq Notifies Chinese script experts of relevant issues i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@harjitmoe (Contributor)

https://encoding.spec.whatwg.org/commit-snapshots/4d54adce6a871cb03af3a919cbf644a43c22301a/#visualization

Let index be index Big5 excluding all entries whose pointer is less than (0xA1 - 0x81) × 157.

Avoid returning Hong Kong Supplementary Character Set extensions literally.

As became apparent in my attempts to chart different Big5 and CNS 11643 variants: if the intention is to make the encoder purely Big5-ETEN, excluding all further extensions that Big5-HKSCS adds on top of it, then lead bytes 0xFA–FE need to be excluded, not just 0x81–A0.
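As a sketch of what that fuller exclusion could look like (hypothetical code, not spec text; `eten_only` is an illustrative name), the encoder's index restriction would filter pointers on both sides of the Big5-ETEN lead byte range 0xA1–F9, with the thresholds following from the spec's pointer formula:

```python
# Hypothetical sketch: restrict the encoder index to lead bytes 0xA1-0xF9
# by excluding pointers both below and above that range.
LOW = (0xA1 - 0x81) * 157   # 5024: current exclusion (lead bytes 0x81-0xA0)
HIGH = (0xFA - 0x81) * 157  # 18997: first pointer with lead byte 0xFA

def eten_only(pointers):
    """Keep only pointers whose lead byte falls in 0xA1-0xF9."""
    return [p for p in pointers if LOW <= p < HIGH]

print(eten_only([4537, 10000, 19162]))  # only 10000 survives
```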

The only-partial exclusion of HKSCS in the encoder defined by the current standard creates some truly bizarre corner cases in how it interacts with index-big5's inclusion of the duplicate mappings inherited from GCCS (which many HKSCS-equipped Big5 codecs, e.g. Python's big5-hkscs, do not accept).  Some of these duplicate other GCCS/HKSCS codes, rather than standard Big5 codes.  In four cases, one of these GCCS duplicates has a lead byte in 0xFA–FE, while its standard HKSCS code has a lead byte in 0x81–A0.  Hence, the WHATWG-described behaviour ends up decoding them from both, but encoding them to their GCCS duplicates, as follows.

0x9DEF → 嘅 U+5605 ↔ 0xFB48
0x9DFB → 廐 U+5ED0 ↔ 0xFBF9
0xA0DC → 悤 U+60A4 ↔ 0xFC6C
0x9975 → 猪 U+732A ↔ 0xFE52
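The asymmetry can be checked directly from the spec's pointer arithmetic. The helper below is a sketch (`bytes_to_pointer` is not a spec name); it converts a two-byte Big5 code to its pointer and applies the encoder's current exclusion, showing that the HKSCS code 0x9DEF for U+5605 falls below the (0xA1 − 0x81) × 157 = 5024 threshold while its GCCS duplicate 0xFB48 does not:

```python
# Sketch: Big5 byte pair -> pointer, per the spec's decoder arithmetic
# (trail bytes 0x40-0x7E use offset 0x40, trail bytes 0xA1-0xFE use 0x62).
def bytes_to_pointer(lead, trail):
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)

THRESHOLD = (0xA1 - 0x81) * 157  # 5024: the encoder's current cut-off

for lead, trail in ((0x9D, 0xEF), (0xFB, 0x48)):  # both map to U+5605
    pointer = bytes_to_pointer(lead, trail)
    status = "excluded" if pointer < THRESHOLD else "kept"
    print(f"0x{lead:02X}{trail:02X} -> pointer {pointer} ({status})")
```

So only the pointer for 0xFB48 survives the filter, and the encoder emits the GCCS duplicate.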

Accepting these GCCS duplicates is probably fine, but generating them (when not even all HKSCS-equipped implementations will accept them) is probably inappropriate, even assuming (for the sake of argument) that the encoder's current halfway house between Big5-ETEN and Big5-HKSCS was deliberately chosen.

@annevk annevk added i18n-clreq Notifies Chinese script experts of relevant issues i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. labels Feb 16, 2021
@annevk (Member) commented Feb 16, 2021

Thank you for reporting this!

@foolip @hsivonen @ricea thoughts? Assuming this is correct, while there is some risk in changing the encoder, it's usually fairly minimal, right?

@hsivonen (Member)

Seems minimal-risk, yes. Indeed, the range above the original Big5 range has previously been mentioned as questionable to include in the encoder while the range below is excluded.

@ricea (Collaborator) commented Feb 16, 2021

I'm confused since according to the note:

There are other duplicate code points, but for those the first pointer is to be used.

we should not be returning the 0xFxxx duplicates anyway. What am I misunderstanding?

@annevk (Member) commented Feb 16, 2021

Step 1 of https://encoding.spec.whatwg.org/#index-big5-pointer excludes some of them.

@ricea (Collaborator) commented Feb 16, 2021

Okay, I get it now. This change seems reasonable I think, but it won't be a high priority for Chrome.

@annevk (Member) commented Feb 16, 2021

When I run

import json

data = json.load(open("indexes.json", "r"))
big5 = data["big5"]

# Collect every pointer in index-big5 that maps to each code point.
code_points = {}
for pointer, code_point in enumerate(big5):
    if code_point is not None:
        code_points.setdefault(code_point, []).append(pointer)

# Report duplicated code points and whether the encoder's current
# "pointer < (0xA1 - 0x81) * 157 = 5024" step excludes their pointers.
for code_point, pointers in code_points.items():
    if len(pointers) > 1: # It's either 1 or 2
        excluded = "no"
        if pointers[0] < 5024 and pointers[1] < 5024:
            excluded = "yes"
        elif pointers[0] < 5024 or pointers[1] < 5024:
            excluded = "partial"

        print("U+" + hex(code_point).upper()[2:], pointers, excluded)

it seems there are many other duplicate pointers we probably want to keep excluding? If so, the fix here would likely be to special-case the code points listed in the OP.
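That special-casing might look like the sketch below (hypothetical; `encoder_pointer` and the set name are illustrative, not spec text): the encoder would additionally skip the high GCCS duplicate pointer for just the four code points from the OP, leaving all other duplicates' handling untouched.

```python
# The four code points from the OP whose GCCS duplicates (lead 0xFA-0xFE)
# would otherwise win once step 1 excludes their 0x81-0xA0 HKSCS codes.
GCCS_DUPLICATES = {0x5605, 0x5ED0, 0x60A4, 0x732A}

LOW = (0xA1 - 0x81) * 157   # 5024
HIGH = (0xFA - 0x81) * 157  # 18997

def encoder_pointer(index_big5, code_point):
    """First usable pointer for code_point, or None (hypothetical sketch)."""
    for pointer, cp in enumerate(index_big5):
        if cp != code_point:
            continue
        if pointer < LOW:
            continue  # existing step 1 exclusion (lead bytes 0x81-0xA0)
        if code_point in GCCS_DUPLICATES and pointer >= HIGH:
            continue  # proposed special case: skip the GCCS duplicate too
        return pointer
    return None
```

With both of their pointers excluded, the four code points would simply fail to encode, matching Big5-ETEN behaviour, while everything else keeps its first surviving pointer as today.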
