0xA3 0xA0 in GB 18030 #338

Open
xfq opened this issue Nov 13, 2024 · 9 comments
Labels
i18n-clreq Notifies Chinese script experts of relevant issues i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@xfq

xfq commented Nov 13, 2024

What is the issue with the Encoding Standard?

https://encoding.spec.whatwg.org/#gb18030-encoder

Index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 for compatibility with deployed content. Therefore it cannot roundtrip.

We didn't update this in #336 , so I filed this issue to track it.
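A minimal sketch of why this mapping cannot round-trip, assuming the two-byte pointer arithmetic of the Encoding Standard's gb18030 decoder and encoder. The class name and the two-entry table are hypothetical stand-ins for the full index gb18030, which has many more entries.

```java
import java.util.Map;
import java.util.TreeMap;

final class Gb18030PointerSketch {
    // Hypothetical two-entry stand-in for index gb18030, sorted by pointer.
    static final Map<Integer, Integer> INDEX = new TreeMap<>();
    static {
        INDEX.put(pointer(0xA1, 0xA1), 0x3000); // the regular ideographic space
        INDEX.put(pointer(0xA3, 0xA0), 0x3000); // the web-compat mapping discussed here
    }

    // Decoder side: two-byte sequence to index pointer.
    static int pointer(int lead, int trail) {
        int offset = trail < 0x7F ? 0x40 : 0x41;
        return (lead - 0x81) * 190 + (trail - offset);
    }

    // Encoder side: index pointer back to a two-byte sequence.
    static int[] bytes(int pointer) {
        int lead = pointer / 190 + 0x81;
        int trail = pointer % 190;
        int offset = trail < 0x3F ? 0x40 : 0x41;
        return new int[] { lead, trail + offset };
    }

    // The encoder picks the first (lowest) pointer that maps to the code point.
    static int firstPointerFor(int codePoint) {
        for (Map.Entry<Integer, Integer> e : INDEX.entrySet()) {
            if (e.getValue() == codePoint) {
                return e.getKey();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int decoded = INDEX.get(pointer(0xA3, 0xA0));
        int[] reencoded = bytes(firstPointerFor(decoded));
        System.out.printf("0xA3 0xA0 -> U+%04X -> 0x%02X 0x%02X%n",
                decoded, reencoded[0], reencoded[1]);
    }
}
```

Because U+3000 is reachable from two pointers, re-encoding the decoded code point yields 0xA1 0xA1 instead of the original 0xA3 0xA0.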

A bug filed in 2002, https://bugzilla.mozilla.org/show_bug.cgi?id=131837 , mentioned this. The reason for the mapping was that some websites used 0xA3 0xA0 as a space character, which caused display problems, so Mozilla changed the mapping to U+3000 IDEOGRAPHIC SPACE.

In the Hong Kong Supplementary Character Set, U+E5E5 was used to encode 𨪜 (U+2A89C in Unicode CJK Unified Ideographs Extension I).

We need to analyze how many websites using GB 18030 are still using 0xA3 0xA0 to represent U+3000.

Currently, iconv and older versions of ICU seem to map 0xA3 0xA0 to U+E5E5; ICU 74.1+ maps it to U+3000.


The following is some information about this misuse (mostly translated from a Chinese website).

The 0xA3A1 ~ 0xA3FE part of GB18030-2022 is inherited from row 3 of GB 2312, and contains the G0 set of GB/T 1988-80 (ISO 646-CN). GB 2312 does not specify the width of these characters, but subsequent standards (such as GB 5007.1-85) made it clear that characters in row 3 are full-width, which are mapped to the Halfwidth and Fullwidth Forms Unicode block.

The G0 set of GB/T 1988-80 does not include the space character, but, influenced by ASCII, people often group the space together with the 94 graphic characters. Suppose someone assumes that 0xA3A1 ~ 0xA3FE are full-width ASCII characters (although "$" has been replaced by "¥"). That person is then likely to assume that 0xA3 0xA0 is a full-width space (although the actual full-width space is at 0xA1A1). Because some fonts display .notdef as a 1 em wide blank, the rendering is the same even though the two Unicode code points differ (undefined PUA code points in GB encodings are displayed as .notdef).
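The row 3 layout described above can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and the two exceptions (0xA3A4 as the yen sign and 0xA3FE as the macron) reflect my understanding of the GB 2312 row 3 mapping rather than a quote from the standard.

```java
final class Row3Sketch {
    // Maps the trail byte of a 0xA3 xx sequence (0xA1..0xFE) to its fullwidth code point.
    static int row3ToUnicode(int trail) {
        if (trail < 0xA1 || trail > 0xFE) {
            throw new IllegalArgumentException("not in row 3: 0x" + Integer.toHexString(trail));
        }
        if (trail == 0xA4) {
            return 0xFFE5; // FULLWIDTH YEN SIGN instead of a fullwidth "$" (assumed exception)
        }
        if (trail == 0xFE) {
            return 0xFFE3; // FULLWIDTH MACRON instead of a fullwidth "~" (assumed exception)
        }
        return 0xFF00 + (trail - 0xA0);
    }

    // The ASCII position that the "full-width ASCII" mental model associates with a trail byte.
    static int asciiAnalogue(int trail) {
        return trail - 0x80;
    }

    public static void main(String[] args) {
        System.out.printf("0xA3 0xA1 -> U+%04X%n", row3ToUnicode(0xA1));
        System.out.printf("ASCII analogue of trail 0xA0: 0x%02X%n", asciiAnalogue(0xA0));
    }
}
```

Under that mental model, trail byte 0xA0 lines up with ASCII 0x20 (SPACE), even though 0xA0 lies outside the 0xA1..0xFE range that row 3 actually covers.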

@xfq xfq added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-clreq Notifies Chinese script experts of relevant issues labels Nov 13, 2024
@xfq
Author

xfq commented Nov 13, 2024

Some preliminary analysis:

Usage statistics of GB 18030, GBK, and GB 2312 for websites

See:

Bing search engine results

Searching [] (U+E5E5) in Bing China, most of the web pages were in 200x. Some websites may quote text from other websites, resulting in some recent results:

| Website | Results | Latest year on the first page | Bing International |
| --- | --- | --- | --- |
| sina.com.cn | 1,730,000 | 2014 | |
| Sohu.com | 778,000 | 2024 | |
| qq.com | 78,700 | 2021 | |
| Sogou.com | 67,900 | 2021 | |
| Zol.com.cn | 42,300 | 2024 | |
| Jjwxc.net | 39,000 | 2010 | |
| people.com.cn | 27 | 2022 | |
| chinanews.com.cn | 14 | | |
| Pconline.com.cn | 11 | 2024 | |
| china.com | 10 | 2024 | |
| jd.com | 7 | 2024 | |
| Hexun.com | 3 | 2003 | |
| alipay.com | 0 | | 75,000 |

Examples

https://www.chinanews.com.cn/n/2004-01-13/26/391173.html (GB 2312)

image

This website uses both U+3000 and U+E5E5.

https://blog.sina.com.cn/s/blog_44c67f2c0102v4at.html (UTF-8)

image

This website uses both U+3000 and U+E5E5.

https://news.sina.com.cn/c/2004-01-21/09272687351.shtml (GB 2312)

image

This website only uses U+3000.

https://edu.sina.com.cn/focus/wq3/index.html (GB2312, date 2008)

image

Although it's a U+E5E5 result, the source code only contains U+3000.

@hsivonen
Member

Index gb18030 maps 0xA3 0xA0 to U+3000 rather than U+E5E5 for compatibility with deployed content. Therefore it cannot roundtrip.

We didn't update this in #336 , so I filed this issue to track it.

We've had the current behavior for years and years. What practical problem does the current state of things cause?

Changing it would affect the relationship of GBK and GB18030 in the Encoding Standard, right?

@xfq
Author

xfq commented Nov 13, 2024

We've had the current behavior for years and years. What practical problem does the current state of things cause?

Because GB 18030 is a compulsory standard, according to Article 14 of CHAPTER III of the Standardization Law of the People's Republic of China:

Compulsory standards must be complied with. It shall be prohibited to produce, sell or import products that are not up to the compulsory standards.

IANAL, but non-conformance to GB 18030 could be seen as a risk.

@vyv03354
Collaborator

We already diverged from GB18030-2022 spec because of the non-round trip mapping proposed by UTC. There is no point in changing 0xA3 0xA0 mapping from the spec conformance perspective.

@annevk
Member

annevk commented Nov 13, 2024

What supports your claim about ICU? Did https://unicode-org.atlassian.net/browse/ICU-22420 get reverted?

I created web-platform-tests/wpt#49137 to ensure we test this code point. I haven't seen any credible argument in this thread to change this mapping so I'm inclined to close this.

@xfq
Author

xfq commented Nov 14, 2024

What supports your claim about ICU? Did https://unicode-org.atlassian.net/browse/ICU-22420 get reverted?

Sorry, I'm not using the latest version of ICU, you are right. I updated the description above.

I think before closing this issue, at least we should analyze the impact of updating and not updating the mapping.

@annevk
Member

annevk commented Nov 14, 2024

I don't think you have convinced anyone thus far that we collectively need to do that. You are certainly welcome to perform such an analysis of course, but given all browsers agree it's extremely unlikely that making a change here would improve anything.

@aphillips
Contributor

I was actioned by I18N to respond to this issue, which we discussed in our 2024-11-21 call.

As near as I can tell, the code unit sequence 0xA3 0xA0 is not actually assigned in GB18030. A look at CJKV Information Processing suggests that the code space for two-byte sequences does not use the bytes 0xA0 and 0xFF. I do not have a copy of GB18030 handy to look at myself and don't have any direct experience implementing this encoding.

My local encoders (Oracle JVM 23.0.1 and ICU4J v76.1) produce U+E5E5 for this byte sequence. The reverse (encoding U+E5E5 to GB18030-2022) produces 0xA3 0xA0. I reproduce my code below, in case this is useful. I did not test ICU4C.

I've written to Ken Lunde to ask his advice. I do think U+3000 is a tiny bit weird, although the logical character before ! in ASCII is SPACE, so the logical character before U+FF01 (the full-width ！) might be IDEOGRAPHIC SPACE??

I do not think that this represents a critical problem, since no data should exist in GB18030 that uses this byte sequence for anything meaningful. Replacing the sequence with one character or another should produce no meaningful difference, unless I'm not understanding something. But past experience with sequences in a legacy encoding producing different results in different coders has generally been that this becomes a problem at a later date. In this case, I don't think any graphical character will ever be assigned to this specific sequence, so it probably makes no difference.

My code:

    public static void encoding338() {
        try {
            Charset gb18030 = Charset.forName("GB18030-2022");
            CharsetDecoder decoder = gb18030.newDecoder();
            ByteBuffer bb = ByteBuffer.wrap(new byte[] { (byte) 0xA3, (byte) 0xA0 });
            CharBuffer cb = decoder.decode(bb);
            System.out.println(Util.native2ascii(cb.toString())); // not standard code but does what you think it does
            
            Charset icu = CharsetICU.forNameICU("GB18030-2022");
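            decoder = gb18030.newDecoder(); // note: this reuses the JDK charset's decoder, so `icu` is only exercised by the encode call further down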
            bb = ByteBuffer.wrap(new byte[] { (byte) 0xA3, (byte) 0xA0 });
            cb = decoder.decode(bb);
            System.out.println(Util.native2ascii(cb.toString()));

            ByteBuffer out = gb18030.encode(cb);
            byte[] bytes = out.array();
            for (byte b : bytes) {
                System.out.print(Integer.toHexString((int) (b &0xFF)));
                System.out.print(' ');
            }
            System.out.println();
            cb.rewind();
            out = icu.encode(cb);
            bytes = out.array();
            for (byte b : bytes) {
                System.out.print(Integer.toHexString((int) (b &0xFF)));
                System.out.print(' ');
            }
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

Produces:

\ue5e5
\ue5e5
a3 a0 0 0 
a3 a0 0 0 
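The trailing `0 0` in the output above is likely an artifact of printing the whole backing array returned by `ByteBuffer.array()` rather than only the encoded bytes. A minimal sketch of a hex dump that stops at the buffer's limit follows; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

final class HexDumpSketch {
    // Formats only the bytes between position and limit, not the whole backing array.
    static String hex(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        ByteBuffer view = buf.duplicate(); // leave the caller's position untouched
        while (view.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", view.get() & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulates an encoder result: capacity 4, but only 2 bytes written.
        ByteBuffer out = ByteBuffer.allocate(4);
        out.put((byte) 0xA3).put((byte) 0xA0);
        out.flip();
        System.out.println(hex(out));
    }
}
```

Passing the result of `Charset.encode` to a helper like this would print only the bytes the encoder actually produced.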

@vyv03354
Collaborator

vyv03354 commented Nov 21, 2024

Although I don't agree with OP's request, I have to correct the factual error.

As near as I can tell, the code unit sequence 0xA3 0xA0 is not actually assigned in GB18030.

Yes, the code unit sequence 0xA3 0xA0 is assigned in GB18030.

A look at CJKV Information Processing

Why don't you read the spec itself instead of a secondary source?

I do not have a copy of GB18030

🤔

Here is a quote from the GB18030-2022 spec.
image

From the spec compliance perspective, 0xA3 0xA0 must be mapped to U+E5E5, period. We are intentionally violating the spec for web-compat.

I do not think that this represents a critical problem, since no data should exist in GB18030 that uses this byte sequence for anything meaningful.

Even one character mapping change makes GB18030 not a UTF. It breaks round-trip conversion.

I don't think any graphical character will ever be assigned to this specific sequence, so it probably makes no difference.

Yes, it makes a difference. U+E5E5 will render as a white box (on Chromium) or a hexbox (on Firefox), and it will be a visual glitch if the page author intended to use an IDEOGRAPHIC SPACE (U+3000). This is the very reason we violated the spec.
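The difference between the two code points can be checked with the standard `java.lang.Character` API. This is a small sketch with a hypothetical class name. It shows only the Unicode general category, not how any browser or font renders the characters.

```java
final class CodePointKindSketch {
    // Describes the general category of a code point, for the two cases relevant here.
    static String kind(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.PRIVATE_USE:
                return "private use";
            case Character.SPACE_SEPARATOR:
                return "space separator";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println("U+E5E5: " + kind(0xE5E5));
        System.out.println("U+3000: " + kind(0x3000));
    }
}
```

A private-use code point has no standard glyph, so a font without a private assignment for it falls back to a box, while U+3000 is a space.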


5 participants