emoji sort order in the DUCET #773

Open
markusicu opened this issue Apr 8, 2024 · 3 comments

markusicu commented Apr 8, 2024

Goals for this issue:

  1. Agree on whether to move the UTS51 emoji sort order into the DUCET
  2. If agreed: Figure out how to do it

The DUCET could in principle sort symbols arbitrarily, for example by code point. However, it defines a bespoke sort order:
https://www.unicode.org/charts/collation/chart_General-Symbol.html

The DUCET sort order of emoji generally does not group similar emoji together unless they have adjacent code points.

At least one Unicode member organization has bug reports about the sort order of emoji.

UTS51 has long defined a grouping and sort order for emoji:

CLDR has long included a collation tailoring for this (see above), but it is hard to use.

  • When someone wants a language-specific sort order of letters but also the emoji sort order of symbols, then they have to combine the two tailoring rules and build a new collator at runtime.
  • Building a collator at runtime is expensive (time & memory).
  • Not everyone has access to the tailoring rule strings (because they are large, and rarely used).
  • Not everyone has access to the collation tailoring builder code (because it is complex and large).

CLDR has ticket CLDR-10745 “Merge emoji into CLDR root”. If the emoji sort order were built into the default sort order, then it would always be available.

We want the DUCET and CLDR root default sort orders to be the same.

If we agree to move the UTS51 emoji sort order into both default sort orders, then the cleanest way to do so is to modify the DUCET input data file, and to extend the code that parses this file and outputs the actual sort order file so that it handles whatever the emoji order needs that it does not already support.

@markusicu markusicu added the uca label Apr 8, 2024
@markusicu

Discussion in CLDR/ICU design meeting 20240506:

Options

  1. Sort emoji as in the ESR ordering in root=DUCET
    1. Concern: The ordering has many contractions which could noticeably increase the size of the root collation data
  2. Sort emoji as in the ESR ordering in an optional variant of root=DUCET
    1. Goal: Mitigate size increase of root for people who are more sensitive to size than emoji order
    2. Orthogonal to the existing optional variant of unihan (implicit vs. radical-stroke)
    3. Would require either a build-time switch that also builds potentially different tailoring binaries for different root tables, or engineering to make sure that pre-built tailorings work with any of the root variants (e.g., affecting disjoint sets of characters)
  3. Sort emoji in root=DUCET with a space-optimized sort order (e.g., singer sorts as “person” followed by “microphone”)
    1. Work to create a new, compromise sort order with fewer contractions, then get that into root=DUCET
  4. No change – keep it in a hard-to-use tailoring
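To make option 3 concrete, here is a minimal Python sketch of what "singer sorts as person followed by microphone" could mean: without a contraction, a ZWJ sequence gets a key built from its components, in order. This is an illustration only, not how ICU or the sifter tool works internally. The five-entry `ORDER` list and the `sort_key` helper are assumptions made up for this example, and the ZWJ and VS16 are simply skipped.

```python
ZWJ = "\u200d"    # ZERO WIDTH JOINER
VS16 = "\ufe0f"   # VARIATION SELECTOR-16

PERSON = "\U0001F9D1"
MAN = "\U0001F468"
WOMAN = "\U0001F469"
PALETTE = "\U0001F3A8"
MICROPHONE = "\U0001F3A4"

# Hypothetical, tiny stand-in for the full emoji order list.
ORDER = [PERSON, MAN, WOMAN, PALETTE, MICROPHONE]
RANK = {ch: i for i, ch in enumerate(ORDER)}


def sort_key(s):
    """Key made of the ranks of the components; ZWJ and VS16 are ignored."""
    return tuple(RANK[ch] for ch in s if ch not in (ZWJ, VS16))


singer = PERSON + ZWJ + MICROPHONE   # no contraction: person, then microphone
artist = PERSON + ZWJ + PALETTE      # no contraction: person, then palette

items = [MAN, singer, MICROPHONE, artist, PERSON]
ordered = sorted(items, key=sort_key)
print(ordered)
```

In this sketch the ZWJ sequences still land right after "person" and before "man", but their relative order follows the order of the second component rather than a hand-picked order for professions, which is the trade-off option 3 accepts.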

@markusicu markusicu self-assigned this May 6, 2024
@markusicu

TODO(markus): Talk with ESR to see whether it would be acceptable to use a simplified emoji sort order without the << distinctions in order to move this into the DUCET. This would remove concerns about data size, and it would likely avoid problems with the sifter tool that generates the DUCET data.

@markusicu

markusicu commented May 10, 2024

Possible simplified sort order.
Copied from https://github.com/unicode-org/cldr/blob/main/common/collation/root.xml#L953
and then removed contractions for most of the ZWJ sequences, and expansions for people-holding-hands.
I kept the keycaps and flags.

I did this manually, for discussion, so it may not be 100% right.
And I am not sure about some contractions that make emoji with U+FE0F VARIATION SELECTOR-16 (VS16) sort the same as those without. My hacking is probably inconsistent there.
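One way to check the VS16 handling for consistency is a small script over the rule lines. The following Python sketch looks at lines of the form `< a = b` and reports those where the two sides differ by more than U+FE0F. The helper names are made up for this example, the parsing covers only the simple line shapes used in the rules here (not the full ICU rule syntax), and the third sample line is a deliberately broken, hypothetical rule.

```python
VS16 = "\ufe0f"  # VARIATION SELECTOR-16


def strip_vs16(s):
    """Remove every VS16 from the string."""
    return s.replace(VS16, "")


def split_equal_rule(line):
    """Return (left, right) for a '< a = b' line, or None for other lines."""
    body = line.strip()
    if not body.startswith("<") or body.startswith("<*") or "=" not in body:
        return None
    left, right = body[1:].split("=", 1)
    # Drop surrounding spaces and the quotes used around '#' and '*' keycaps.
    return left.strip().strip("'"), right.strip().strip("'")


def inconsistent_lines(lines):
    """Lines whose two sides are not identical once VS16 is removed."""
    bad = []
    for line in lines:
        pair = split_equal_rule(line)
        if pair is not None and strip_vs16(pair[0]) != strip_vs16(pair[1]):
            bad.append(line)
    return bad


GOOD_HEART = "< \u2764\u200d\U0001F525 = \u2764\ufe0f\u200d\U0001F525"
GOOD_KEYCAP = "< '#\u20e3' = '#\ufe0f\u20e3'"
# Hypothetical typo: the right side ends in a different emoji.
BAD_EXAMPLE = "< \u26d3\u200d\U0001F4A5 = \u26d3\ufe0f\u200d\U0001F525"

bad = inconsistent_lines([GOOD_HEART, GOOD_KEYCAP, BAD_EXAMPLE])
print(bad)
```

This only checks the explicit `=` pairs; it does not find sequences that are missing such a pair.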

In the end, I also kept some ZWJ sequence contractions for things like lime, broken link, etc., assuming that we can support a small-ish number of them. (Will need some work in the sifter tool.)

Once we agree on an approach, we will need to modify the generator code and get the real thing.
(And once we agree on that, we need to get it into the DUCET input format (unidata.txt).)

For trying this out, either build an ICU RuleBasedCollator for the rules, or paste them into the "Append rules" box of the ICU Collation Demo.

& [last primary ignorable]<<*🦰🦱🦳🦲🏻🏼🏽🏾🏿
& [before 1]\uFDD1€
<*😀😃😄😁😆😅🤣😂🙂🙃🫠😉😊😇
<*🥰😍🤩😘😗☺😚😙🥲
<*😋😛😜🤪😝🤑
<*🤗🤭🫢🫣🤫🤔🫡
<*🤐🤨😐😑😶🫥
< 😶‍🌫
<*😏😒🙄😬
< 😮‍💨
<*🤥🫨
< 🙂‍↔
< 🙂‍↕
<*😌😔😪🤤😴
<*😷🤒🤕🤢🤮🤧🥵🥶🥴😵
< 😵‍💫
<*🤯
<*🤠🥳🥸
<*😎🤓🧐
<*😕🫤😟🙁☹😮😯😲😳🥺🥹😦😧😨😰😥😢😭😱😖😣😞😓😩😫🥱
<*😤😡😠🤬😈👿💀☠
<*💩🤡👹👺👻👽👾🤖
<*😺😸😹😻😼😽🙀😿😾
<*🙈🙉🙊
<*💌💘💝💖💗💓💞💕💟❣💔
< ❤‍🔥 = ❤️‍🔥
< ❤‍🩹 = ❤️‍🩹
<*❤🩷🧡💛💚💙🩵💜🤎🖤🩶🤍
<*💋💯💢💥💫💦💨🕳💬
< 👁‍🗨 = 👁️‍🗨
<*🗨🗯💭💤
<*👋🤚🖐✋🖖🫱🫲🫳🫴🫷🫸
<*👌🤌🤏✌🤞🫰🤟🤘🤙
<*👈👉👆🖕👇☝🫵
<*👍👎✊👊🤛🤜
<*👏🙌🫶👐🤲🤝🙏
<*✍💅🤳
<*💪🦾🦿🦵🦶👂🦻👃🧠🫀🫁🦷🦴👀👁👅👄🫦
<*👶🧒👦👧🧑👱👨🧔
<*👩
<*🧓👴👵
<*🙍
<*🙎
<*🙅
<*🙆
<*💁
<*🙋
<*🧏
<*🙇
<*🤦
<*🤷
<*👮
<*🕵
<*💂
<*🥷👷
<*🫅🤴👸👳
<*👲🧕🤵
<*👰
<*🤰🫃🫄🤱
<*👼🎅🤶
<*🦸
<*🦹
<*🧙
<*🧚
<*🧛
<*🧜
<*🧝
<*🧞
<*🧟
<*🧌
<*💆
<*💇
<*🚶
<*🧍
<*🧎
<*🏃
<*💃🕺🕴👯
<*🧖
<*🧗
<*🤺🏇⛷🏂🏌
<*🏄
<*🚣
<*🏊
<*⛹
<*🏋
<*🚴
<*🚵
<*🤸
<*🤼
<*🤽
<*🤾
<*🤹
<*🧘
<*🛀🛌
<*💏
<*💑
<*🗣👤👥🫂👪
<*👣
<*🦰🦱🦳🦲
<*🐵🐒🦍🦧🐶🐕🦮
< 🐕‍🦺
<*🐩🐺🦊🦝🐱🐈
< 🐈‍⬛
<*🦁🐯🐅🐆🐴🫎🫏🐎🦄🦓🦌🦬🐮🐂🐃🐄🐷🐖🐗🐽🐏🐑🐐🐪🐫🦙🦒🐘🦣🦏🦛🐭🐁🐀🐹🐰🐇🐿🦫🦔🦇🐻
< 🐻‍❄
<*🐨🐼🦥🦦🦨🦘🦡🐾
<*🦃🐔🐓🐣🐤🐥🐦🐧🕊🦅🦆🦢🦉🦤🪶🦩🦚🦜🪽
< 🐦‍⬛
<*🪿
< 🐦‍🔥
<*🐸
<*🐊🐢🦎🐍🐲🐉🦕🦖
<*🐳🐋🐬🦭🐟🐠🐡🦈🐙🐚🪸🪼
<*🐌🦋🐛🐜🐝🪲🐞🦗🪳🕷🕸🦂🦟🪰🪱🦠
<*💐🌸💮🪷🏵🌹🥀🌺🌻🌼🌷🪻
<*🌱🪴🌲🌳🌴🌵🌾🌿☘🍀🍁🍂🍃🪹🪺🍄
<*🍇🍈🍉🍊🍋
< 🍋‍🟩
<*🍌🍍🥭🍎🍏🍐🍑🍒🍓🫐🥝🍅🫒🥥
<*🥑🍆🥔🥕🌽🌶🫑🥒🥬🥦🧄🧅🥜🫘🌰🫚🫛
< 🍄‍🟫
<*🍞🥐🥖🫓🥨🥯🥞🧇🧀🍖🍗🥩🥓🍔🍟🍕🌭🥪🌮🌯🫔🥙🧆🥚🍳🥘🍲🫕🥣🥗🍿🧈🧂🥫
<*🍱🍘🍙🍚🍛🍜🍝🍠🍢🍣🍤🍥🥮🍡🥟🥠🥡
<*🦀🦞🦐🦑🦪
<*🍦🍧🍨🍩🍪🎂🍰🧁🥧🍫🍬🍭🍮🍯
<*🍼🥛☕🫖🍵🍶🍾🍷🍸🍹🍺🍻🥂🥃🫗🥤🧋🧃🧉🧊
<*🥢🍽🍴🥄🔪🫙🏺
<*🌍🌎🌏🌐🗺🗾🧭
<*🏔⛰🌋🗻🏕🏖🏜🏝🏞
<*🏟🏛🏗🧱🪨🪵🛖🏘🏚🏠🏡🏢🏣🏤🏥🏦🏨🏩🏪🏫🏬🏭🏯🏰💒🗼🗽
<*⛪🕌🛕🕍⛩🕋
<*⛲⛺🌁🌃🏙🌄🌅🌆🌇🌉♨🎠🛝🎡🎢💈🎪
<*🚂🚃🚄🚅🚆🚇🚈🚉🚊🚝🚞🚋🚌🚍🚎🚐🚑🚒🚓🚔🚕🚖🚗🚘🚙🛻🚚🚛🚜🏎🏍🛵🦽🦼🛺🚲🛴🛹🛼🚏🛣🛤🛢⛽🛞🚨🚥🚦🛑🚧
<*⚓🛟⛵🛶🚤🛳⛴🛥🚢
<*✈🛩🛫🛬🪂💺🚁🚟🚠🚡🛰🚀🛸
<*🛎🧳
<*⌛⏳⌚⏰⏱⏲🕰🕛🕧🕐🕜🕑🕝🕒🕞🕓🕟🕔🕠🕕🕡🕖🕢🕗🕣🕘🕤🕙🕥🕚🕦
<*🌑🌒🌓🌔🌕🌖🌗🌘🌙🌚🌛🌜🌡☀🌝🌞🪐⭐🌟🌠🌌☁⛅⛈🌤🌥🌦🌧🌨🌩🌪🌫🌬🌀🌈🌂☂☔⛱⚡❄☃⛄☄🔥💧🌊
<*🎃🎄🎆🎇🧨✨🎈🎉🎊🎋🎍🎎🎏🎐🎑🧧🎀🎁🎗🎟🎫
<*🎖🏆🏅🥇🥈🥉
<*⚽⚾🥎🏀🏐🏈🏉🎾🥏🎳🏏🏑🏒🥍🏓🏸🥊🥋🥅⛳⛸🎣🤿🎽🎿🛷🥌
<*🎯🪀🪁🔫🎱🔮🪄🎮🕹🎰🎲🧩🧸🪅🪩🪆♠♥♦♣♟🃏🀄🎴
<*🎭🖼🎨🧵🪡🧶🪢
<*👓🕶🥽🥼🦺👔👕👖🧣🧤🧥🧦👗👘🥻🩱🩲🩳👙👚🪭👛👜👝🛍🎒🩴👞👟🥾🥿👠👡🩰👢🪮👑👒🎩🎓🧢🪖⛑📿💄💍💎
<*🔇🔈🔉🔊📢📣📯🔔🔕
<*🎼🎵🎶🎙🎚🎛🎤🎧📻
<*🎷🪗🎸🎹🎺🎻🪕🥁🪘🪇🪈
<*📱📲☎📞📟📠
<*🔋🪫🔌💻🖥🖨⌨🖱🖲💽💾💿📀🧮
<*🎥🎞📽🎬📺📷📸📹📼🔍🔎🕯💡🔦🏮🪔
<*📔📕📖📗📘📙📚📓📒📃📜📄📰🗞📑🔖🏷
<*💰🪙💴💵💶💷💸💳🧾💹
<*✉📧📨📩📤📥📦📫📪📬📭📮🗳
<*✏✒🖋🖊🖌🖍📝
<*💼📁📂🗂📅📆🗒🗓📇📈📉📊📋📌📍📎🖇📏📐✂🗃🗄🗑
<*🔒🔓🔏🔐🔑🗝
<*🔨🪓⛏⚒🛠🗡⚔💣🪃🏹🛡🪚🔧🪛🔩⚙🗜⚖🦯🔗
< ⛓‍💥 = ⛓️‍💥
<*⛓🪝🧰🧲🪜
<*⚗🧪🧫🧬🔬🔭📡
<*💉🩸💊🩹🩼🩺🩻
<*🚪🛗🪞🪟🛏🛋🪑🚽🪠🚿🛁🪤🪒🧴🧷🧹🧺🧻🪣🧼🫧🪥🧽🧯🛒
<*🚬⚰🪦⚱🧿🪬🗿🪧🪪
<*🏧🚮🚰♿🚹🚺🚻🚼🚾🛂🛃🛄🛅
<*⚠🚸⛔🚫🚳🚭🚯🚱🚷📵🔞☢☣
<*⬆↗➡↘⬇↙⬅↖↕↔↩↪⤴⤵🔃🔄🔙🔚🔛🔜🔝
<*🛐⚛🕉✡☸☯✝☦☪☮🕎🔯🪯
<*♈♉♊♋♌♍♎♏♐♑♒♓⛎
<*🔀🔁🔂▶⏩⏭⏯◀⏪⏮🔼⏫🔽⏬⏸⏹⏺⏏🎦🔅🔆📶🛜📳📴
<*♀♂⚧
<*✖➕➖➗🟰♾
<*‼⁉❓❔❕❗〰
<*💱💲
<*⚕♻⚜🔱📛🔰⭕✅☑✔❌❎➰➿〽✳✴❇©®™
< '#⃣' = '#️⃣'
< '*⃣' = '*️⃣'
< 0⃣ = 0️⃣
< 1⃣ = 1️⃣
< 2⃣ = 2️⃣
< 3⃣ = 3️⃣
< 4⃣ = 4️⃣
< 5⃣ = 5️⃣
< 6⃣ = 6️⃣
< 7⃣ = 7️⃣
< 8⃣ = 8️⃣
< 9⃣ = 9️⃣
<*🔟
<*🔠🔡🔢🔣🔤🅰🆎🅱🆑🆒🆓ℹ🆔Ⓜ🆕🆖🅾🆗🅿🆘🆙🆚🈁🈂🈷🈶🈯🉐🈹🈚🈲🉑🈸🈴🈳㊗㊙🈺🈵
<*🔴🟠🟡🟢🔵🟣🟤⚫⚪🟥🟧🟨🟩🟦🟪🟫⬛⬜◼◻◾◽▪▫🔶🔷🔸🔹🔺🔻💠🔘🔳🔲
<*🏁🚩🎌🏴🏳
< 🏳‍🌈 = 🏳️‍🌈
< 🏳‍⚧ = 🏳️‍⚧
< 🏴‍☠
<*🇦🇧🇨🇩🇪🇫🇬🇭🇮🇯🇰🇱🇲🇳🇴🇵🇶🇷🇸🇹🇺🇻🇼🇽🇾🇿
< 🏴󠁧󠁢󠁥󠁮󠁧󠁿
< 🏴󠁧󠁢󠁳󠁣󠁴󠁿
< 🏴󠁧󠁢󠁷󠁬󠁳󠁿
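For a quick look at the resulting order without building a collator, the rule lines above can be flattened into a plain list. The Python sketch below is an assumption-laden helper, not an ICU rule parser: it understands only `<*` lists (one element per code point) and single `< x` or `< x = y` lines, and it skips the `&` reset lines, including the `<<*` list on the first one. The sample input is a four-line excerpt of the rules above, written with escapes so that the ZWJ is visible.

```python
def parse_rules(text):
    """Flatten '<*' lists and single '<' items into one ordered list."""
    order = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("<*"):
            # Starred list: every code point is its own element.
            order.extend(line[2:])
        elif line.startswith("<") and not line.startswith("<<"):
            # Single item; keep only the left side of an optional '= ...'.
            first = line[1:].split("=", 1)[0].strip().strip("'")
            order.append(first)
        # '&' reset lines and anything else are skipped in this sketch.
    return order


SAMPLE = "\n".join([
    "<*\U0001F925\U0001FAE8",
    "< \U0001F642\u200d\u2194",
    "< \U0001F642\u200d\u2195",
    "<*\U0001F60C\U0001F614",
])

order = parse_rules(SAMPLE)
position = {s: i for i, s in enumerate(order)}
print(len(order), position["\U0001F60C"])
```

The positions are just list indexes for inspection; they are not collation weights, and real sorting still needs the ICU RuleBasedCollator or the Collation Demo.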
