pannous edited this page Dec 25, 2023 · 5 revisions

Uniscript

Uniscript is a human readable and editable unicode encoding format which only uses ASCII characters to describe code points.

The constituents of uniscript are entities and block types.

Block types that influence the character stream include:

• languages (greek a => α)
• modifiers (upper A => ᴬ , italic A => 𝐴 , bold A => 𝝖 , bold+italic A => 𝘼 … )
• calligraphic hands (fracture A => 𝔄 , double-struck A => 𝔸 … )
• ligature (ligature ae => æ )
• colors (red circle ○ => 🔴 , brown heart ♡ => 🤎 )
• mirroring (reverseInPlace e => ɘ )
• text direction (phoenician a b c => 𐤂 𐤁 𐤀 )
• icons (iconic warning ⚠ => ⚠️ , emoji-style U+FE0F )
• plain (undo all styles: ⚠️ => ⚠ , 𝐴 => A , text-style U+FE0E )
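A block type can be read as a function applied to every character in its scope. The following is a minimal Python sketch of that idea, not the uniscript implementation: the helper name `apply_style` and the table `STYLE_PREFIXES` are hypothetical, and resolving letters through standard Unicode character names is an assumption made for illustration.

```python
import unicodedata

# Hypothetical table: block type -> Unicode name prefixes to try in order.
# The second prefix covers letters that live outside the Mathematical
# Alphanumeric Symbols block (for example ℂ or ℭ).
STYLE_PREFIXES = {
    "fracture": ["MATHEMATICAL FRAKTUR", "BLACK-LETTER"],
    "double-struck": ["MATHEMATICAL DOUBLE-STRUCK", "DOUBLE-STRUCK"],
}

def apply_style(style: str, ch: str) -> str:
    """Return the styled variant of an ASCII letter, or ch unchanged."""
    if not (ch.isascii() and ch.isalpha()):
        return ch
    case = "CAPITAL" if ch.isupper() else "SMALL"
    for prefix in STYLE_PREFIXES[style]:
        try:
            return unicodedata.lookup(f"{prefix} {case} {ch.upper()}")
        except KeyError:
            continue  # no character with that name, try the next prefix
    return ch

print("".join(apply_style("fracture", c) for c in "Abc"))
print(apply_style("double-struck", "A"))
```

Other block types (colors, mirroring, ligatures) would need their own lookup tables, since they cannot be derived from character names this regularly.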

Uniscript entities are case sensitive:

upper a => ᵃ

upper A => ᴬ

Representation

The textual representation of entities and blocks in Uniscript.

Simple entities can be represented as \: followed by the entity name:

\:infinity == ∞
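As a rough illustration, simple entities could be expanded with a single regular expression. This is a sketch under assumptions: the `ENTITIES` dictionary is a made-up excerpt of the mapping file, and the name grammar (ASCII letters, digits, hyphens) is a guess rather than part of the specification.

```python
import re

# Hypothetical excerpt of the entity mapping; the real file would be far larger.
ENTITIES = {"infinity": "∞", "alpha": "α"}

# Assumed name grammar: an ASCII letter followed by letters, digits or hyphens.
SIMPLE_ENTITY = re.compile(r"\\:([A-Za-z][A-Za-z0-9-]*)")

def expand_simple(text: str) -> str:
    """Replace each backslash-colon entity by its character; keep unknown names."""
    return SIMPLE_ENTITY.sub(
        lambda m: ENTITIES.get(m.group(1), m.group(0)), text
    )

print(expand_simple(r"x tends to \:infinity"))
```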

The essential marker for the beginning of complex uniscript elements is "<:".

Entities are wrapped either in a single bracket of the form <:entity> or in a block of the form <:block> entities <:/block>. For short sequences of entities there is an inline delineation: <:type entities>

Examples

<:alpha> ⩵ α

<:fracture A> ⩵ 𝔄

<:fracture A b c > ⩵ 𝔄 𝔟 𝔠

<:fracture> A b c <:> ⩵ 𝔄 𝔟 𝔠

<:greek> a b c <:/greek> ⩵ α β ζ

Closing blocks

Blocks are closed by repeating the opening type, prefixed with a slash. The examples above also use the short closing marker <:>.

<:greek> a b c <:/greek> ⩵ α β ζ

To support interoperability with xml/html the colon in <:/greek> must NOT be omitted!

Spaces

Spaces surrounding entities are purely visual: they are not part of the code point stream and are therefore not rendered in the resulting UTF-8 representation.
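The rules for entities, inline forms, blocks, closing markers and spaces can be combined into a small decoder. The sketch below is one possible reading, not a reference implementation: `ENTITIES` and `BLOCKS` are hypothetical excerpts, spaces are dropped only inside styled spans (an assumption about the spaces rule), text outside blocks is passed through unchanged, and escaping of "<:" is not handled.

```python
import re
import unicodedata

ENTITIES = {"alpha": "α", "infinity": "∞"}  # hypothetical excerpt

def fracture(ch: str) -> str:
    """Map an ASCII letter to its Fraktur form; other characters pass through."""
    if not (ch.isascii() and ch.isalpha()):
        return ch
    case = "CAPITAL" if ch.isupper() else "SMALL"
    for prefix in ("MATHEMATICAL FRAKTUR", "BLACK-LETTER"):
        try:
            return unicodedata.lookup(f"{prefix} {case} {ch.upper()}")
        except KeyError:
            pass
    return ch

BLOCKS = {"fracture": fracture}  # hypothetical excerpt

TAG = re.compile(r"<:(/?)([^<>]*)>")

def styled(chars: str, stack: list) -> str:
    """Apply the innermost open block to chars, dropping decorative spaces."""
    if not stack:
        return chars
    return "".join(BLOCKS[stack[-1]](c) for c in chars if c != " ")

def decode(text: str) -> str:
    out, stack, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(styled(text[pos:m.start()], stack))
        pos = m.end()
        closing, body = m.group(1), m.group(2).strip()
        if closing or not body:          # <:/type> or the short form <:>
            if not stack or (body and stack[-1] != body):
                raise ValueError(f"unexpected closing marker {m.group(0)!r}")
            stack.pop()
        elif body in ENTITIES:           # <:entity>
            out.append(ENTITIES[body])
        elif body in BLOCKS:             # <:type> opens a block
            stack.append(body)
        else:                            # <:type entities> inline form
            kind, _, rest = body.partition(" ")
            if kind not in BLOCKS:
                raise ValueError(f"unknown entity or block {body!r}")
            out.append("".join(BLOCKS[kind](c) for c in rest if c != " "))
    out.append(styled(text[pos:], stack))
    if stack:
        raise ValueError(f"unclosed block {stack[-1]!r}")
    return "".join(out)

print(decode("<:fracture> A b c <:/fracture>"))
```

Keeping open blocks on a stack lets a closing marker be checked against the innermost block, so a mismatched or missing closing marker surfaces as an error instead of silently restyling the rest of the text.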

Fonts

Unicode and fonts overlap conceptually in font faces such as bold and italic, but there are also fonts rendering a normal A as fracture 𝔄.

In an ideal world there would be a cleaner separation between unicode entities and visual variants. This is unfortunately out of scope. With a tiny chance, uniscript could stop or even undo the proliferation of code points such as ♡ => 🤎 by adding colors as unicode control characters instead of arbitrarily combining a select number of entities with a select number of colors.

https://en.wikipedia.org/wiki/Unicode_control_characters

Likewise one might reinvestigate clusters such as ⚠ => ⚠️ and replace those surrogates with something cleaner. These visual aspects should really never have been put into unicode; whoever was responsible should be forced to undo these, or they should be boycotted in favor of a different approach.

Emoji-style U+FE0F control characters are fine in principle, but should precede the character they modify rather than follow it.
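For reference, this is how the current suffix convention looks in code. The sketch assumes hypothetical helpers `iconic` and `plain` named after the block types listed earlier; using NFKC normalization to undo letter styles is an approximation chosen for illustration, and it also folds other compatibility characters (for example ligatures and superscripts).

```python
import unicodedata

VS15 = "\ufe0e"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16: requests emoji presentation

def iconic(ch: str) -> str:
    """Request emoji presentation by appending VS16 after the base character."""
    return ch.rstrip(VS15 + VS16) + VS16

def plain(text: str) -> str:
    """Drop variation selectors, then fold styled letters back via NFKC."""
    stripped = text.replace(VS15, "").replace(VS16, "")
    return unicodedata.normalize("NFKC", stripped)

warning = iconic("\u26a0")           # ⚠ followed by U+FE0F
print([f"U+{ord(c):04X}" for c in warning])
print(plain(warning), plain("\U0001d434"))
```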

On the other hand

IDE support

IDEs may render these brackets beautifully as

⟨alpha⟩ ⩵ α

⟨fracture A⟩ ⩵ 𝔄

⟪greek⟫ a b c ⟪/greek⟫ ⩵ α β ζ

WHY THOUGH?

List of block types:

⟪ligature⟫ ⟪fracture⟫

Entities

All entity mapping shall be defined in one human readable mapping file, which hopefully will one day evolve into a standard. Custom entity names may be defined in an extension file.

Block types versus entities

In general, overlap between entity names and type names can be intentionally ambiguous yet yield the same result:

<:double> d <:> block type marker 'double' influencing all characters, in this case 'd' => 𝕕

<:double d> block type double or entity 'double d'? Irrelevant for users, the result is '𝕕'

<:double-d> one may write entity names unambiguously with hyphens.
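The claim that both readings agree can be checked mechanically. A toy sketch, with hypothetical tables `ENTITIES` and `BLOCKS` holding only the one letter discussed here:

```python
# Hypothetical toy tables covering only the example above.
ENTITIES = {"double d": "𝕕", "double-d": "𝕕"}
BLOCKS = {"double": lambda ch: {"d": "𝕕"}.get(ch, ch)}

def readings(body: str) -> dict:
    """Return every valid interpretation of the text inside <: ... >."""
    key = " ".join(body.split())      # normalise inner whitespace
    found = {}
    if key in ENTITIES:               # reading 1: a single entity name
        found["entity"] = ENTITIES[key]
    kind, _, rest = key.partition(" ")
    if kind in BLOCKS and rest:       # reading 2: block type applied inline
        found["block"] = "".join(BLOCKS[kind](c) for c in rest if c != " ")
    return found

print(readings("double d"))   # both readings are present and agree
print(readings("double-d"))   # only the entity reading applies
```

A mapping file could be validated this way: any name for which the two readings disagree would be a real ambiguity and should be flagged.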

uniscript names

Alternative names for uniscript considered but rejected (not ultimately?) were: unitext plaincode plain-code pluni-code plunicode.

Not to be confused with UTF-7.

Alternative format

An alternative format with the same concepts of entities and block types could be considered:

:alpha :fracture Hello :

Reinvigorating and extending the HTML entity encoding format could also be possible, to encompass the comprehensive list of unicode code point entities with English names plus block type modifiers as declared above.

&ligature; ae &end-ligature;

There is no such thing as plaintext

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

So when using uniscript the encoding always needs to be explicit.

For example, in the future instead of tagging web pages with One might use .

Special remark

All texts containing "<:" as a character sequence not intended as a control signal need to encode it (similar to & in HTML, which is written as &amp;).

One proper encoding of "<:" would be <:less>: or <:<> or <<:colon> or <<::>

Usually free standing "<" characters need NOT be encoded as <:less> because only the combination of "<:" forms a uniscript control signal. Likewise the character ">" NEVER needs to be encoded as <:greater> because ">" does not influence the unicode control flow except as closing entity/block marker AFTER the "<:" marker.

Since entity names are ASCII only, there is no difficulty in parsing <:alpha> > <:beta> as α > β.
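A minimal sketch of this escaping rule, using the first of the proposed encodings (<:less>:). The function names `escape` and `unescape` are hypothetical, and `unescape` only reverses this one entity rather than performing a full decode.

```python
def escape(text: str) -> str:
    """Encode every literal "<:" so it cannot be read as a control signal."""
    return text.replace("<:", "<:less>:")

def unescape(text: str) -> str:
    """Reverse escape() by expanding the <:less> entity back to "<"."""
    return text.replace("<:less>", "<")

sample = "a <: b, plain < and > stay as they are"
print(escape(sample))
print(unescape(escape(sample)) == sample)
```

Only the two-character sequence is rewritten, which matches the remark that a free-standing "<" or ">" needs no encoding.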

Html entities

Entities are similar to HTML entities but use a different encoding: <:alpha> vs &alpha;. HTML entities with cryptic names (such as &dopf; for 𝕕) are supported for backwards compatibility but are strongly discouraged. Uniscript entities are much more comprehensive, and all cryptic abbreviations have one or more equivalent descriptive long English entity names. For example, &dopf; (𝕕) has the entity name <:double d>.

⟪ U+027EA ⟪ entity
⟫ U+027EB ⟫ entity
⟨ U+027E8 ⟨ entity
⟩ U+027E9 ⟩ entity

&fr; &fracture; &opf; &???; 𝕕 U+1D555 𝕕 entity

&DoubleType; 𝕕 ¨ ⇓ …

À U+000C0 À entity
à U+000E0 à entity
ã U+000E3 ã entity
≔ U+02254 ≔ entity
* U+0002A * entity
∧ U+02227 ∧ entity
∠ U+02220 ∠ entity
æ U+000E6 æ entity

ℵ U+02135 ℵ entity
α U+003B1 α entity
∵ U+02235 ∵ entity

⨀ U+02A00 ⨀ entity
⨁ U+02A01 ⨁ entity
⨂ U+02A02 ⨂ entity
⨆ U+02A06 ⨆ entity
★ U+02605 ★ entity
⋁ U+022C1 ⋁ entity
⋀ U+022C0 ⋀ entity
█ U+02588 █ entity

▪ U+025AA ▪ entity
▴ U+025B4 ▴ entity
␣ U+02423 ␣ entity NOT BLANK;)

⊥ U+022A5 ⊥ entity
• U+02022 • entity
· U+000B7 · entity

✓ U+02713 ✓ entity
✓ U+02713 ✓ entity

⊖ U+02296 ⊖ entity
⊕ U+02295 ⊕ entity
⊗ U+02297 ⊗ entity

♣ U+02663 ♣ entity
♣ U+02663 ♣ entity
∷ U+02237 ∷ entity
: U+0003A : entity
© U+000A9 © entity
⨯ U+02A2F ⨯ entity
∪ U+0222A ∪ entity
‐ U+02010 ‐ entity
° U+000B0 ° entity

⋄ U+022C4 ⋄ entity
♦ U+02666 ♦ entity

÷ U+000F7 ÷ entity
÷ U+000F7 ÷ entity

$ U+00024 $ entity

¨ U+000A8 ¨ entity
⇓ U+021D3 ⇓ entity
¨ U+000A8 ¨ entity
˙ U+002D9 ˙ entity

↓ U+02193 ↓ entity

ð U+000F0 ð entity
∃ U+02203 ∃ entity
∃ U+02203 ∃ entity

∀ U+02200 ∀ entity

½ U+000BD ½ entity
½ U+000BD ½ entity …

♥ U+02665 ♥ entity
♥ U+02665 ♥ entity
‐ U+02010 ‐ entity

∈ U+02208 ∈ entity

∫ U+0222B ∫ entity ⚠️
ℤ U+02124 ℤ entity
∫ U+0222B ∫ entity

⁢ U+02062 ⁢ entity ⚠️ ??

κ U+003BA κ entity
λ U+003BB λ entity
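Rows like the ones above can be regenerated from the HTML entity names with the Python standard library. A sketch: `describe` is a hypothetical helper, and the official Unicode character name is shown only as a possible starting point for descriptive names, since uniscript's own long names are defined in its mapping file.

```python
import html
import unicodedata

def describe(entity: str) -> tuple[str, str, str]:
    """Return (character, code point, Unicode name) for an HTML entity."""
    ch = html.unescape(entity)
    if len(ch) != 1:
        # Unknown entity, or one that expands to several code points.
        raise ValueError(f"cannot describe {entity!r}")
    return ch, f"U+{ord(ch):05X}", unicodedata.name(ch)

for entity in ("&alpha;", "&dopf;"):
    print(entity, *describe(entity))
```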

abbreviations

open questions :

• Should partial entity names be completed by the IDE or also be allowed in uniscript :nat :hyph :alp ?
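Whichever side performs the completion, the core logic would be unique-prefix matching. A sketch with a hypothetical name list `ENTITY_NAMES`; the full names behind the abbreviations in the question are guesses made for illustration.

```python
# Hypothetical entity names; the real list would come from the mapping file.
ENTITY_NAMES = ["aleph", "alpha", "hyphen", "infinity", "natural"]

def complete(prefix: str, names=ENTITY_NAMES) -> list:
    """All known entity names starting with prefix, sorted."""
    return sorted(n for n in names if n.startswith(prefix))

def resolve_abbreviation(prefix: str) -> str:
    """Expand a partial name only if it identifies exactly one entity."""
    matches = complete(prefix)
    if prefix in matches:          # an exact name always wins
        return prefix
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"{prefix!r} is unknown or ambiguous: {matches}")

print(resolve_abbreviation("alp"))
print(complete("al"))
```

The sketch also shows the main risk of allowing abbreviations in the format itself: adding a new entity name to the mapping can turn a previously unique prefix into an ambiguous one.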


⚠️ specification and progress are out of sync
