Commit 0d383f8

gh-54874: Expand unicodedata module documentation (#138301)
Closes #54874

Co-authored-by: Alexander Belopolsky
1 parent c9b252c commit 0d383f8

1 file changed

+68
-32
lines changed

Doc/library/unicodedata.rst

Lines changed: 68 additions & 32 deletions
@@ -25,80 +25,133 @@ Standard Annex #44, `"Unicode Character Database"
2525
<https://www.unicode.org/reports/tr44/>`_. It defines the
2626
following functions:
2727

28+
.. seealso::
29+
30+
The :ref:`unicode-howto` for more information about Unicode and how to use
31+
this module.
32+
2833

2934
.. function:: lookup(name)
3035

3136
Look up character by name. If a character with the given name is found, return
3237
the corresponding character. If not found, :exc:`KeyError` is raised.
38+
For example::
39+
40+
>>> unicodedata.lookup('LEFT CURLY BRACKET')
41+
'{'
42+
43+
The characters returned by this function are the same as those produced by
44+
the ``\N`` escape sequence in string literals. For example::
45+
46+
>>> unicodedata.lookup('MIDDLE DOT') == '\N{MIDDLE DOT}'
47+
True
3348

3449
.. versionchanged:: 3.3
3550
Support for name aliases [#]_ and named sequences [#]_ has been added.
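Because an unknown name raises :exc:`KeyError`, callers that accept names from user input may want a fallback. The following is a minimal sketch of that pattern; ``lookup_or_default`` is a hypothetical helper for illustration and is not part of the module.

```python
import unicodedata


def lookup_or_default(name, default=None):
    """Return the character called *name*, or *default* if the name is unknown.

    Hypothetical helper for illustration; not provided by unicodedata.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() raises KeyError when no character has the given name.
        return default


brace = lookup_or_default('LEFT CURLY BRACKET')
missing = lookup_or_default('NO SUCH CHARACTER NAME', '?')
```

Here ``brace`` is the character found by name and ``missing`` is the fallback value, since no character carries the second name.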
3651

3752

38-
.. function:: name(chr[, default])
53+
.. function:: name(chr, default=None, /)
3954

4055
Returns the name assigned to the character *chr* as a string. If no
4156
name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
42-
raised.
57+
raised. For example::
58+
59+
>>> unicodedata.name('½')
60+
'VULGAR FRACTION ONE HALF'
61+
>>> unicodedata.name('\uFFFF', 'fallback')
62+
'fallback'
4363

4464

45-
.. function:: decimal(chr[, default])
65+
.. function:: decimal(chr, default=None, /)
4666

4767
Returns the decimal value assigned to the character *chr* as integer.
4868
If no such value is defined, *default* is returned, or, if not given,
49-
:exc:`ValueError` is raised.
69+
:exc:`ValueError` is raised. For example::
5070

71+
>>> unicodedata.decimal('\N{ARABIC-INDIC DIGIT NINE}')
72+
9
73+
>>> unicodedata.decimal('\N{SUPERSCRIPT NINE}', -1)
74+
-1
5175

52-
.. function:: digit(chr[, default])
76+
77+
.. function:: digit(chr, default=None, /)
5378

5479
Returns the digit value assigned to the character *chr* as integer.
5580
If no such value is defined, *default* is returned, or, if not given,
56-
:exc:`ValueError` is raised.
81+
:exc:`ValueError` is raised::
82+
83+
>>> unicodedata.digit('\N{SUPERSCRIPT NINE}')
84+
9
5785

5886

59-
.. function:: numeric(chr[, default])
87+
.. function:: numeric(chr, default=None, /)
6088

6189
Returns the numeric value assigned to the character *chr* as float.
6290
If no such value is defined, *default* is returned, or, if not given,
63-
:exc:`ValueError` is raised.
91+
:exc:`ValueError` is raised::
92+
93+
>>> unicodedata.numeric('½')
94+
0.5
6495

6596

6697
.. function:: category(chr)
6798

6899
Returns the general category assigned to the character *chr* as
69-
string.
100+
string. General category names consist of two letters.
101+
See the `General Category Values section of the Unicode Character
102+
Database documentation <https://www.unicode.org/reports/tr44/#General_Category_Values>`_
103+
for a list of category codes. For example::
104+
105+
>>> unicodedata.category('A') # 'L'etter, 'u'ppercase
106+
'Lu'
70107

71108

72109
.. function:: bidirectional(chr)
73110

74111
Returns the bidirectional class assigned to the character *chr* as
75112
string. If no such value is defined, an empty string is returned.
113+
See the `Bidirectional Class Values section of the Unicode Character
114+
Database <https://www.unicode.org/reports/tr44/#Bidi_Class_Values>`_
115+
documentation for a list of bidirectional codes. For example::
116+
117+
>>> unicodedata.bidirectional('\N{ARABIC-INDIC DIGIT SEVEN}') # 'A'rabic, 'N'umber
118+
'AN'
76119

77120

78121
.. function:: combining(chr)
79122

80123
Returns the canonical combining class assigned to the character *chr*
81124
as integer. Returns ``0`` if no combining class is defined.
125+
See the `Canonical Combining Class Values section of the Unicode Character
126+
Database <https://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values>`_
127+
for more information.
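As a short sketch of what the returned values look like, the snippet below compares a base letter with two combining marks. The variable names are illustrative only.

```python
import unicodedata

# A base letter has no combining class, so 0 is returned.
base_class = unicodedata.combining('A')

# Marks rendered above the base character use class 230.
tilde_class = unicodedata.combining('\N{COMBINING TILDE}')

# Marks attached below the base character use class 202.
cedilla_class = unicodedata.combining('\N{COMBINING CEDILLA}')
```

A nonzero result indicates a combining mark; the numeric value determines how marks are ordered during normalization.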
82128

83129

84130
.. function:: east_asian_width(chr)
85131

86132
Returns the East Asian width assigned to the character *chr* as
87-
string.
133+
string. For a list of widths and more information, see the
134+
`Unicode Standard Annex #11 <https://www.unicode.org/reports/tr11/>`_.
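The sketch below shows three of the width codes on sample characters. The variable names are illustrative, and the codes themselves are defined in the annex referenced above.

```python
import unicodedata

# An ASCII letter is narrow ('Na').
narrow = unicodedata.east_asian_width('A')

# A fullwidth compatibility form is fullwidth ('F').
full = unicodedata.east_asian_width('\N{FULLWIDTH LATIN CAPITAL LETTER A}')

# A hiragana letter is wide ('W').
wide = unicodedata.east_asian_width('\N{HIRAGANA LETTER A}')
```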
88135

89136

90137
.. function:: mirrored(chr)
91138

92139
Returns the mirrored property assigned to the character *chr* as
93140
integer. Returns ``1`` if the character has been identified as a "mirrored"
94-
character in bidirectional text, ``0`` otherwise.
141+
character in bidirectional text, ``0`` otherwise. For example::
142+
143+
>>> unicodedata.mirrored('>')
144+
1
95145

96146

97147
.. function:: decomposition(chr)
98148

99149
Returns the character decomposition mapping assigned to the character
100150
*chr* as string. An empty string is returned in case no such mapping is
101-
defined.
151+
defined. For example::
152+
153+
>>> unicodedata.decomposition('Ã')
154+
'0041 0303'
102155

103156

104157
.. function:: normalize(form, unistr)
@@ -122,9 +175,9 @@ following functions:
122175
normally would be unified with other characters. For example, U+2160 (ROMAN
123176
NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
124177
However, it is supported in Unicode for compatibility with existing character
125-
sets (e.g. gb2312).
178+
sets (for example, gb2312).
126179

127-
The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
180+
The normal form KD (NFKD) will apply the compatibility decomposition, that is,
128181
replace all compatibility characters with their equivalents. The normal form KC
129182
(NFKC) first applies the compatibility decomposition, followed by the canonical
130183
composition.
@@ -133,6 +186,7 @@ following functions:
133186
a human reader, if one has combining characters and the other
134187
doesn't, they may not compare equal.
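A minimal sketch of the comparison problem described above, and of how normalizing both strings to the same form resolves it. The variable names are illustrative only.

```python
import unicodedata

composed = '\N{LATIN SMALL LETTER E WITH ACUTE}'   # one code point
decomposed = 'e\N{COMBINING ACUTE ACCENT}'         # base letter plus combining mark

# The two strings look identical when rendered but differ code point by code point.
equal_before = composed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
equal_after = (unicodedata.normalize('NFC', composed)
               == unicodedata.normalize('NFC', decomposed))

# Compatibility decomposition replaces ROMAN NUMERAL ONE with a plain capital I.
roman_one = unicodedata.normalize('NFKD', '\N{ROMAN NUMERAL ONE}')
```

Either NFC or NFD works for the comparison, provided both operands use the same form.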
135188

189+
136190
.. function:: is_normalized(form, unistr)
137191

138192
Return whether the Unicode string *unistr* is in the normal form *form*. Valid
@@ -154,24 +208,6 @@ In addition, the module exposes the following constant:
154208
Unicode database version 3.2 instead, for applications that require this
155209
specific version of the Unicode database (such as IDNA).
156210

157-
Examples:
158-
159-
>>> import unicodedata
160-
>>> unicodedata.lookup('LEFT CURLY BRACKET')
161-
'{'
162-
>>> unicodedata.name('/')
163-
'SOLIDUS'
164-
>>> unicodedata.decimal('9')
165-
9
166-
>>> unicodedata.decimal('a')
167-
Traceback (most recent call last):
168-
File "<stdin>", line 1, in <module>
169-
ValueError: not a decimal
170-
>>> unicodedata.category('A') # 'L'etter, 'u'ppercase
171-
'Lu'
172-
>>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
173-
'AN'
174-
175211

176212
.. rubric:: Footnotes
177213
