Commit c09fed5

Define TUTF-8 encoding and Tcl values.
2 parents 3f60f3e + 88aba4d commit c09fed5

14 files changed

+211
-158
lines changed

doc/ByteArrObj.3

Lines changed: 4 additions & 2 deletions
@@ -53,6 +53,9 @@ trigger proper error-handling), otherwise expect it to crash.
5353
.BE
5454
.SH DESCRIPTION
5555
.PP
56+
N.B. Refer to the \fBTcl_UniChar\fR documentation page for a description of the
57+
\fITUTF-8\fR encoding and related terms referenced here.
58+
.PP
5659
These routines are used to create, modify, store, transfer, and retrieve
5760
arbitrary binary data in Tcl values. Specifically, data that can be
5861
represented as a sequence of arbitrary byte values is supported.
@@ -74,8 +77,7 @@ of \fIN\fR bytes is transformed into the corresponding sequence
7477
of \fIN\fR characters, where each byte value transforms to the same
7578
character codepoint value in the range (U+0000 - U+00FF). Obtaining the
7679
string representation of a byte-array value (by calling
77-
\fBTcl_GetStringFromObj\fR) produces this string in Tcl's usual
78-
Modified UTF-8 encoding.
80+
\fBTcl_GetStringFromObj\fR) produces this string in TUTF-8 encoding.
7981
.PP
8082
\fBTcl_NewByteArrayObj\fR and \fBTcl_SetByteArrayObj\fR
8183
create a new value or overwrite an existing unshared value, respectively,
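
The byte-to-character mapping described in this hunk can be illustrated with a small standalone sketch. It is not Tcl's implementation and calls no Tcl API: `bytes_to_tutf8` is a hypothetical helper, and the assumption that a zero byte becomes the two-byte sequence 0xC0 0x80 is taken from the doc/StringObj.3 text in this same commit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Tcl): writes the string form of the
 * n bytes at src into dst and returns the number of bytes written,
 * not counting the terminating null. dst must hold at least 2*n + 1 bytes.
 */
size_t bytes_to_tutf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char b = src[i];

        if (b == 0) {
            /* Assumed TUTF-8 form of U+0000: the pair 0xC0 0x80. */
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else if (b < 0x80) {
            /* U+0001..U+007F: one byte, unchanged. */
            dst[out++] = b;
        } else {
            /* U+0080..U+00FF: ordinary two-byte UTF-8 form. */
            dst[out++] = (unsigned char)(0xC0 | (b >> 6));
            dst[out++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    dst[out] = 0;   /* trailing null after the last data byte */
    return out;
}
```

Under these assumptions a sequence of N bytes always yields N characters, while the encoded length varies between N and 2N bytes depending on how many input bytes are zero or above 0x7F.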

doc/CrtCommand.3

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ last value is NULL.
9797
Note that the argument strings should not be modified as they may
9898
point to constant strings or may be shared with other parts of the
9999
interpreter.
100-
Note also that the argument strings are encoded in normalized UTF-8 since
100+
Note also that the argument strings are encoded in normalized TUTF-8 since
101101
version 8.1 of Tcl.
102102
.PP
103103
\fIProc\fR must return an integer code that is expected to be one of

doc/DString.3

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -65,6 +65,9 @@ dynamic string.
6565

6666
.SH DESCRIPTION
6767
.PP
68+
N.B. Refer to the \fBTcl_UniChar\fR documentation page for a description of the
69+
\fITUTF-8\fR encoding and related terms referenced here.
70+
.PP
6871
Dynamic strings provide a mechanism for building up arbitrarily long
6972
strings by gradually appending information. If the dynamic string is
7073
short then there will be no memory allocation overhead; as the string
@@ -144,7 +147,7 @@ an empty string.
144147
Since the dynamic string is reinitialized, there is no need to
145148
further call \fBTcl_DStringFree\fR on it and it can be reused without
146149
calling \fBTcl_DStringInit\fR. The caller must ensure that the dynamic
147-
string stored in \fIdsPtr\fR is encoded in Tcl's internal UTF-8 format.
150+
string stored in \fIdsPtr\fR is encoded in TUTF-8.
148151
.PP
149152
\fBTcl_DStringGetResult\fR does the opposite of \fBTcl_DStringResult\fR.
150153
It sets the value of \fIdsPtr\fR to the result of \fIinterp\fR and
@@ -160,7 +163,7 @@ copying the string. Since the dynamic string is reinitialized, there is no need
160163
to further call \fBTcl_DStringFree\fR on it and it can be reused without calling
161164
\fBTcl_DStringInit\fR. The returned \fBTcl_Obj\fR has a reference count of 0.
162165
The caller must ensure that the dynamic string stored in \fIdsPtr\fR is encoded
163-
in Tcl's internal UTF-8 format.
166+
in TUTF-8.
164167

165168
.SH KEYWORDS
166169
append, dynamic string, free, result

doc/Encoding.3

Lines changed: 32 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -78,11 +78,11 @@ Name of encoding to get token for.
7878
Points to storage where encoding token is to be written.
7979
.AP "const char" *src in
8080
For the \fBTcl_ExternalToUtf\fR functions, an array of bytes in the
81-
specified encoding that are to be converted to UTF-8. For the
82-
\fBTcl_UtfToExternal\fR function, an array of
83-
UTF-8 characters to be converted to the specified encoding.
81+
specified encoding that are to be converted to TUTF-8. For the
82+
\fBTcl_UtfToExternal\fR function, a TUTF-8 byte sequence
83+
to be converted to the specified encoding.
8484
.AP "const TCHAR" *tsrc in
85-
An array of Windows TCHAR characters to convert to UTF-8.
85+
An array of Windows TCHAR characters to convert to TUTF-8.
8686
.AP Tcl_Size srcLen in
8787
Length of \fIsrc\fR or \fItsrc\fR in bytes. If the length is negative, the
8888
encoding-specific length of the string is used.
@@ -104,7 +104,6 @@ control the encoding profile to be used for dealing with invalid data or
104104
other errors in the encoding transform.
105105
The flag \fBTCL_ENCODING_STOPONERROR\fR has no effect,
106106
it only has meaning in Tcl 8.x.
107-
.PP
108107
Some flags bits may not be usable with some functions as noted in the
109108
function descriptions below.
110109
.AP Tcl_EncodingState *statePtr in/out
@@ -146,25 +145,24 @@ A path to the location of the encoding file.
146145
.BE
147146
.SH INTRODUCTION
148147
.PP
149-
These routines convert between Tcl's internal character representation,
150-
UTF-8, and character representations used by various operating systems or
151-
file systems, such as Unicode, ASCII, or Shift-JIS. When operating on
152-
strings, such as such as obtaining the names of files or displaying
153-
characters using international fonts, the strings must be translated into
154-
one or possibly multiple formats that the various system calls can use. For
155-
instance, on a Japanese Unix workstation, a user might obtain a filename
156-
represented in the EUC-JP file encoding and then translate the characters to
157-
the jisx0208 font encoding in order to display the filename in a Tk widget.
158-
The purpose of the encoding package is to help bridge the translation gap.
159-
UTF-8 provides an intermediate staging ground for all the various
160-
encodings. In the example above, text would be translated into UTF-8 from
161-
whatever file encoding the operating system is using. Then it would be
162-
translated from UTF-8 into whatever font encoding the display routines
163-
require.
164-
.PP
165-
Some basic encodings are compiled into Tcl. Others can be defined by the
166-
user or dynamically loaded from encoding files in a
167-
platform-independent manner.
148+
N.B. Refer to the \fBTcl_UniChar\fR documentation page for a description of the
149+
\fITUTF-8\fR encoding and related terms referenced here.
150+
.PP
151+
These routines convert between TUTF-8
152+
and character representations using encodings such as
153+
standard UTF-8, UTF-16, ASCII, or Shift-JIS that might be expected by system
154+
interfaces or other software components. For instance, on a Japanese Unix
155+
workstation, a user might obtain a filename represented in the EUC-JP file
156+
encoding and then translate the characters to the jisx0208 font encoding in
157+
order to display the filename in a Tk widget. The purpose of the encoding
158+
package is to help bridge the translation gap. TUTF-8 provides an intermediate
159+
staging ground for all the various encodings. In the example above, text would
160+
be translated into TUTF-8 from whatever file encoding the operating system is
161+
using. Then it would be translated from TUTF-8 into whatever font encoding the
162+
display routines require.
163+
.PP
164+
Some basic encodings are compiled into Tcl. Others can be defined by the user or
165+
dynamically loaded from encoding files in a platform-independent manner.
168166
.SH DESCRIPTION
169167
.PP
170168
\fBTcl_GetEncoding\fR finds an encoding given its \fIname\fR. The name may
@@ -202,7 +200,7 @@ on the resulting encoding token when that token will no longer be
202200
used.
203201
.PP
204202
\fBTcl_ExternalToUtfDString\fR converts a source buffer \fIsrc\fR from the
205-
specified \fIencoding\fR into UTF-8. The converted bytes are stored in
203+
specified \fIencoding\fR into TUTF-8. The converted bytes are stored in
206204
\fIdstPtr\fR, which is then null-terminated. The caller should eventually
207205
call \fBTcl_DStringFree\fR to free any information stored in \fIdstPtr\fR.
208206
When converting, if any of the characters in the source buffer cannot be
@@ -231,7 +229,7 @@ The caller must call \fBTcl_DStringFree\fR to free up the \fB*dstPtr\fR resource
231229
irrespective of the return value from the function.
232230
.PP
233231
\fBTcl_ExternalToUtf\fR converts a source buffer \fIsrc\fR from the specified
234-
\fIencoding\fR into UTF-8. Up to \fIsrcLen\fR bytes are converted from the
232+
\fIencoding\fR into TUTF-8. Up to \fIsrcLen\fR bytes are converted from the
235233
source buffer and up to \fIdstLen\fR converted bytes are stored in \fIdst\fR.
236234
In all cases, \fI*srcReadPtr\fR is filled with the number of bytes that were
237235
successfully converted from \fIsrc\fR and \fI*dstWrotePtr\fR is filled with
@@ -259,7 +257,7 @@ The source buffer contained a character that could not be represented in
259257
the target encoding.
260258
.RE
261259
.LP
262-
\fBTcl_UtfToExternalDString\fR converts a source buffer \fIsrc\fR from UTF-8
260+
\fBTcl_UtfToExternalDString\fR converts a source buffer \fIsrc\fR from TUTF-8
263261
into the specified \fIencoding\fR. The converted bytes are stored in
264262
\fIdstPtr\fR, which is then terminated with the appropriate encoding-specific
265263
null. The caller should eventually call \fBTcl_DStringFree\fR to free any
@@ -269,7 +267,7 @@ encoding, a default fallback character will be used. The return value is
269267
a pointer to the value stored in the DString.
270268
.PP
271269
\fBTcl_UtfToExternalDStringEx\fR is an enhanced version of
272-
\fBTcl_UtfToExternalDString\fR that transforms UTF-8 encoded source data to a
270+
\fBTcl_UtfToExternalDString\fR that transforms TUTF-8 encoded source data to a
273271
specified \fIencoding\fR. Except for the direction of the transform, the
274272
parameters and return values are identical to those of
275273
\fBTcl_ExternalToUtfDStringEx\fR. See
@@ -278,7 +276,7 @@ that function above for details about the same.
278276
Irrespective of the return code from the function, the caller must free
279277
resources associated with \fB*dstPtr\fR when the function returns.
280278
.PP
281-
\fBTcl_UtfToExternal\fR converts a source buffer \fIsrc\fR from UTF-8 into
279+
\fBTcl_UtfToExternal\fR converts a source buffer \fIsrc\fR from TUTF-8 into
282280
the specified \fIencoding\fR. Up to \fIsrcLen\fR bytes are converted from
283281
the source buffer and up to \fIdstLen\fR converted bytes are stored in
284282
\fIdst\fR. In all cases, \fI*srcReadPtr\fR is filled with the number of
@@ -322,7 +320,7 @@ exist.
322320
.PP
323321
\fBTcl_CreateEncoding\fR defines a new encoding and registers the C
324322
procedures that are called back to convert between the encoding and
325-
UTF-8. Encodings created by \fBTcl_CreateEncoding\fR are thereafter
323+
TUTF-8. Encodings created by \fBTcl_CreateEncoding\fR are thereafter
326324
visible in the database used by \fBTcl_GetEncoding\fR. Just as with the
327325
\fBTcl_GetEncoding\fR procedure, the return value is a token that
328326
represents the encoding and can be used in subsequent calls to other
@@ -335,7 +333,7 @@ encoding procedures.
335333
.PP
336334
The \fItypePtr\fR argument to \fBTcl_CreateEncoding\fR contains information
337335
about the name of the encoding and the procedures that will be called to
338-
convert between this encoding and UTF-8. It is defined as follows:
336+
convert between this encoding and TUTF-8. It is defined as follows:
339337
.PP
340338
.CS
341339
typedef struct {
@@ -351,9 +349,9 @@ typedef struct {
351349
The \fIencodingName\fR provides a string name for the encoding, by
352350
which it can be referred in other procedures such as
353351
\fBTcl_GetEncoding\fR. The \fItoUtfProc\fR refers to a callback
354-
procedure to invoke to convert text from this encoding into UTF-8.
352+
procedure to invoke to convert text from this encoding into TUTF-8.
355353
The \fIfromUtfProc\fR refers to a callback procedure to invoke to
356-
convert text from UTF-8 into this encoding. The \fIfreeProc\fR refers
354+
convert text from TUTF-8 into this encoding. The \fIfreeProc\fR refers
357355
to a callback procedure to invoke when this encoding is deleted. The
358356
\fIfreeProc\fR field may be NULL. The \fIclientData\fR contains an
359357
arbitrary one-word value passed to \fItoUtfProc\fR, \fIfromUtfProc\fR,
@@ -513,7 +511,7 @@ FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
513511
.CE
514512
.PP
515513
The third line of the file is three numbers. The first number is the
516-
fallback character (in base 16) to use when converting from UTF-8 to this
514+
fallback character (in base 16) to use when converting from TUTF-8 to this
517515
encoding. The second number is a \fB1\fR if this file represents the
518516
encoding for a symbol font, or \fB0\fR otherwise. The last number (in base
519517
10) is how many pages of data follow.
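
The three-number header line described in this hunk could be read as in the sketch below. The `EncHeader` struct and `parse_enc_header` function are hypothetical names introduced for illustration, not Tcl's own loader, and the sample line in the usage note is an assumed example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical holder for the three header numbers; not a Tcl type. */
typedef struct {
    unsigned int fallback;  /* fallback character, written in base 16 */
    int isSymbol;           /* 1 for a symbol-font encoding, else 0 */
    int numPages;           /* number of data pages that follow, base 10 */
} EncHeader;

/* Returns 1 if the line holds all three numbers, 0 otherwise. */
int parse_enc_header(const char *line, EncHeader *hdr)
{
    assert(line != NULL && hdr != NULL);
    return sscanf(line, "%x %d %d",
                  &hdr->fallback, &hdr->isSymbol, &hdr->numPages) == 3;
}
```

For an assumed header line such as `003F 0 1`, this would report a fallback of U+003F, a non-symbol encoding, and one page of data.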

doc/Eval.3

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -62,13 +62,16 @@ The number of bytes in \fIscript\fR, not including any
6262
null terminating character. If \-1, then all characters up to the
6363
first null byte are used.
6464
.AP "const char" *script in
65-
Points to first byte of script to execute (null-terminated and UTF-8).
65+
Points to first byte of script to execute (null-terminated TUTF-8 byte sequence).
6666
.AP "const char" *part in
6767
String forming part of a Tcl script.
6868
.BE
6969

7070
.SH DESCRIPTION
7171
.PP
72+
N.B. Refer to the \fBTcl_UniChar\fR documentation page for a description of the
73+
\fITUTF-8\fR encoding and related terms referenced here.
74+
.PP
7275
The procedures described here are invoked to execute Tcl scripts in
7376
various forms.
7477
\fBTcl_EvalObjEx\fR is the core procedure and is used by many of the others.
@@ -113,7 +116,7 @@ elements of \fIobjv\fR, insuring that the values are valid until
113116
.PP
114117
\fBTcl_Eval\fR is similar to \fBTcl_EvalObjEx\fR except that the script to
115118
be executed is supplied as a string instead of a value and no compilation
116-
occurs. The string should be a proper UTF-8 string as converted by
119+
occurs. The string should be a proper TUTF-8 byte sequence as converted by
117120
\fBTcl_ExternalToUtfDString\fR or \fBTcl_ExternalToUtf\fR when it is known
118121
to possibly contain upper ASCII characters whose possible combinations
119122
might be a UTF-8 special code. The string is parsed and executed directly

doc/Object.3

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,9 @@ must have been the result of a previous call to \fBTcl_NewObj\fR.
3838
.BE
3939
.SH INTRODUCTION
4040
.PP
41+
N.B. Refer to the \fBTcl_UniChar\fR documentation page for a description of the
42+
\fITUTF-8\fR encoding and related terms referenced here.
43+
.PP
4144
This man page presents an overview of Tcl values (called \fBTcl_Obj\fRs for
4245
historical reasons) and how they are used.
4346
It also describes generic procedures for managing Tcl values.
@@ -140,9 +143,7 @@ typedef struct {
140143
.CE
141144
.PP
142145
The \fIbytes\fR and the \fIlength\fR members together hold
143-
a value's UTF-8 string representation,
144-
which is a \fIcounted string\fR not containing null bytes (UTF-8 null
145-
characters should be encoded as a two byte sequence: 192, 128.)
146+
a value's TUTF-8 byte sequence representation.
146147
\fIbytes\fR points to the first byte of the string representation.
147148
The \fIlength\fR member gives the number of bytes.
148149
The byte array must always have a null byte after the last data byte,

doc/OpenFileChnl.3

Lines changed: 15 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -217,6 +217,9 @@ New value for the option given by \fIoptionName\fR.
217217
.BE
218218
.SH DESCRIPTION
219219
.PP
220+
N.B. Refer to the \fBTcl_UniChar\fR documentation page for a description of the
221+
\fITUTF-8\fR encoding and related terms referenced here.
222+
.PP
220223
The Tcl channel mechanism provides a device-independent and
221224
platform-independent mechanism for performing buffered input
222225
and output operations on a variety of file, socket, and device
@@ -415,7 +418,7 @@ corresponding calls to \fBTcl_UnregisterChannel\fR.
415418
.SH "TCL_READCHARS AND TCL_READ"
416419
.PP
417420
\fBTcl_ReadChars\fR consumes bytes from \fIchannel\fR, converting the bytes
418-
to UTF-8 based on the channel's encoding and storing the produced data in
421+
to TUTF-8 based on the channel's encoding and storing the produced data in
419422
\fIreadObjPtr\fR's string representation. The return value of
420423
\fBTcl_ReadChars\fR is the number of characters, up to \fIcharsToRead\fR,
421424
that were stored in \fIreadObjPtr\fR. If an error occurs while reading, the
@@ -448,14 +451,14 @@ platform-specific modes are described in the manual entry for the Tcl
448451
\fBfconfigure\fR command.
449452
.PP
450453
As a performance optimization, when reading from a channel with the encoding
451-
\fBbinary\fR, the bytes are not converted to UTF-8 as they are read.
454+
\fBbinary\fR, the bytes are not converted to TUTF-8 as they are read.
452455
Instead, they are stored in \fIreadObjPtr\fR's internal representation as a
453456
byte-array value. The string representation of this value will only be
454457
constructed if it is needed (e.g., because of a call to
455458
\fBTcl_GetStringFromObj\fR). In this way, byte-oriented data can be read
456459
from a channel, manipulated by calling \fBTcl_GetByteArrayFromObj\fR and
457460
related functions, and then written to a channel without the expense of ever
458-
converting to or from UTF-8.
461+
converting to or from TUTF-8.
459462
.PP
460463
\fBTcl_Read\fR is similar to \fBTcl_ReadChars\fR, except that it does not do
461464
encoding conversions, regardless of the channel's encoding. It is deprecated
@@ -476,8 +479,8 @@ channel drivers, i.e. drivers used in the middle of a stack of
476479
channels, to move data from the channel below into the transformation.
477480
.SH "TCL_GETSOBJ AND TCL_GETS"
478481
.PP
479-
\fBTcl_GetsObj\fR consumes bytes from \fIchannel\fR, converting the bytes to
480-
UTF-8 based on the channel's encoding, until a full line of input has been
482+
\fBTcl_GetsObj\fR consumes bytes from \fIchannel\fR, decoding them
483+
based on the channel's encoding, until a full line of input has been
481484
seen. If the channel's encoding is \fBbinary\fR, each byte read from the
482485
channel is treated as an individual Unicode character. All of the
483486
characters of the line except for the terminating end-of-line character(s)
@@ -497,9 +500,9 @@ end-of-line character. When -1 is returned, the \fBTcl_InputBlocked\fR
497500
procedure may be invoked to determine if the channel is blocked because
498501
of input unavailability.
499502
.PP
500-
\fBTcl_Gets\fR is the same as \fBTcl_GetsObj\fR except the resulting
501-
characters are appended to the dynamic string given by
502-
\fIlineRead\fR rather than a Tcl value.
503+
\fBTcl_Gets\fR works similarly to \fBTcl_GetsObj\fR except the read bytes
504+
are appended as a TUTF-8 encoded byte sequence to the dynamic string
505+
given by \fIlineRead\fR rather than a Tcl value.
503506
.SH "TCL_UNGETS"
504507
.PP
505508
\fBTcl_Ungets\fR is used to add data to the input queue of a channel,
@@ -515,7 +518,7 @@ added to the input queue. \fBTcl_Ungets\fR returns \fIinputLen\fR or
515518
.SH "TCL_WRITECHARS, TCL_WRITEOBJ, AND TCL_WRITE"
516519
.PP
517520
\fBTcl_WriteChars\fR accepts \fIbytesToWrite\fR bytes of character data at
518-
\fIcharBuf\fR. The UTF-8 characters in the buffer are converted to the
521+
\fIcharBuf\fR. The TUTF-8 encoded bytes in the buffer are converted to the
519522
channel's encoding and queued for output to \fIchannel\fR. If
520523
\fIbytesToWrite\fR is negative, \fBTcl_WriteChars\fR expects \fIcharBuf\fR
521524
to be null-terminated and it outputs everything up to the null.
@@ -539,16 +542,16 @@ channel. This is done even if the channel has no encoding.
539542
.PP
540543
\fBTcl_WriteObj\fR is similar to \fBTcl_WriteChars\fR except it
541544
accepts a Tcl value whose contents will be output to the channel. The
542-
UTF-8 characters in \fIwriteObjPtr\fR's string representation are converted
545+
characters in \fIwriteObjPtr\fR's string representation are converted
543546
to the channel's encoding and queued for output to \fIchannel\fR.
544547
As a performance optimization, when writing to a channel with the encoding
545-
\fBbinary\fR, UTF-8 characters are not converted as they are written.
548+
\fBbinary\fR, characters are not converted as they are written.
546549
Instead, the bytes in \fIwriteObjPtr\fR's internal representation as a
547550
byte-array value are written to the channel. The byte-array representation
548551
of the value will be constructed if it is needed. In this way,
549552
byte-oriented data can be read from a channel, manipulated by calling
550553
\fBTcl_GetByteArrayFromObj\fR and related functions, and then written to a
551-
channel without the expense of ever converting to or from UTF-8.
554+
channel without the expense of encoding conversion.
552555
.PP
553556
\fBTcl_Write\fR is similar to \fBTcl_WriteChars\fR except that it does not do
554557
encoding conversions, regardless of the channel's encoding. It is

doc/StringObj.3

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -85,11 +85,12 @@ Tcl_Obj *
8585
.SH ARGUMENTS
8686
.AS "const Tcl_UniChar" *appendObjPtr in/out
8787
.AP "const char" *bytes in
88-
Points to the first byte of an array of UTF-8-encoded bytes
89-
used to set or append to a string value.
88+
Points to a TUTF-8 (Tcl's internal modified UTF-8 encoding)
89+
byte sequence used to set or append to a string value.
9090
This byte array may contain embedded null characters
9191
unless \fInumChars\fR is negative. (Applications needing null bytes
92-
should represent them as the two-byte sequence \fI\e300\e200\fR, use
92+
should represent them as the two-byte sequence \fI0xC0 0x80\fR used
93+
in TUTF-8 to represent a null byte, use
9394
\fBTcl_ExternalToUtf\fR to convert, or \fBTcl_NewByteArrayObj\fR if
9495
the string is a collection of uninterpreted bytes.)
9596
.AP Tcl_Size length in
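
The advice about null bytes in the \fIbytes\fR argument can be sketched as a preprocessing step. This is only an illustration: `escape_nulls` is a hypothetical helper, not a Tcl function, and it assumes the input is already valid UTF-8 apart from its embedded zero bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Tcl): copies n bytes from src to dst,
 * replacing every zero byte with the pair 0xC0 0x80. All other bytes are
 * copied unchanged. dst must hold at least 2*n + 1 bytes.
 * Returns the number of bytes written, not counting the terminator.
 */
size_t escape_nulls(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] == 0) {
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = 0;   /* the result contains no other zero bytes */
    return out;
}
```

Because the result has no zero bytes before its terminator, it remains usable where a null-terminated string is expected. For uninterpreted binary data, the text above points to \fBTcl_NewByteArrayObj\fR instead.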
