Bruce Mitchener edited this page Jun 4, 2014 · 7 revisions

Design Issues:

The overall idea here is:

  • Make sure we have a distinction between text and a sequence of bytes. Text is a <string> or <unicode-string>. A sequence of bytes is a <byte-vector> or a <buffer>.
  • <unicode-string> should have a single, defined encoding: either UTF-8 or UCS-4 (to be determined).
  • Sequences of bytes have an encoding associated with them (even if only through intent and not structurally). They must be decoded to get a <string>.
  • <byte-string> should probably die.
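
The text/bytes split above can be illustrated outside Dylan. The following is a minimal sketch in Python, whose bytes and str types stand in for <byte-vector> and <string>; it is not a proposal for the Dylan API. It shows that the same bytes yield different text depending on the encoding assumed, which is why the encoding has to be known at decode time.

```python
# Illustration only: Python's bytes/str stand in for <byte-vector>/<string>.
data = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])  # "café" encoded as UTF-8

# Decoding with the intended encoding recovers the text.
as_utf8 = data.decode("utf-8")

# Decoding with the wrong encoding does not fail; it silently yields
# different text, because every byte is a valid Latin-1 character.
as_latin1 = data.decode("latin-1")

# Bytes and text cannot be mixed without an explicit conversion.
try:
    data + "x"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(len(data), len(as_utf8), len(as_latin1))
```

Five bytes become four characters under UTF-8 but five under Latin-1, so the byte count and the character count are only loosely related.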

API changes:

  • You can currently copy between a <byte-vector> and a <string> (using copy-bytes), but it isn't clear that this should be possible.
  • Some things are specialized on <byte-string>, which ideally is going away.
  • Is the I/O functionality for streams and files sufficient to deal with text and byte buffers being distinct things?
  • What impact will this have on the C-FFI? (Which deals with <C-string> classes now.)

Other notes:

  • Source files and LID files should be defined to be UTF-8 encoded. We should not support alternate encodings for source text.

Getting Started

While we work out the above, there's a good bit of work that can be done initially.

  • <unicode-integer> should become an <integer> (or at least something of the right size rather than <double-byte>).
  • We are using tag 3 for <unicode-character>. Each of the compiler backends and runtimes needs to be aware of this and be double-checked for correctness. (This includes verifying things like the implementation of primitive-unicode-character-as-raw and primitive-raw-as-unicode-character.)
  • Evaluate the impact of the compiler not being aware of the <unicode-string> in the way that it is aware of <byte-string>. What optimizations are missing due to this?
  • Work on the unicode-data-generator, in particular, issues identified in sources/app/unicode-data-generator/TODO.
  • Determine what Unicode functionality needs to be present in the core runtime and libraries to implement the functionality required by the DRM. (Things like uppercase, lowercase.)
  • Figure out what to do about improved case handling, like having title case alongside the existing uppercase and lowercase code.
  • Limit <unicode-character> to a 32-bit value where we can, rather than a word-sized one. (This is important for 64-bit platforms, but is less important in the short term than just getting the algorithms working.)
  • Implement the Unicode algorithms in the strings library and the core runtime as appropriate. We can look at the Common Lisp code being written for GSOC this year. See the notes below about this.
  • Improve our test coverage of Unicode stuff. (This can partially borrow from the GSOC work below.)
  • Figure out what encoders and decoders should look like. Write a UTF-8 encoder / decoder. Write other encoders (like UTF-16). See additional notes below.
  • Make streams work well with encoders. (Not sure what that means.)
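
To make the case-handling items above concrete, here is a small sketch using Python's built-in Unicode case mappings. It is an illustration only and says nothing about what the Dylan functions should be called. It shows a character whose title case differs from its uppercase, and a mapping that changes the length of the string, both of which a simple one-character-in, one-character-out case API cannot express.

```python
import unicodedata

# U+01C6 LATIN SMALL LETTER DZ WITH CARON has three distinct case forms.
dz_lower = "\u01c6"
dz_upper = dz_lower.upper()   # U+01C4, both letters capitalized
dz_title = dz_lower.title()   # U+01C5, only the first letter capitalized

# The title case form has its own general category, "Lt".
title_category = unicodedata.category(dz_title)

# Full case mapping can change the number of characters:
# U+00DF LATIN SMALL LETTER SHARP S uppercases to two characters.
sharp_s_upper = "\u00df".upper()

print(hex(ord(dz_upper)), hex(ord(dz_title)), title_category, sharp_s_upper)
```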

GSOC SBCL / Unicode project

The work being done for the SBCL / Unicode project for GSOC (2014) is currently in https://github.com/krzysz00/sbcl/tree/unicode-algorithms. The important files are tools-for-build/ucd.lisp, src/code/target-{char,unicode}.lisp and tests/unicode*.

Limiting <unicode-character> to 32 bits

When we limit the size of <unicode-character> to 32 bits, we'll have to revisit some code that deals with repeated slots and limited vectors.
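
As a rough illustration of why the element size matters, the sketch below (Python, illustration only) packs the same code points as 16-bit, 32-bit and 64-bit elements. The 64-bit case assumes a word-sized element on a 64-bit platform. A 16-bit element cannot hold code points above U+FFFF, while a 32-bit element holds every code point at half the cost of a 64-bit word.

```python
import struct

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

text = "a\u00e9\U0001F600"  # the last character lies outside the BMP
code_points = [ord(ch) for ch in text]

# Explicit little-endian formats have fixed sizes: H=2, I=4, Q=8 bytes.
as_32_bit = struct.pack("<%dI" % len(code_points), *code_points)
as_64_bit = struct.pack("<%dQ" % len(code_points), *code_points)


def fits_in_16_bits(code_point):
    """Return True if the code point can be stored in a 16-bit element."""
    try:
        struct.pack("<H", code_point)
        return True
    except struct.error:
        return False


print(len(as_32_bit), len(as_64_bit), fits_in_16_bits(0x1F600))
```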

In the HARP backend, there is some code like this:

let op--slot-element =
  select(repeated-representation-size(type))
    1 => op--byte-element;
    2 => op--double-byte-element;
    otherwise => op--repeated-slot-element;
  end select;

We'll need to fix that and look for similar code and issues in the generic DFMC code as well as the C and LLVM backends.

We may also have to update the implementations of primitive-unicode-character-as-raw to extend the value to a word-sized value for the raw object. (See the LLVM implementation of primitive-byte-character-as-raw.)
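
The conversion between a 32-bit stored element and a tagged, word-sized character can be modelled as follows. This is a sketch in Python, not the runtime's code. It assumes that the tag occupies the two low bits of the word and that the machine word is 64 bits wide; only the tag value 3 comes from these notes. The function names echo the primitives but are hypothetical Python stand-ins.

```python
TAG_BITS = 2                 # assumption: the tag lives in the two low bits
UNICODE_CHARACTER_TAG = 3    # "tag 3", as stated in these notes
WORD_MASK = (1 << 64) - 1    # assumption: 64-bit machine word


def raw_as_unicode_character(raw):
    """Shift the raw code point up and set the tag bits."""
    return ((raw << TAG_BITS) | UNICODE_CHARACTER_TAG) & WORD_MASK


def unicode_character_as_raw(tagged):
    """Check the tag, then strip it to recover the raw code point."""
    if tagged & ((1 << TAG_BITS) - 1) != UNICODE_CHARACTER_TAG:
        raise ValueError("not a tagged unicode character")
    return tagged >> TAG_BITS


def load_element(storage, index):
    """Read a 32-bit little-endian element; the result is zero-extended."""
    start = 4 * index
    return int.from_bytes(storage[start:start + 4], "little")


# Two characters stored as 32-bit elements: U+0041 and U+1F600.
storage = (0x41).to_bytes(4, "little") + (0x1F600).to_bytes(4, "little")
tagged = raw_as_unicode_character(load_element(storage, 1))
print(hex(tagged), hex(unicode_character_as_raw(tagged)))
```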

Encoders and Decoders

These should support translating between <character> / <string> and <byte-vector>. However, there are some other concerns:

  • Some things use <buffer> rather than <byte-vector>. Does that matter?
  • We may (eventually?) need -into! variants to reduce data copying.
  • Streams should have a single encoding.
  • We should default many things to UTF-8.
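
One possible shape for a UTF-8 encoder and decoder is sketched below in Python, as an illustration only. The names encode_utf8_into, encode_utf8 and decode_utf8 are hypothetical and not an existing or proposed Dylan API. The "into" function writes into a caller-supplied buffer and returns the new offset, which is the kind of copy-avoiding variant mentioned above; the decoder is strict and rejects truncated, overlong and surrogate sequences.

```python
def encode_utf8_into(buffer, offset, cp):
    """Write code point cp into buffer at offset; return the next offset."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        buffer[offset] = cp
        return offset + 1
    if cp < 0x800:
        buffer[offset] = 0xC0 | (cp >> 6)
        buffer[offset + 1] = 0x80 | (cp & 0x3F)
        return offset + 2
    if cp < 0x10000:
        buffer[offset] = 0xE0 | (cp >> 12)
        buffer[offset + 1] = 0x80 | ((cp >> 6) & 0x3F)
        buffer[offset + 2] = 0x80 | (cp & 0x3F)
        return offset + 3
    buffer[offset] = 0xF0 | (cp >> 18)
    buffer[offset + 1] = 0x80 | ((cp >> 12) & 0x3F)
    buffer[offset + 2] = 0x80 | ((cp >> 6) & 0x3F)
    buffer[offset + 3] = 0x80 | (cp & 0x3F)
    return offset + 4


def encode_utf8(text):
    """Allocate a worst-case buffer, fill it, and trim to the used length."""
    buffer = bytearray(4 * len(text))
    end = 0
    for ch in text:
        end = encode_utf8_into(buffer, end, ord(ch))
    return bytes(buffer[:end])


def decode_utf8(data):
    """Strictly decode UTF-8 bytes into a string."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # cp: payload bits of the lead byte; need: continuation bytes;
        # minimum: smallest code point allowed for this sequence length.
        if b0 < 0x80:
            cp, need, minimum = b0, 0, 0
        elif 0xC0 <= b0 < 0xE0:
            cp, need, minimum = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 < 0xF0:
            cp, need, minimum = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 < 0xF8:
            cp, need, minimum = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError("invalid lead byte")
        if i + need >= n:
            raise ValueError("truncated sequence")
        for k in range(1, need + 1):
            b = data[i + k]
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("overlong or out-of-range sequence")
        out.append(chr(cp))
        i += need + 1
    return "".join(out)


sample = "A\u00e9\u20ac\U0001F600"  # 1-, 2-, 3- and 4-byte sequences
encoded = encode_utf8(sample)
print(encoded.hex(), decode_utf8(encoded) == sample)
```

Because the "into" variant takes a plain indexable buffer and an offset, the same function would serve either a byte vector or a stream buffer, which bears on the <buffer> versus <byte-vector> question above.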