Skip to content

Latest commit

 

History

History
49 lines (40 loc) · 1.77 KB

CL-JSON的编码问题.md

File metadata and controls

49 lines (40 loc) · 1.77 KB

CL-JSON的编码问题

tags: common lisp; json; cl-json; bug; unicode; Surrogate Pairs;

拿Common Lisp给IRC频道和Telegram群写传话Bot。想要支持表情功能,结果发现提取表情的字符串后,解码错误,搞了好久没搞定,真是扫兴。最后发现问题处在CL-JSON这个库上,那个库的decoder.lisp文件中,解析“\u”转义序列的代码是这样的:

((len rdx)
 (let ((code
        (let ((repr (make-string len)))
          (dotimes (i len)
            (setf (aref repr i) (read-char stream)))
          (handler-case (parse-integer repr :radix rdx)
            (parse-error ()
              (json-syntax-error stream esc-error-fmt
                                 (format nil "\\~C" c)
                                 repr))))))
   (restart-case
       (or (and (< code char-code-limit) (code-char code))
           (error 'no-char-for-code :code code))
...
...

显然,这里没有实现Surrogate Pairs,所以被编码成"\ud83d\ude03"形式的Emoji表情会被劈成2个字符分别读取,对Lisp来说,就变成乱码了。改天有空给他们修一修,先来个Dirty Hack。

xxx对于SBCL来说是Surrogate Pairs的两个字符(这时bug已经发生了,我们将错就错),如此可以得到原字符:

(progn
  (setf xxx (with-input-from-string (stream "\"\\uD83D\\uDE03\"")
              (cl-json:decode-json stream)))

  (princ (code-char
          (let ((c1 (char-code (aref xxx 0)))
                (c2 (char-code (aref xxx 1))))
            (+ #x10000
               (ash (logand #x03FF c1) 10)
               (logand #x03FF c2))))))

输出是:

😃
#\SMILING_FACE_WITH_OPEN_MOUTH

[1] http://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs