-
Notifications
You must be signed in to change notification settings - Fork 8
Description
First description:
YAML-PP-p5/lib/YAML/PP/Schema/Binary.pm
Lines 77 to 81 in 78a0506
By prepending a base64 encoded binary string with the C<!!binary> tag, it can | |
be automatically decoded when loading. | |
If you are using this schema, any string containing C<[\x{7F}-\x{10FFFF}]> | |
will be dumped as binary. That also includes encoded utf8. |
encode_base64()
is a function: {0x00..0xFF}ᴺ →{0x2B, 0x2F, 0x30..0x39, 0x41..0x5A, 0x61..7A}ᴹ
So YAML::PP::Schema::Binary obviously cannot take string which contains {0x000100..0x10FFFF}.
Binary data are defined as stream of octets, therefore from set {0x00..0xFF} like encode_base64()
function takes it.
And next implementation:
YAML-PP-p5/lib/YAML/PP/Schema/Binary.pm
Lines 29 to 39 in 78a0506
my $binary = $node->{value}; | |
unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/) { | |
# ASCII | |
return; | |
} | |
if (utf8::is_utf8($binary)) { | |
# utf8 | |
return; | |
} | |
# everything else must be base64 encoded | |
my $base64 = encode_base64($binary); |
unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/)
is equivalent to if ($binary =~ m/[\x{00}-\x{7E}]/)
checks for all 7-bit ASCII characters except 7-bit ASCII \x{7F}
. Comment says that this code is for ASCII
which is not truth as it is ASCII
∖ {0x7F}.
Next there is check if (utf8::is_utf8($binary))
which in our case is basically:
if ($binary !~ m/[\x00-\xFF]/ and (int(rand(2))%2 == 1 or $binary =~ m/[\x80-\xFF]/))
So this code always skips strings with values from set {0x000100..0x10FFFF} and then (pseudo)-randomly (depending on internal representation of scalar) also skips strings from set {0x80..0xFF}. Comment says that this code is for utf8
, but it is not truth, see documentation and my above simplified implementation.
And finally it calls encode_base64
function. When this function is called? Strings with only {0x00..0x7E} are ignored by first ASCII
check. Then by second checks are always ignored strings which have at least one value from {0x000100..0x10FFFF}. So encode_base64
is called when string contains at least one value 0x7F
or when there is at least one value from set {0x80..0xFF} and is_utf8
(pseudo)-randomly returned false.
Suggested fix
-
Change YAML::PP::Schema::Binary module to work only with binary data as name suggests. Therefore with strings which contains only values from set {0x00..0xFF}.
-
Use encode_base64() always when input string is non-7-bit ASCII, therefore contains at least one value from set {0x80..0xFF}.
-
Decide if Base64 encoding is really needed for strings with character 0x7F. It is 7-bit ASCII therefore in 7-bit applications it is not needed special treatment for it. But I'm not sure if YAML needs special treatment of 0x7F or not.
CC @2shortplanks Please look at it and ideally do some correction if I wrote some mistake. Some parts I simplified to make it more easier to understand.