Skip to content

Description and implementation of YAML::PP::Schema::Binary does not make sense #28

@pali

Description

@pali

First description:

By prepending a base64 encoded binary string with the C<!!binary> tag, it can
be automatically decoded when loading.
If you are using this schema, any string containing C<[\x{7F}-\x{10FFFF}]>
will be dumped as binary. That also includes encoded utf8.

encode_base64() is a function: {0x00..0xFF}ᴺ →{0x2B, 0x2F, 0x30..0x39, 0x41..0x5A, 0x61..7A}ᴹ
So YAML::PP::Schema::Binary obviously cannot take string which contains {0x000100..0x10FFFF}.

Binary data are defined as stream of octets, therefore from set {0x00..0xFF} like encode_base64() function takes it.

And next implementation:

my $binary = $node->{value};
unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/) {
# ASCII
return;
}
if (utf8::is_utf8($binary)) {
# utf8
return;
}
# everything else must be base64 encoded
my $base64 = encode_base64($binary);

unless ($binary =~ m/[\x{7F}-\x{10FFFF}]/)is equivalent to if ($binary =~ m/[\x{00}-\x{7E}]/) checks for all 7-bit ASCII characters except 7-bit ASCII \x{7F}. Comment says that this code is for ASCII which is not truth as it is ASCII ∖ {0x7F}.

Next there is check if (utf8::is_utf8($binary)) which in our case is basically:
if ($binary !~ m/[\x00-\xFF]/ and (int(rand(2))%2 == 1 or $binary =~ m/[\x80-\xFF]/))
So this code always skips strings with values from set {0x000100..0x10FFFF} and then (pseudo)-randomly (depending on internal representation of scalar) also skips strings from set {0x80..0xFF}. Comment says that this code is for utf8, but it is not truth, see documentation and my above simplified implementation.

And finally it calls encode_base64 function. When this function is called? Strings with only {0x00..0x7E} are ignored by first ASCII check. Then by second checks are always ignored strings which have at least one value from {0x000100..0x10FFFF}. So encode_base64 is called when string contains at least one value 0x7F or when there is at least one value from set {0x80..0xFF} and is_utf8 (pseudo)-randomly returned false.

Suggested fix

  1. Change YAML::PP::Schema::Binary module to work only with binary data as name suggests. Therefore with strings which contains only values from set {0x00..0xFF}.

  2. Use encode_base64() always when input string is non-7-bit ASCII, therefore contains at least one value from set {0x80..0xFF}.

  3. Decide if Base64 encoding is really needed for strings with character 0x7F. It is 7-bit ASCII therefore in 7-bit applications it is not needed special treatment for it. But I'm not sure if YAML needs special treatment of 0x7F or not.

CC @2shortplanks Please look at it and ideally do some correction if I wrote some mistake. Some parts I simplified to make it more easier to understand.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions