Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iconv always silent, -c broken #471

Open
enh-google opened this issue Dec 15, 2023 · 11 comments
Open

iconv always silent, -c broken #471

enh-google opened this issue Dec 15, 2023 · 11 comments

Comments

@enh-google
Copy link
Collaborator

going through help text, i noticed that iconv's -s isn't documented because it's always on, which seems like an unfortunate default.

more than that, the -c behavior isn't right either:

$ echo -e "a\x80b" | ./toybox iconv -t LATIN1 -f UTF-8 
a?b
$ echo -e "a\x80b" | ./toybox iconv -t LATIN1 -f UTF-8 -c
a$ 

(i'm trying not to get distracted from my more trivial goal, so i'll just file bugs for any weirdness i spot!)

@landley
Copy link
Owner

landley commented Dec 15, 2023

My devuan balseraph man page says "-s This option is ignored; it is provided only for compatibility", and providing it doesn't change the behavior here? (Do you want it to say "Illegal input sequence at position 1" like the other one, whether -s is provided or not, or to produce its current output but exit with an error if some chars couldn't be converted?)

I'm not getting a $ with -c, I'm getting "ab" from toybox iconv built against musl. (I think the binary in my $PATH should be the same one at https://landley.net/bin/toybox/latest/toybox-x86_64 ?)

Generally when an option flag in an optstr A) isn't documented, B) doesn't do anything, it's just there for compatibility. I don't have a way to mark these, nor do I have an entirely standard policy on them (patch lists -u and says it's ignored, but that's largely because "patch -u" was kind of a big deal for ages. I have a local help text redo that yanks the -u line but adds (-u is assumed and ignored) to second paragraph, haven't checked it in yet because... well it's a judgement call, innit?

I expect a non-zero number of people to learn to use a command from the toybox help text. (And presumably go on to read bigger docs eventually, but I still don't want to rule OUT people being able to do that.) And thus explicitly telling them "you don't need to know this, ignore it", while trying to be terse seems counterproductive.

@landley
Copy link
Owner

landley commented Dec 15, 2023

Try commit 53d8a67 maybe?

@enh-google
Copy link
Collaborator Author

Do you want it to say "Illegal input sequence at position 1" like the other one, whether -s is provided or not, or to produce its current output but exit with an error if some chars couldn't be converted?)

oh, that's sad. i'd believed https://pubs.opengroup.org/onlinepubs/009604499/utilities/iconv.html which has a much more useful description. but, yes, my linux box has the breakage you describe:

~$ echo -e "a\x80b" | iconv -t LATIN1 -f UTF-8
aiconv: illegal input sequence at position 1
~$ echo -e "a\x80b" | iconv -t LATIN1 -f UTF-8 -s
aiconv: illegal input sequence at position 1
~$ echo -e "a\x80b" | iconv -t LATIN1 -f UTF-8 -c
ab
~$ 

how about macOS?

~$ echo -e "a\x80b" | iconv -t LATIN1 -f UTF-8
iconv: iconv(): Illegal byte sequence
a~$ echo -e "a\x80b" | iconv -t LATIN1 -f UTF-8 -c
ab
~$ echo -e "a\x80b" | iconv -t LATIN1 -f UTF-8 -s
iconv: iconv(): Illegal byte sequence
a~$ 

so also not what i'd expected from POSIX, but different to the gnu one.

I'm not getting a $ with -c

that was my prompt :-)

here's toybox on macOS with your latest patch:

~/toybox-iconv$ echo -e "a\x80b" | ./toybox iconv -t LATIN1 -f UTF-8
a?b
~/toybox-iconv$ echo -e "a\x80b" | ./toybox iconv -t LATIN1 -f UTF-8 -c
ab
~/toybox-iconv$ echo -e "a\x80b" | ./toybox iconv -t LATIN1 -f UTF-8 -s
a?b
~/toybox-iconv$ 

that's certainly better (and i think the '?'s are reasonable and defensible). i'm less tempted by the error message now than i was when i believed that you could disable it with -s, but we should still probably have an error message?

POSIX certainly allows not having a message ("When -s is not used, the results of encountering invalid characters in the input stream (either those that are not valid characters in the codeset of the input file or that have no corresponding character in the codeset of the output file) shall be specified in the system documentation."), though both -c and -s say "The presence or absence of $FLAG shall not affect the exit status of iconv" ... but that's a "bug" you share with gnu and macOS, so i'd argue POSIX just isn't reflecting reality here.

@landley
Copy link
Owner

landley commented Dec 16, 2023

My commit changed the error return in case anybody's testing it (so a script CAN see it wasn't a full conversion), but I didn't output something that would stomp the conversion.

I can output the data to stdout and an error message to stderr, but for humans those would interleave and arguably be net worse? The question is whether partial results are useful. The error message kind of implies refusing to produce any useful output because one thing went wrong.

@enh-google
Copy link
Collaborator Author

The error message kind of implies refusing to produce any useful output because one thing went wrong.

which is basically the macOS (bsd?) behavior. i suspect their argument would be "if you wanted us to either ignore transliteration failures, or use a replacement character, you should have suffixed the encoding name with //IGNORE or whatever".

although it surprised me (and if we go this way, the --help output -- if not the error message! -- should definitely mention how you get out of this), i'm coming round to it (modulo quality of docs/errors) possibly being the least-worst of the options.

that said ... i've personally only ever used iconv(1) to test bionic's iconv(3).

a quick code search of all the code i have access to turned up 80% of call sites being from one UTF to another (8 to 16, say, or 16 to 32). there were a few lossy conversions to ascii, but they were mostly -t ascii//TRANSLIT anyway. i saw one -t UTF-8//TRANSLIT that was presumably someone who had actually seen invalid byte sequences in the input they were dealing with.

@landley
Copy link
Owner

landley commented Dec 17, 2023 via email

@enh-google
Copy link
Collaborator Author

I don't think we're checking for //TRANSLIT or //IGNORE. All that stuff is
passed on to libc and handled by libc, and is presumably thus libc's problem?

correct. and my guess is that's why the other iconv(1)s say "not our problem --- tell iconv_open(3) what you want via our -f and -t arguments".

And //IGNORE says that characters that cannot be converted are discarded, it
does NOT say to stop early...

no, that was my point --- i think the macOS (bsd?) iconv(1) author said "look, i'm just going to stop early because if you don't want that, you can just tell iconv_open(3) via //IGNORE".

(whereas POSIX doesn't have those, which is probably why it talks about -s and -c even though they don't let you distinguish between "from" and "to" like //IGNORE and //TRANSLIT do.)

i'm a bit wary of not having a diagnostic (because everyone else does, and they may know something we don't), but i think the current behavior is probably good enough. i'd be tempted to mention //IGNORE and //TRANSLIT in the help text ... except of course musl doesn't implement them because they're not in POSIX :-(

@landley
Copy link
Owner

landley commented Dec 19, 2023

If //IGNORE says to print an error message, how is glibc supposed to implement that?

@enh-google
Copy link
Collaborator Author

If //IGNORE says to print an error message, how is glibc supposed to implement that?

no, //IGNORE isn't an iconv(1) thing -- it's an iconv(3) thing -- so this is exactly where glibc/bionic steps in. i'm just guessing that for BSD/macOS the same person who wrote iconv(3) wrote iconv(1) so their iconv(1) assumes that you have (and know all the intricacies of!) their iconv(3), and don't need special hand-holding in iconv(1).

@landley
Copy link
Owner

landley commented Mar 22, 2024

Looking at this one again, whose court is this ball currently in?

Right now the iconv.c command loop is ignoring the return value of iconv() and instead looking at the adjusted in/inlen out/outlen values, advancing past leading bad characters in the input to try to get through it (skipping them for -c and passing them through verbatim otherwise).

But if the library function is printing error messages, then presumably we need to care about -1 return values, if nothing else to avoid spamming multiple error messages?

Is there a value that says "an error message was produced"? Do we abort on any -1? (Which might render the skipping behavior moot...?) Should I check for //TRANSLIT or //IGNORE myself?

@enh-google
Copy link
Collaborator Author

But if the library function is printing error messages, then presumably we need to care about -1 return values, if nothing else to avoid spamming multiple error messages?

the library function does not print error messages in any implementation i'm aware of.

if there's an EILSEQ from a conversion inside iconv(3):

  1. //IGNORE gets you no output byte and keeps going.
  2. //TRANSLIT gets you a ? for that output byte and keeps going.
  3. otherwise iconv(3) passes on the -1/EILSEQ to its caller (and the caller is responsible for printing any diagnostic/doing any more complicated replacement than ?).

Should I check for //TRANSLIT or //IGNORE myself?

no, that would undermine the purpose of those modifiers. (and no other iconv(1) seems to do this.) i think if toybox iconv(1) mentions the iconv(3) //TRANSLIT or //IGNORE modifiers at all, it should be in the help text, but you probably don't want to do that since they're not POSIX so musl of course doesn't implement them. (which is unfortunate because, as we've seen, everyone else's iconv(3) helps you paper over iconv(1)'s options being shit.)

Is there a value that says "an error message was produced"? Do we abort on any -1?

any -1. EINVAL vs EILSEQ gives you a bit of detail about what was wrong with the input sequence. (there's also E2BIG for running out of output buffer, but toybox should never see that. in any case, i think perror_exit() is the way to go rather than actually looking at errno.)

and i think that's all there is for toybox to do here? if iconv(3) returns -1 && !FLAG(c), perror_exit()? that would match both gnu and bsd iconv(1)?

(and there's no point implementing -s because it didn't seem to actually do anything useful.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants