Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MimePart.to_string() can produce different output depending on MimePart.was_changed() #194

Open
mhahnenberg opened this issue May 14, 2018 · 6 comments

Comments

@mhahnenberg
Copy link
Contributor

mhahnenberg commented May 14, 2018

RFC1341 states:
"The Quoted-Printable encoding REQUIRES that encoded lines be no more than 76 characters long."

If we pass a long 7bit text part with simple '\n' in it, it will be encoded as quoted-printable which leads to invalid output. This is due to two things:

  • has_long_lines() uses str.splitlines(), which splits universal newlines (which include '\n'), thus reducing the perceived length of the lines (even though they're not considered line breaks in the final quoted-printable CTE). Instead I believe it should use str.split('\r\n').

  • Even after fixing this, we still pass quoted-printable as the preferred encoding inside _choose_text_encoding even though the line length is too long. I think we should either prefer base64 for text parts with long lines (simpler implementation, but base64 incurs a higher overhead) or convert the long lines to a quoted-printable compatible form using "soft" line breaks.

Relevant code is here: https://github.com/mailgun/flanker/blob/master/flanker/mime/message/part.py#L657

@mhahnenberg
Copy link
Contributor Author

I can make these changes if all of this makes sense, I just wanted to discuss it first and make sure I'm understanding everything correctly.

@horkhe
Copy link
Member

horkhe commented May 14, 2018

All is reasonable. I would prefer the soft line breaks option. @b0d0nne11 what do you think?

@mhahnenberg
Copy link
Contributor Author

mhahnenberg commented May 14, 2018

After some more investigation it appears quopri.encodestring does the correct thing with line lengths > 76, so that removes the need to handle the 2nd bullet. I assumed that it didn't bother with line length-related things because the documentation for the quopri module didn't explicitly mention it.

That leaves just the first bullet, which is a much easier change 😄

@mhahnenberg
Copy link
Contributor Author

mhahnenberg commented May 15, 2018

Okay, so I did even more investigation and I think the original problem I described doesn't exist. I think the actual issue is that when we first scan a MIME string the resulting MimePart(s) return False for was_changed(), so when we call message.to_string() we just return the string as it was scanned in without re-encoding it. This avoids all the logic of detecting which CTE should be used. The following test demonstrates this:

def text_long_line_quoted_printable():
    # TEXT_LONG_LINES is a string containing a MIME message with lines 
    # longer than allowed for quoted-printable.
    message = scan(TEXT_LONG_LINES)

    # Without the following lines this test will fail.
    # message._container._body_changed = True
    # assert message.was_changed(ignore_prepends=True)
    print(message.to_string())
    eq_('quoted-printable', message.content_encoding.value)
    eq_('ascii', message.content_type.get_charset())

Is this the desired behavior?

@mhahnenberg mhahnenberg changed the title Handling of long lines in text parts can violate RFC1341 Round-trip parsing doesn't produce the same output May 15, 2018
@mhahnenberg mhahnenberg changed the title Round-trip parsing doesn't produce the same output MimePart.to_string() can produce different output depending on MimePart.was_changed() May 15, 2018
@horkhe
Copy link
Member

horkhe commented May 15, 2018

@mhahnenberg the idea is that if we do not need to make changes to the mime the we return it as is. This is intentional. Originally we did encode message in to_string even if it was not changed, but some customers complained (do not remember why though) and we introduced this is_changed behavior.

@anton-efimenko may be you remember what why customers complained?

@mhahnenberg
Copy link
Contributor Author

You're going to laugh at me, and I'm really sorry for continually changing what this issue is about, but this has been a very tricky bug for me to figure out 😞 However, I think I might have it now (hopefully).

flanker hardcodes line endings as CRLF by default. To encode messages as quoted-printable flanker uses quopri which tries to automatically detect which line ending to use for soft line breaks based on its input (see https://bugs.python.org/issue20121). quopri must generate soft line breaks for long lines to obey the 76 character limit imposed by the quoted-printable standard.

By default quopri assumes LF unless it runs into a CRLF first. flanker encodes each part's body in isolation, so if a part's body contains only LF line endings (or no line endings at all) quopri will assume that it should use =\n for soft line breaks when in reality we always want =\r\n since CRLF is hardcoded in flanker and it is unwise to mix line endings.

I think a quick fix for this would be for flanker to trick quopri by prepending a CRLF to the part's body and then removing it after the encoding has been performed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants