preparedHTML: Incorrect output of two-byte utf-8 characters in header data #432

Open
digidigital opened this issue Oct 2, 2024 · 3 comments


@digidigital

Bug Metadata

  • Version of extract_msg: 0.49.0
  • Your python version: Python 3.10
  • How did you launch extract_msg?
    • [x] My command line or
    • [x] I used the extract_msg package

Describe the bug
If the header data contains a two-byte UTF-8 character (such as the German umlaut "ä") that is transmitted as an encoded word (=?UTF-8?Q?…?=), the result is two separate characters when using prepared-html.
Output for text or "regular" HTML is fine.
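To illustrate the symptom, here is a minimal sketch using only the Python standard library (not extract_msg itself). The sample header value is made up for illustration. It shows that an encoded word decodes to a single "ä", and that the same two UTF-8 bytes turn into two characters if something reads them with a single-byte codec such as Latin-1. Whether that is the exact mechanism in extract_msg is an assumption.

```python
from email.header import decode_header, make_header

# Hypothetical header value containing a Q-encoded "ä" (UTF-8 bytes C3 A4).
raw = "=?UTF-8?Q?J=C3=A4ger?="

# Correct decoding yields one character for the umlaut.
decoded = str(make_header(decode_header(raw)))

# "ä" occupies two bytes in UTF-8.
utf8_bytes = "ä".encode("utf-8")

# Reading those bytes with a single-byte codec produces two characters.
mojibake = utf8_bytes.decode("latin-1")

print(decoded, utf8_bytes, mojibake)
```

If the prepared HTML shows "Ã¤" where "ä" is expected, that pattern points to UTF-8 bytes being read as a single-byte encoding somewhere along the way.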

What code did you use or can we use to reproduce this error?
Just try the attached text-email with --html --prepared-html

Is there a message.msg file you want to share to help us reproduce this?

Additional context
I tried to track down the issue, and I assume it is caused by passing the HTML as UTF-8-encoded bytes to Beautiful Soup (message_base.py, line 385). I assume bs then interprets the two-byte character as two separate characters.

def getSaveHtmlBody(self, preparedHtml: bool = False, charset: str = 'utf-8', **_) -> bytes:
    """
    Returns the HTML body that will be used in saving based on the
    arguments.

    :param preparedHtml: Whether or not the HTML should be prepared for
        standalone use (add tags, inject images, etc.).
    :param charset: If the html is being prepared, the charset to use for
        the Content-Type meta tag to insert. This exists to ensure that
        something parsing the html can properly determine the encoding (as
        not having this tag can cause errors in some programs). Set this to
        ``None`` or an empty string to not insert the tag. (Default:
        'utf-8')
    :param _: Used to allow kwargs expansion in the save function.
        Arguments absorbed by this are simply ignored.
    """
    if self.htmlBody:
        # Inject the header into the data.
        data = self.injectHtmlHeader(prepared = preparedHtml)

        # If we are preparing the HTML, then we should
        if preparedHtml and charset:
            bs = bs4.BeautifulSoup(data, features = 'html.parser')
  • self.injectHtmlHeader returns bytes.

  • This happens because the replace function inside injectHtmlHeader encodes the string returned by htmlInjectableHeader to bytes.

  • The string returned by htmlInjectableHeader still has the umlauts in the correct form.

    def replace(bodyMarker):
        """
        Internal function to replace the body tag with itself plus the
        header.
        """
    
        # I recently had to change this and how it worked. Now we use a new
        # property of `MSGFile` that returns a special tuple of tuples to define
        # how to get all of the properties we are formatting. They are all
        # processed in the same way, making everything neat. By defining them
        # in each class, any class can specify a completely different set to be
        # used.
        return bodyMarker.group() + self.htmlInjectableHeader.encode('utf-8')
        # Use the previously defined function to inject the HTML header.
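
The pattern described above can be sketched in isolation as follows. This is not the extract_msg code: `inject_header` and the regular expression are hypothetical stand-ins, since the issue does not show how the body marker is matched. The sketch only shows that a callback of this shape, applied to a bytes body, produces bytes in which "ä" is stored as the two bytes C3 A4.

```python
import re

def inject_header(body: bytes, header: str) -> bytes:
    """Insert `header` right after the opening body tag of a bytes document."""

    def replace(body_marker: re.Match) -> bytes:
        # Same shape as the callback quoted in the issue: matched tag + encoded header.
        return body_marker.group() + header.encode("utf-8")

    # Assumed pattern for the opening body tag; the real one is not shown in the issue.
    return re.sub(rb"<body[^>]*>", replace, body, count=1, flags=re.IGNORECASE)

result = inject_header(
    b"<html><body><p>Hi</p></body></html>",
    "<b>From:</b> Jäger",
)
print(result)
```

The result is still valid UTF-8, so anything consuming it has to decode it as UTF-8 (or be told the charset) to get the umlaut back as one character.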
    

Potential fix
Decode the value passed to Beautiful Soup in getSaveHtmlBody with .decode('utf-8'), so that data is passed to bs as a regular str rather than as UTF-8-encoded bytes:

  if self.htmlBody:
      # Inject the header into the data.
      data = self.injectHtmlHeader(prepared = preparedHtml).decode('utf-8')
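
The idea behind this fix, decoding explicitly before parsing, can be sketched with the standard library alone. Here `html.parser.HTMLParser` is used as a stand-in for Beautiful Soup, and `ensure_text` and `TextCollector` are hypothetical helpers, not part of extract_msg or bs4. The sketch assumes the bytes really are UTF-8.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the text nodes of a document (stand-in for a real HTML parser)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def ensure_text(data, charset="utf-8"):
    # Decode bytes explicitly so the parser never has to guess the encoding.
    if isinstance(data, bytes):
        return data.decode(charset)
    return data

html_bytes = "<html><body><b>From:</b> Jäger</body></html>".encode("utf-8")

parser = TextCollector()
parser.feed(ensure_text(html_bytes))
parser.close()
text = "".join(parser.chunks)
print(text)
```

Decoding at the boundary keeps the parser's input unambiguous; the output can be encoded back to bytes with the desired charset when it is written out.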
@TheElementalOfDestruction
Collaborator

I've gone and modified the correct section of the data so it should not escape any non-ascii characters in the header. I would test this myself to ensure it is working correctly, but I don't have any examples with this issue that I can find, and the uploaded file is a .eml file rather than a .msg file. If you could either upload the correct file or download the version on the #next-release branch and test it yourself, that would be great.

I did try to do a conversion manually using Outlook, but the file is causing RTFDE to throw mysterious errors, so now I have something else to look into as well 😅

TheElementalOfDestruction added a commit that referenced this issue Oct 2, 2024
@digidigital
Author

Thx for the quick fix. I tested it for html, prepared html, text, and it works as expected. 😻

Here is the msg -> test.zip

I did not notice that a .msg you send as an attachment from Outlook/Windows is saved as .eml when you get the mail in Thunderbird/Ubuntu and save the attachment 😇

@TheElementalOfDestruction
Collaborator

Yeah, the reason it saves that way is that often attaching the msg file will actually mangle it when the email gets sent.

I'll probably look a little bit harder at some of the other things that can influence the HTML body to ensure that this issue won't come up anywhere else and then I'll publish the release.
