Add Parsing for .olm (Outlook for Mac) Files. #244

Open
TheElementalOfDestruction opened this issue Feb 11, 2022 · 8 comments
Labels: enhancement, Accepted (this feature request has been accepted and will be developed)

Comments

@TheElementalOfDestruction
Collaborator

TheElementalOfDestruction commented Feb 11, 2022

Add support for parsing and handling .olm files in general. I need to see if I can track down proper documentation for these, but from what I have observed they seem rather simple: each is a renamed zip file composed of folders and (mostly) XML files. If a directory has emails, it seems to use the following format:

  • Email XML files are named __message_attachment__{id}.xml (so far I have only seen the ID as a 6-digit number; unclear whether it is hex or decimal).
  • Attachments are stored in a subfolder, com.microsoft.__Attachments, in files named with the message ID followed by an underscore and a 4-digit number, presumably the ID of the attachment within that message.
  • Email XMLs start with the <emails> tag, presumably allowing them to store more than one email (each denoted by the <email> tag). Property names within it have, so far, reliably been in the format OPFMessageCopy{name} (so AttachmentList becomes OPFMessageCopyAttachmentList, Body becomes OPFMessageCopyBody, etc.).
  • Attachments also seem to have a full URL (their position in the zip file). I would advise first checking for attachments listed that way, then also checking for attachments stored in the previously described way, adding the latter only if they are not already listed.
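
A minimal sketch of the layout described above, assuming the observations hold in general (the format is not confirmed by any documentation). The helper names `iter_emails`, `properties`, and `attachment_paths` are hypothetical, and treating com.microsoft.__Attachments as a sibling of the message file is an assumption:

```python
import posixpath
import re
import xml.etree.ElementTree as ET
import zipfile

PREFIX = "OPFMessageCopy"
# Unclear whether the ID is hex or decimal, so accept hex digits (a superset).
MESSAGE_RE = re.compile(r"__message_attachment__([0-9A-Fa-f]+)\.xml")
ATTACH_DIR = "com.microsoft.__Attachments"


def iter_emails(olm):
    """Yield (message_id, zip_path, <email> element) from an open ZipFile."""
    for name in olm.namelist():
        match = MESSAGE_RE.fullmatch(posixpath.basename(name))
        if match is None:
            continue
        root = ET.fromstring(olm.read(name))
        # The root is expected to be <emails>, holding one or more <email>.
        for email in root.iter("email"):
            yield match.group(1), name, email


def properties(email):
    """Map short names (prefix stripped) to the child elements of an <email>.

    A repeated tag would overwrite the earlier one; fine for a first look.
    """
    return {
        child.tag[len(PREFIX):]: child
        for child in email
        if child.tag.startswith(PREFIX)
    }


def attachment_paths(olm, message_id, message_path, listed=()):
    """Return listed paths first, then files found by naming convention.

    Files matching {message_id}_{4 digits} in the attachments subfolder are
    added only if they are not already in `listed`.
    """
    folder = posixpath.join(posixpath.dirname(message_path), ATTACH_DIR)
    pattern = re.compile(re.escape(message_id) + r"_\d{4}")
    found = list(listed)
    for name in olm.namelist():
        in_folder = posixpath.dirname(name) == folder
        if in_folder and pattern.match(posixpath.basename(name)) and name not in found:
            found.append(name)
    return found
```

Usage would be along the lines of `with zipfile.ZipFile("export.olm") as olm: ...`, looping over `iter_emails(olm)`; how the listed attachment URLs are read out of the XML is left open here because that part of the structure has not been pinned down yet.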
@TheElementalOfDestruction TheElementalOfDestruction added enhancement Accepted This feature request has been accepted and will be developed labels Feb 11, 2022
@ReblochonMasque

This is great, thank you.

Of note:

  • This is the structure observed on an .olm file produced by Microsoft Outlook For Mac 2011, version <placeholder>
  • The file was built via File - Export - email only, filtered to one category only
    • i.e., other objects (contacts, tasks, calendar, etc.) were not included.
  • Additional .olm files will be needed to validate the data structures and for testing (my first thought is that these will be hard to get, or tedious to produce)

@TheElementalOfDestruction
Collaborator Author

It should also be noted that, despite that filtering, the export seems to have created (potentially) full folder structures as if it were going to write that data, but just didn't put in the actual files.

@ReblochonMasque

ReblochonMasque commented Feb 12, 2022

Oh my! It is leaking info like krazy! Typical Microsoft "betrayal of their duty to users".
Let's not use this publicly, if you please.
...that makes building test data, let alone clean test data, more difficult, as it cannot reasonably be extracted from a live Outlook instance.

@TheElementalOfDestruction
Collaborator Author

I wasn't intending to add the test file you gave me to the official testing, so no worries there. However, if I can adequately build an instance of Outlook that exists only for tests, where everything it touches is intended to be public anyway, then files from that should be safe to add to testing.

Unfortunately this suddenly became a lot more complicated, as the format is not part of the Microsoft Open Specifications, meaning it does not have official documentation that is public nor is it guaranteed to be consistent. This will make any attempt at a parser much more of a challenge.

@ReblochonMasque

Thank you!
Yes, I was also thinking that a dedicated instance of Outlook would be needed for this purpose.
One approach could be to first build a crude/simple prototype parser, and throw a large .olm db at it to see if the number of special cases is reasonable.
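
As a rough illustration of that idea, here is a sketch that only counts which tags appear directly under each `<email>` across a whole file, so rare tags (likely special cases) stand out. The function name `survey_tags` is hypothetical, and it assumes the zip-of-XML layout described earlier in this thread:

```python
from collections import Counter
import xml.etree.ElementTree as ET
import zipfile


def survey_tags(olm_path):
    """Count how often each tag appears directly under <email> in an .olm file."""
    counts = Counter()
    with zipfile.ZipFile(olm_path) as olm:
        for name in olm.namelist():
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(olm.read(name))
            except ET.ParseError:
                # Keep going; unparseable files are themselves a special case.
                counts["<unparseable xml>"] += 1
                continue
            for email in root.iter("email"):
                counts.update(child.tag for child in email)
    return counts
```

Calling `survey_tags("export.olm").most_common()` on a large export would give a first estimate of how many distinct properties a parser has to handle.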

@TheElementalOfDestruction
Collaborator Author

I wanted to see if I could make something dead simple to parse it, just grabbing each of the tags and making anything found accessible directly, but some of the tags are lists (categories, attachments, addresses) and some have properties inside the tag itself. It would probably need some additional code to find some of these patterns and parse them correctly. I'll probably wait for that testing environment before really getting to work on this, so that I can get a much better idea of how this should look.
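
One possible shape for that additional code, as a sketch only: a recursive conversion that returns plain text for simple tags, a list for tags that only wrap children, and a dict when the tag carries attributes. The function name `to_value`, the `#text`/`#children` keys, and the element names in the usage note are made up for illustration and are not taken from a real .olm file:

```python
import xml.etree.ElementTree as ET


def to_value(element):
    """Convert an element to text, a list, or a dict, depending on its shape."""
    children = [to_value(child) for child in element]
    text = (element.text or "").strip()
    if not element.attrib:
        # Plain wrapper tags become lists; plain leaf tags become their text.
        return children if children else text
    # Tags with properties inside the tag keep them as dict entries.
    value = dict(element.attrib)
    if text:
        value["#text"] = text
    if children:
        value["#children"] = children
    return value
```

For example, `to_value(ET.fromstring('<Addresses><address email="a@example.com"/></Addresses>'))` gives a one-item list holding a dict with the `email` attribute, while a tag like `<Body>hello</Body>` comes back as just the string.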

@ReblochonMasque

I've built a "proof of concept", mostly to convince myself and as a learning exercise. It is clear that some mails are more straightforward to handle than others. It is easy when the bodies/contents are plain text marked up with vanilla HTML. What stumped me were those where the body content looked like the format of .docx documents. Maybe there are parsers for that already? There may be other formats that are hard to parse too, but I have not encountered those yet.
I have not attempted to parse nested structures with forwarded e-mails, attachments, etc.

Do you have an idea of how to set up a dedicated instance of Outlook to generate tests?

@TheElementalOfDestruction
Collaborator Author

A lot of the time they are built with a form of HTML that has additional tags for formatting in Word and the like, but it renders fine as plain HTML. Not sure if these are what you are talking about, but they also sometimes have branching conditions that check for Word being present before activating, with a fallback that renders correctly in something like a browser when it is not available.
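
If those branching conditions are the usual conditional comments (`<!--[if gte mso 9]> ... <![endif]-->` for Word-only blocks and `<![if !mso]> ... <![endif]>` around fallbacks), which is an assumption about these particular bodies, then a sketch like the following would reduce a body to roughly what a browser shows. The name `strip_conditionals` is hypothetical:

```python
import re

# Word-only blocks: browsers see these as ordinary comments and skip them.
# The lookahead leaves the "<!--[if !mso]><!-->" form alone, since the
# content after that marker is meant to be visible.
HIDDEN = re.compile(
    r"<!--\[if [^\]]*\]>(?!<!-->).*?<!\[endif\]-->",
    re.DOTALL | re.IGNORECASE,
)
# Bare markers around fallback content: drop the markers, keep the content.
REVEALED = re.compile(r"<!\[(?:if [^\]]*|endif)\]>", re.IGNORECASE)


def strip_conditionals(html):
    """Remove Word-only conditional blocks and bare conditional markers."""
    html = HIDDEN.sub("", html)
    return REVEALED.sub("", html)
```

This is regex-based and deliberately crude; it does not attempt to handle the Word-specific tags and attributes that remain in the markup, which browsers ignore anyway.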

The way I would do it is to set up a computer with a fresh install of Outlook, designed to have nothing on it aside from information that can be public.
