diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 4dc97cbc..b8c84198 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -5,6 +5,7 @@ - [ ] If necessary, have you bumped the version number? We will usually do this for you. - [ ] Have you included py.test tests with your pull request. (Not yet necessary) - [ ] Ensured your code is as close to PEP 8 compliant as possible? +- [ ] Ensured your pull request is to the next-release branch? If you haven't completed the above items, please wait to create a PR until you have done so. We will try to review and reply to PRs as quickly as possible. diff --git a/CHANGELOG.md b/CHANGELOG.md index 5c6ea369..ef69de0a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,61 @@ +**v0.29.0** +* [[TeamMsgExtractor #207](https://github.com/TeamMsgExtractor/msg-extractor/issues/207)] Made it so that unspecified dates are handled properly. For clarification, an unspecified date is a custom value in MSG files for dates that means that the date is unspecified. It is distinctly different from a property not existing, which will still return None. For unspecified dates, `datetime.datetime.max` is returned. While perhaps not the best solution, it will have to do for now. +* Fixed an issue where `utils.parseType` was returning a string for the date when it makes more sense to return an actual datetime instance. +* [[TeamMsgExtractor #165](https://github.com/TeamMsgExtractor/msg-extractor/issues/165)] [[TeamMsgExtractor #191](https://github.com/TeamMsgExtractor/msg-extractor/issues/191)] Completely redesigned all existing save functions. You can now properly save to custom locations under custom file names. This change may break existing code for several reasons. First, all arguments have been changed to keyword arguments. Second, a few keyword arguments have been renamed to better fit the naming conventions. 
+* [[TeamMsgExtractor #200](https://github.com/TeamMsgExtractor/msg-extractor/issues/200)] Changed imports to use relative imports instead of absolute imports where applicable. +* Updated the save functions to no longer rely on the current working directory to save things. The module now does what it can to use absolute paths so that if you spontaneously change working directory it will not cause problems. This should also allow for saving to be threaded, if I am correct. +* [[TeamMsgExtractor #197](https://github.com/TeamMsgExtractor/msg-extractor/issues/197)] Added new property `Message.defaultFolderName`. This property returns the default name to be used for a Message if none of the options change the name. +* [[TeamMsgExtractor #201](https://github.com/TeamMsgExtractor/msg-extractor/issues/201)] Fixed an issue where if the class type was all caps it would not be recognized. According to the documentation the comparisons should have been case insensitive, but I must have misread it at some point. +* [[TeamMsgExtractor #202](https://github.com/TeamMsgExtractor/msg-extractor/issues/202)] Module will now handle path lengths in a semi-intelligent way to determine how best to save the MSG files. Default path length max is 255. +* [[TeamMsgExtractor #203](https://github.com/TeamMsgExtractor/msg-extractor/issues/203)] Fixed an issue where having multiple "." characters in your file name would cause the directories to be incorrectly named when using the `useFileName` (now `useMsgFilename`) argument in the save function. +* [[TeamMsgExtractor #204](https://github.com/TeamMsgExtractor/msg-extractor/issues/204)] Fixed an issue where the failsafe name used by attachments wasn't being encoded beforehand, causing encoding errors. +* MSG files with a type of simply `IPM` will now be returned as `MSGFile` by `openMsg`, as this specifies that no format has been specified. 
+* [[TeamMsgExtractor #214](https://github.com/TeamMsgExtractor/msg-extractor/issues/214)] Attachments that error because the MSG class type wasn't recognized or isn't supported will now correctly be `UnsupportedAttachment` instead of `BrokenAttachment`. +* Improved internal code in many functions to make them faster and more efficient. +* `openMsg` will now tell you if a class type is simply unsupported rather than unrecognized. If it is found in the list, the function will raise `UnsupportedMSGTypeError`. +* Added caching to `MSGFile.listDir`. I found that if you have larger files this single function might be taking up over half of the processing time because of how many times it is used in the module. +* Fully implemented raw saving. +* Extended the `Contact` class to have more properties. +* Added new function `MSGFile._ensureSetTyped` which acts like the other ensure set functions but doesn't require you to know the type. Prefer to use other ensure set function when you know exactly what type it will be. +* Changed `Message.saveRaw` to `MSGFile.saveRaw`. +* Changed `MSGFile.saveRaw` to take a path and save the contents to a zip file. +* Corrected the help doc to reflect the current repository (was still on mattgwwalker). +* Fixed a bug that would cause an exception on trying to access the RTF body on a file that didn't have one. This is now correctly returning `None`. +* The `raw` keyword of `Message.save` now actually works. +* Added property `Attachment.randomFilename` which allows you to get the randomly generated name for attachments that don't have a usable one otherwise. +* Added function `Attachment.regenerateRandomName` for creating a new random name if necessary. +* Added function `Attachment.getFilename`. This function is used to get the name an attachment will be saved with given the specified arguments. Arguments are identical to `Attachment.save`. +* Changed pull requests to reflect new style. 
+* Added additional properties for defined MSG file fields. +* Added zip file support for `Attachment.save` and `Message.save`. Simply pass a path for the `zip` keyword argument and it will create a new `ZipFile` instance and save all of its data inside there. Alternatively, you can pass an instance of a class that is either a `ZipFile` or `ZipFile`-like and it will simply use that. When this argument is defined, the `customPath` argument refers to the path inside the zip file. +* Added the `html` and `rtf` keywords to `Message.save`. These will attempt to save the body in the html or rtf format, respectively. If the program cannot save in those formats, it will raise an exception unless the `allowFallback` keyword argument is `True`. +* Changed `utils.hasLen` to use `hasattr` instead of the try-except method it was using. +* Added new option `recipientSeparator` to `MessageBase` allowing you to specify a custom recipient separator (default is ";" to match Microsoft Outlook). +* Changed the `openMsg` function in `Attachment` to not be strict. This allows you to actually open the MSG file even if we don't recognize the type of embedded MSG that is being used. +* Attempted to normalize encoding names throughout the module so that a certain encoding will only show up using one name and not multiple. +* Finally figured out what CRC32 algorithm is used in named properties after directly asking in a Microsoft forum (see the thread [here](https://docs.microsoft.com/en-us/answers/questions/574894/ms-oxmsg-specifies-the-use-of-crc-32-checksums-wit.html)). Fortunately it is already defined in the `compressed-rtf` module so we can take advantage of that. +* Reworked `MessageBase._genRecipient` to improve it (because what on earth was that code it was using before?). Variables in the function are now more descriptive. Added comments in several places. +* Many renames to better fit naming convention: + * `dev.setup_dev_logger` to `dev.setupDevLogger`. 
+ * `MSGFile.fix_path` to `MSGFile.fixPath`. + * `MessageBase.save_attachments` to `MessageBase.saveAttachments`. + * `*.Exists` to `exists`. + * `*.ExistsTypedProperty` to `*.existsTypedProperty`. + * `prop.create_prop` to `prop.createProp`. + * `Properties.attachment_count` to `Properties.attachmentCount`. + * `Properties.next_attachment_id` to `Properties.nextAttachmentId`. + * `Properties.next_recipient_id` to `Properties.nextRecipientId`. + * `Properties.recipient_count` to `Properties.recipientCount`. + * `utils.get_command_args` to `utils.getCommandArgs`. + * `utils.get_full_class_name` to `utils.getFullClassName`. + * `utils.get_input` to `utils.getInput`. + * `utils.has_len` to `utils.hasLen`. + * `utils.setup_logging` to `utils.setupLogging`. + * `constants.int_to_data_type` to `constants.intToDataType`. + * `constants.int_to_intelligence` to `constants.intToIntelligence`. + * `constants.int_to_recipient_type` to `constants.intToRecipientType`. + * Misc internal function variables. + **v0.28.7** * Added hex versions of the `MULTIPLE_X_BYTES` constants. * Added `1048` to `constants.MULTIPLE_16_BYTES` diff --git a/README.rst b/README.rst index c7a743f0..02bf1848 100644 --- a/README.rst +++ b/README.rst @@ -180,7 +180,7 @@ Credits `Matthew Walker`_ - Original developer and owner -`Destiny Peterson (The Elemental of Destruction)`_ - Principle programmer, manager, and msg file "expert" +`Destiny Peterson (The Elemental of Destruction)`_ - Co-owner, principle programmer, knows more about msg files than anyone probably should `JP Bourget`_ - Senior programmer, readability and organization expert, secondary manager @@ -197,8 +197,8 @@ And thank you to everyone who has opened an issue and helped us track down those .. |License: GPL v3| image:: https://img.shields.io/badge/License-GPLv3-blue.svg :target: LICENSE.txt -.. |PyPI3| image:: https://img.shields.io/badge/pypi-0.28.7-blue.svg - :target: https://pypi.org/project/extract-msg/0.28.7/ +.. 
|PyPI3| image:: https://img.shields.io/badge/pypi-0.29.0-blue.svg + :target: https://pypi.org/project/extract-msg/0.29.0/ .. |PyPI1| image:: https://img.shields.io/badge/python-2.7+-brightgreen.svg :target: https://www.python.org/downloads/release/python-2715/ diff --git a/extract_msg/__init__.py b/extract_msg/__init__.py index b682b01a..2e35062f 100644 --- a/extract_msg/__init__.py +++ b/extract_msg/__init__.py @@ -27,20 +27,20 @@ # along with this program. If not, see . __author__ = 'Destiny Peterson & Matthew Walker' -__date__ = '2021-03-02' -__version__ = '0.28.7' +__date__ = '2022-01-13' +__version__ = '0.29.0' import logging -from extract_msg import constants -from extract_msg.appointment import Appointment -from extract_msg.attachment import Attachment -from extract_msg.contact import Contact -from extract_msg.exceptions import UnrecognizedMSGTypeError -from extract_msg.message import Message -from extract_msg.message_base import MessageBase -from extract_msg.msg import MSGFile -from extract_msg.prop import create_prop -from extract_msg.properties import Properties -from extract_msg.recipient import Recipient -from extract_msg.utils import openMsg, properHex +from . 
import constants +from .appointment import Appointment +from .attachment import Attachment +from .contact import Contact +from .exceptions import UnrecognizedMSGTypeError +from .message import Message +from .message_base import MessageBase +from .msg import MSGFile +from .prop import createProp +from .properties import Properties +from .recipient import Recipient +from .utils import openMsg, properHex diff --git a/extract_msg/__main__.py b/extract_msg/__main__.py index 93044e89..57f7e226 100644 --- a/extract_msg/__main__.py +++ b/extract_msg/__main__.py @@ -10,7 +10,7 @@ def main(): # Setup logging to stdout, indicate running from cli CLI_LOGGING = 'extract_msg_cli' - args = utils.get_command_args(sys.argv[1:]) + args = utils.getCommandArgs(sys.argv[1:]) level = logging.INFO if args.verbose else logging.WARNING currentdir = os.getcwdu() # Store this just in case the paths that have been given are relative if args.out_path: @@ -29,17 +29,17 @@ def main(): from extract_msg import validation - val_results = {x[0]: validation.validate(x[0]) for x in args.msgs} + valResults = {x[0]: validation.validate(x[0]) for x in args.msgs} filename = 'validation {}.json'.format(int(time.time())) print('Validation Results:') - pprint.pprint(val_results) + pprint.pprint(valResults) print('These results have been saved to {}'.format(filename)) with open(filename, 'w') as fil: - fil.write(json.dumps(val_results)) - utils.get_input('Press enter to exit...') + fil.write(json.dumps(valResults)) + utils.getInput('Press enter to exit...') else: if not args.dump_stdout: - utils.setup_logging(args.config_path, level, args.log, args.file_logging) + utils.setupLogging(args.config_path, level, args.log, args.file_logging) for x in args.msgs: try: with Message(x[0]) as msg: @@ -48,7 +48,7 @@ def main(): print(msg.body) else: os.chdir(out) - msg.save(toJson = args.json, useFileName = args.use_filename, ContentId = args.cid)#, html = args.html, rtf = args.html, args.allowFallback) + msg.save(json = 
args.json, useMsgFilename = args.use_filename, contentId = args.cid, html = args.html, rtf = args.html, allowFallback = args.allowFallback) except Exception as e: print("Error with file '" + x[0] + "': " + traceback.format_exc()) diff --git a/extract_msg/appointment.py b/extract_msg/appointment.py index 8454513c..21106ded 100644 --- a/extract_msg/appointment.py +++ b/extract_msg/appointment.py @@ -1,14 +1,15 @@ -from extract_msg import constants -from extract_msg.attachment import Attachment -from extract_msg.message_base import MessageBase +from . import constants +from .attachment import Attachment +from .message_base import MessageBase + class Appointment(MessageBase): """ Parser for Microsoft Outlook Appointment files. """ - def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = None, delayAttachments = False, overrideEncoding = None, attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW): - MessageBase.__init__(self, path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior) + def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = None, delayAttachments = False, overrideEncoding = None, attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW, recipientSeparator = ';'): + MessageBase.__init__(self, path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior, recipientSeparator) @property def appointmentClassType(self): diff --git a/extract_msg/attachment.py b/extract_msg/attachment.py index 6090a06c..7d930872 100644 --- a/extract_msg/attachment.py +++ b/extract_msg/attachment.py @@ -1,13 +1,15 @@ import logging import random import string +import zipfile -from extract_msg import constants -from extract_msg.attachment_base import AttachmentBase -from extract_msg.named import NamedAttachmentProperties -from extract_msg.prop import FixedLengthProp, VariableLengthProp -from extract_msg.properties import Properties -from 
extract_msg.utils import openMsg, inputToString, prepareFilename, verifyPropertyId, verifyType +from . import constants +from .attachment_base import AttachmentBase +from .compat import os_ as os +from .named import NamedAttachmentProperties +from .prop import FixedLengthProp, VariableLengthProp +from .properties import Properties +from .utils import openMsg, inputToString, prepareFilename, verifyPropertyId, verifyType logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) @@ -29,10 +31,10 @@ def __init__(self, msg, dir_): AttachmentBase.__init__(self, msg, dir_) # Get attachment data - if self.Exists('__substg1.0_37010102'): + if self.exists('__substg1.0_37010102'): self.__type = 'data' self.__data = self._getStream('__substg1.0_37010102') - elif self.Exists('__substg1.0_3701000D'): + elif self.exists('__substg1.0_3701000D'): if (self.props['37050003'].value & 0x7) != 0x5: raise NotImplementedError( 'Current version of extract_msg does not support extraction of containers that are not embedded msg files.') @@ -40,7 +42,7 @@ def __init__(self, msg, dir_): else: self.__prefix = msg.prefixList + [dir_, '__substg1.0_3701000D'] self.__type = 'msg' - self.__data = openMsg(self.msg.path, self.__prefix, self.__class__, overrideEncoding = msg.overrideEncoding, attachmentErrorBehavior = msg.attachmentErrorBehavior) + self.__data = openMsg(self.msg.path, self.__prefix, self.__class__, overrideEncoding = msg.overrideEncoding, attachmentErrorBehavior = msg.attachmentErrorBehavior, strict = False) elif (self.props['37050003'].value & 0x7) == 0x7: # TODO Handling for special attacment type 0x7 self.__type = 'web' @@ -48,16 +50,27 @@ def __init__(self, msg, dir_): else: raise TypeError('Unknown attachment type.') - def save(self, contentId = False, json = False, useFileName = False, raw = False, customPath = None, - customFilename = None):#, html = False, rtf = False, allowFallback = False): - # Check if the user has specified a custom filename + def 
getFilename(self, **kwargs): + """ + Returns the filename to use for the attachment. + + :param contentId: Use the contentId, if available. + :param customFilename: A custom name to use for the file. + + If the filename starts with "UnknownFilename" then there is no guarantee + that the files will have exactly the same filename. + """ filename = None - if customFilename is not None and customFilename != '': + customFilename = kwargs.get('customFilename') + if customFilename: + # First we need to validate it. If there are invalid characters, this will detect it. + if constants.RE_INVALID_FILENAME_CHARACTERS.search(customFilename): + raise ValueError('Invalid character found in customFilename. Must not contain any of the following characters: \\/:*?"<>|') filename = customFilename else: # If not... # Check if user wants to save the file under the Content-id - if contentId: + if kwargs.get('contentId', False): filename = self.cid # If filename is None at this point, use long filename as first preference if filename is None: @@ -67,32 +80,155 @@ def save(self, contentId = False, json = False, useFileName = False, raw = False filename = self.shortFilename # Otherwise just make something up! if filename is None: - filename = 'UnknownFilename ' + \ - ''.join(random.choice(string.ascii_uppercase + string.digits) - for _ in range(5)) + '.bin' + return self.randomFilename + + return filename + + def regenerateRandomName(self): + """ + Used to regenerate the random filename used if the attachment cannot + find a usable filename. + """ + self.__randomName = inputToString('UnknownFilename ' + \ + ''.join(random.choice(string.ascii_uppercase + string.digits) + for _ in range(5)) + '.bin', 'ascii') + + def save(self, **kwargs): + """ + Saves the attachment data. + + The name of the file is determined by several factors. The first + thing that is checked is if you have provided :param customFilename: + to this function. If you have, that is the name that will be used. 
+ If no custom name has been provided and :param contentId: is True, + the file will be saved using the content ID of the attachment. If + it is not found or :param contentId: is False, the long filename + will be used. If the long filename is not found, the short one will + be used. If after all of this a usable filename has not been found, a + random one will be used (accessible from `Attachment.randomFilename`). + After the name to use has been determined, it will then be shortened to + make sure that it is not more than the value of :param maxNameLength:. + + If you want to save the contents into a ZipFile or similar object, + either pass a path to where you want to create one or pass an instance + to :param zip:. If :param zip: is an instance, :param customPath: will + refer to a location inside the zip file. + """ + # Check if the user has specified a custom filename + filename = self.getFilename(**kwargs) # Someone managed to have a null character here, so let's get rid of that filename = prepareFilename(inputToString(filename, self.msg.stringEncoding)) - if customPath is not None and customPath != '': - if customPath[-1] != '/' or customPath[-1] != '\\': - customPath += '/' - filename = customPath + filename + # Get the maximum name length. + maxNameLength = kwargs.get('maxNameLength', 256) + + # Make sure the filename is not longer than it should be. + if len(filename) > maxNameLength: + name, ext = os.path.splitext(filename) + filename = name[:maxNameLength - len(ext)] + ext + + # Check if we are doing a zip file. + zip = kwargs.get('zip') - if self.__type == "data": - with open(filename, 'wb') as f: + + # ZipFile handling. + if zip: + # If we are doing a zip file, first check that we have been given a path. + if isinstance(zip, constants.STRING): + # If we have a path then we use the zip file. 
+ zip = zipfile.ZipFile(zip, 'a', zipfile.ZIP_DEFLATED) + kwargs['zip'] = zip + createdZip = True + else: + createdZip = False + # Path needs to be done in a special way if we are in a zip file. + customPath = kwargs.get('customPath', '').replace('\\', '/') + customPath += '/' if customPath and customPath[-1] != '/' else '' + # Set the open command to be that of the zip file. + _open = zip.open + # Zip files use w for writing in binary. + mode = 'w' + else: + customPath = os.path.abspath(kwargs.get('customPath', os.getcwdu())).replace('\\', '/') + # Prepare the path. + customPath += '' if customPath.endswith('/') else '/' + mode = 'wb' + _open = open + + fullFilename = customPath + filename + + if self.__type == 'data': + if zip: + name, ext = os.path.splitext(filename) + nameList = zip.namelist() + if fullFilename in nameList: + for i in range(2, 100): + testName = customPath + name + ' (' + str(i) + ')' + ext + if testName not in nameList: + fullFilename = testName + break + else: + # If we couldn't find one that didn't exist. + raise FileExistsError('Could not create the specified file because it already exists ("{}").'.format(fullFilename)) + else: + if os.path.exists(fullFilename): + # Try to split the filename into a name and extension. + name, ext = os.path.splitext(filename) + # Try to add a number to it so that we can save without overwriting. + for i in range(2, 100): + testName = customPath + name + ' (' + str(i) + ')' + ext + if not os.path.exists(testName): + fullFilename = testName + break + else: + # If we couldn't find one that didn't exist. + raise FileExistsError('Could not create the specified file because it already exists ("{}").'.format(fullFilename)) + + with _open(fullFilename, mode) as f: f.write(self.__data) + + # Close the ZipFile if this function created it. 
+ if zip and createdZip: + zip.close() + + return fullFilename else: - self.saveEmbededMessage(contentId, json, useFileName, raw, customPath, customFilename)#, html, rtf, allowFallback) - return filename + self.saveEmbededMessage(**kwargs) + + # Close the ZipFile if this function created it. + if zip and createdZip: + zip.close() + + return self.msg - def saveEmbededMessage(self, contentId = False, json = False, useFileName = False, raw = False, customPath = None, - customFilename = None):#, html = False, rtf = False, allowFallback = False): + + def saveEmbededMessage(self, **kwargs): + """ + Separate function from save to allow it to easily be overridden by a + subclass. + """ + self.data.save(**kwargs) + + @property + def attachmentEncoding(self): """ - Seperate function from save to allow it to - easily be overridden by a subclass. + The encoding information about the attachment object. Will return + b'*\x86H\x86\xf7\x14\x03\x0b\x01' if encoded in MacBinary format, + otherwise it is unset. """ - self.data.save(json, useFileName, raw, contentId, customPath, customFilename)#, html, rtf, allowFallback) + return self._ensureSet('_attachmentEncoding', '__substg1.0_37020102', False) + + @property + def additionalInformation(self): + """ + The additional information about the attachment. This property MUST be + an empty string if attachmentEncoding is not set. Otherwise it MUST be + set to a string of the format ":CREA:TYPE" where ":CREA" is the + four-letter Macintosh file creator code and ":TYPE" is a four-letter + Macintosh type code. 
+ """ + return self._ensureSet('_additionalInformation', '__substg1.0_370F') @property def cid(self): @@ -101,7 +237,7 @@ def cid(self): """ return self._ensureSet('_cid', '__substg1.0_3712') - contend_id = cid + contendId = cid @property def data(self): @@ -117,6 +253,26 @@ def longFilename(self): """ return self._ensureSet('_longFilename', '__substg1.0_3707') + @property + def randomFilename(self): + """ + Returns the random filename to be used by this attachment. + """ + try: + return self.__randomName + except AttributeError: + self.regenerateRandomName() + return self.__randomName + + @property + def renderingPosition(self): + """ + The offset, in rendered characters, to use when rendering the attachment + within the main message text. A value of 0xFFFFFFFF indicates a hidden + attachment that is not to be rendered. + """ + return self._ensureSetProperty('_renderingPosition', '370B0003') + @property def shortFilename(self): """ diff --git a/extract_msg/attachment_base.py b/extract_msg/attachment_base.py index e87efc52..c839011c 100644 --- a/extract_msg/attachment_base.py +++ b/extract_msg/attachment_base.py @@ -1,10 +1,10 @@ import logging -from extract_msg import constants -from extract_msg.named import NamedAttachmentProperties -from extract_msg.prop import FixedLengthProp -from extract_msg.properties import Properties -from extract_msg.utils import verifyPropertyId, verifyType +from . import constants +from .named import NamedAttachmentProperties +from .prop import FixedLengthProp +from .properties import Properties +from .utils import verifyPropertyId, verifyType logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) @@ -74,6 +74,17 @@ def _ensureSetProperty(self, variable, propertyName): setattr(self, variable, value) return value + def _ensureSetTyped(self, variable, _id): + """ + Like the other ensure set functions, but designed for when something could be multiple types (where only one will be present). 
This way you have no need to set the type; it will be handled for you. + """ + try: + return getattr(self, variable) + except AttributeError: + value = self._getTypedData(_id) + setattr(self, variable, value) + return value + def _getStream(self, filename): return self.__msg._getStream([self.__dir, filename]) @@ -153,11 +164,11 @@ def _getTypedStream(self, filename, _type = None): def _registerNamedProperty(self, entry, _type, name = None): self.__namedProperties.defineProperty(entry, _type, name) - def Exists(self, filename): + def exists(self, filename): """ Checks if stream exists inside the attachment folder. """ - return self.__msg.Exists([self.__dir, filename]) + return self.__msg.exists([self.__dir, filename]) def sExists(self, filename): """ @@ -165,13 +176,13 @@ def sExists(self, filename): """ return self.__msg.sExists([self.__dir, filename]) - def ExistsTypedProperty(self, id, _type = None): + def existsTypedProperty(self, id, _type = None): """ Determines if the stream with the provided id exists. The return of this function is 2 values, the first being a boolean for if anything was found, and the second being how many were found. """ - return self.__msg.ExistsTypedProperty(id, self.__dir, _type, True, self.__props) + return self.__msg.existsTypedProperty(id, self.__dir, _type, True, self.__props) @property def dir(self): diff --git a/extract_msg/constants.py b/extract_msg/constants.py index f906de46..e99244b8 100644 --- a/extract_msg/constants.py +++ b/extract_msg/constants.py @@ -5,11 +5,13 @@ """ import datetime +import re import struct import sys import ebcdic + if sys.version_info[0] >= 3: BYTES = bytes STRING = str @@ -23,6 +25,30 @@ # WHAT YOU ARE DOING! FAILURE TO FOLLOW THIS INSTRUCTION # CAN AND WILL BREAK THIS SCRIPT! +# Regular expression constants. +RE_INVALID_FILENAME_CHARACTERS = re.compile(r'[\\/:*?"<>|]') +# Regular expression to find the start of the html body. 
+RE_HTML_BODY_START = re.compile(b'<body[^>]*>') +# Regular expression to find the start of the html body in encapsulated RTF. +# This is used for one of the pattern types that makes life easy. +RE_RTF_ENC_BODY_START_1 = re.compile(br'\{\\\*\\htmltag[0-9]* ?<body[^>]*>\}') +# Unfortunately, while it would make it easy to find the start of the body in +# terms of the encapsulated HTML, trying to inject directly into this location +# has proven to cause some rendering issues that I'll figure out later. For now +# this is basically the universal start we will try to use. +RE_RTF_BODY_START = re.compile(br'\\lang[0-9]*') +# This is an unreliable one to use as it doesn't have a proper way to verify that +# it will inject in exactly the right place. This is kind of just a "well, let's +# hope this one works" method. +RE_RTF_ENC_BODY_UGLY = re.compile(br'<body[^>]*>[^}]*?\}') +# The following tags are fallbacks that we will try to use, with the higher ones +# having priority. If we can't find any other way, we try these which should +# hopefully always work. +RE_RTF_BODY_FALLBACK_FS = re.compile(br'\\fs[0-9]*[^a-zA-Z]') +RE_RTF_BODY_FALLBACK_F = re.compile(br'\\f[0-9]*[^a-zA-Z]') +RE_RTF_FALLBACK_PLAIN = re.compile(br'\\plain[^a-zA-Z0-9]') + + # Constants used by named.py NUMERICAL_NAMED = 0 STRING_NAMED = 1 @@ -44,6 +70,20 @@ GUID_PSETID_XMLEXTRACTEDENTITIES = '{23239608-685D-4732-9C55-4C95CB4E8E33}' GUID_PSETID_ATTACHMENT = '{96357F7F-59E1-47D0-99A7-46515C183B54}' +# EntryID UID Types. 
+EUID_PUBLIC_MESSAGE_STORE = b'\x1A\x44\x73\x90\xAA\x66\x11\xCD\x9B\xC8\x00\xAA\x00\x2F\xC4\x5A' +EUID_PUBLIC_MESSAGE_STORE_HEX = '1A447390AA6611CD9BC800AA002FC45A' +EUID_ADDRESS_BOOK_RECIPIENT = b'\xDC\xA7\x40\xC8\xC0\x42\x10\x1A\xB4\xB9\x08\x00\x2B\x2F\xE1\x82' +EUID_ADDRESS_BOOK_RECIPIENT_HEX = 'DCA740C8C042101AB4B908002B2FE182' +EUID_ONE_OFF_RECIPIENT = b'\x81\x2B\x1F\xA4\xBE\xA3\x10\x19\x9D\x6E\x00\xDD\x01\x0F\x54\x02' +EUID_ONE_OFF_RECIPIENT_HEX = '812B1FA4BEA310199D6E00DD010F5402' +# Contact address or personal distribution list recipient. +EUID_CA_OR_PDL_RECIPIENT = b'\xFE\x42\xAA\x0A\x18\xC7\x1A\x10\xE8\x85\x0B\x65\x1C\x24\x00\x00' +EUID_CA_OR_PDL_RECIPIENT_HEX = 'FE42AA0A18C71A10E8850B651C240000' +EUID_NNTP_NEWSGROUP_FOLDER = b'\x38\xA1\xBB\x10\x05\xE5\x10\x1A\xA1\xBB\x08\x00\x2B\x2A\x56\xC2' +EUID_NNTP_NEWSGROUP_FOLDER_HEX = '38A1BB1005E5101AA1BB08002B2A56C2' + + FIXED_LENGTH_PROPS = ( 0x0000, 0x0001, @@ -164,13 +204,124 @@ 0x1048, ) +# This is the header that will be injected into the html after being formatted +# with the applicable data. Used entries are `date`, `sender`, `to`, `subject`, +# `cc`, `bcc` +HTML_INJECTABLE_HEADER = """ +
+
+

+ From: {sender}
+ Sent: {date}
+ To: {to}
+ Cc: {cc}
+ Bcc: {bcc}
+ Subject: {subject} + +

+
+
+""".replace(' ', '').replace('\r', '').replace('\n', '') + +# The header to be used for RTF files with encapsulated HTML. Uses the same +# properties as the HTML header. +# I'm just going to apologize in advance for how bad this looks. RTF in general +# is just not pretty to look at, and the garbage I had to do here didn't help. +# FYI, < and > will need to be sanitized if you actually want it to be properly +# compatible. "<" will become "{\*\htmltag84 <}\htmlrtf <\htmlrtf0" and ">" +# will become "{\*\htmltag84 >}\htmlrtf >\htmlrtf0". +RTF_ENC_INJECTABLE_HEADER = r""" +{{ +{{\*\htmltag96
}} +{{\*\htmltag96
}} +{{\*\htmltag64

}} + +\htmlrtf {{\b\htmlrtf0 +{{\*\htmltag84 }} +From: {{\*\htmltag92 }} +\htmlrtf \b0\htmlrtf0 {sender} +\htmlrtf }}\htmlrtf0 +{{\*\htmltag116
}} +\htmlrtf \line\htmlrtf0 + +\htmlrtf {{\b\htmlrtf0 +{{\*\htmltag84 }} +Sent: {{\*\htmltag92 }} +\htmlrtf \b0\htmlrtf0 {date} +\htmlrtf }}\htmlrtf0 +{{\*\htmltag116
}} +\htmlrtf \line\htmlrtf0 + +\htmlrtf {{\b\htmlrtf0 +{{\*\htmltag84 }} +Cc: {{\*\htmltag92 }} +\htmlrtf \b0\htmlrtf0 {cc} +\htmlrtf }}\htmlrtf0 +{{\*\htmltag116
}} +\htmlrtf \line\htmlrtf0 + +\htmlrtf {{\b\htmlrtf0 +{{\*\htmltag84 }} +Bcc: {{\*\htmltag92 }} +\htmlrtf \b0\htmlrtf0 {bcc} +\htmlrtf }}\htmlrtf0 +{{\*\htmltag116
}} +\htmlrtf \line\htmlrtf0 + +\htmlrtf {{\b\htmlrtf0 +{{\*\htmltag84 }} +Subject: {{\*\htmltag92 }} +\htmlrtf \b0\htmlrtf0 {subject} +\htmlrtf }}\htmlrtf0 +{{\*\htmltag244 }} +{{\*\htmltag252 }} +\htmlrtf \par\par\htmlrtf0 + +{{\*\htmltag72

}} +{{\*\htmltag104
}} +{{\*\htmltag104
}} +\htmlrtf }}\htmlrtf0 +""".replace('\r', '').replace('\n', '') + +# The header to be used for plain RTF files. Uses the same properties as the +# HTML header. +RTF_PLAIN_INJECTABLE_HEADER = r""" +{{ + {{\b From: \b0 {sender}}}\line + {{\b Sent: \b0 {date}}}\line + {{\b To: \b0 {to}}}\line + {{\b Cc: \b0 {cc}}}\line + {{\b Bcc: \b0 {bcc}}}\line + {{\b Subject: \b0 {subject}}}\par\par +}} +""".replace(' ', '').replace('\r', '').replace('\n', '') + + +KNOWN_CLASS_TYPES = ( + 'ipm.activity', + 'ipm.appointment', + 'ipm.contact', + 'ipm.distlist', + 'ipm.document', + 'ipm.ole.class', + 'ipm.note', + 'ipm.post', + 'ipm.stickynote', + 'ipm.recall.report', + 'ipm.report', + 'ipm.resend', + 'ipm.schedule', + 'ipm.task', + 'ipm.taskrequest', + 'report', +) # This is a dictionary matching the code page number to it's encoding name. # The list used to make this can be found here: # https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers ### TODO: # Many of these code pages are not supported by Python. As such, we should -# Really implement them ourselves to make sure that if someone wants to use an +# really implement them ourselves to make sure that if someone wants to use an # msg file with one of those encodings, they are able to. Perhaps we should # create a seperate module for that? 
# Code pages that currently don't have a supported encoding will be preceded by @@ -459,7 +610,7 @@ '--out-name', ] MAINDOC = "extract_msg:\n\tExtracts emails and attachments saved in Microsoft Outlook's .msg files.\n\n" \ - "https://github.com/mattgwwalker/msg-extractor" + "https://github.com/TeamMsgExtractor/msg-extractor" # Define pre-compiled structs to make unpacking slightly faster # General structs @@ -477,6 +628,7 @@ STFIX = struct.Struct('<8x8s') STVAR = struct.Struct('<8xi4s') # Structs to help with email type to python type conversions +STI8 = struct.Struct('b') +ST_BE_I16 = struct.Struct('>h') +ST_BE_I32 = struct.Struct('>i') +ST_BE_I64 = struct.Struct('>q') +ST_BE_UI8 = struct.Struct('>B') +ST_BE_UI16 = struct.Struct('>H') +ST_BE_UI32 = struct.Struct('>I') +ST_BE_UI64 = struct.Struct('>Q') +ST_BE_F32 = struct.Struct('>f') +ST_BE_F64 = struct.Struct('>d') PTYPES = { 0x0000: 'PtypUnspecified', diff --git a/extract_msg/contact.py b/extract_msg/contact.py index 9bfb37f1..5b4e776f 100644 --- a/extract_msg/contact.py +++ b/extract_msg/contact.py @@ -1,6 +1,6 @@ -from extract_msg import constants -from extract_msg.attachment import Attachment -from extract_msg.msg import MSGFile +from . import constants +from .attachment import Attachment +from .msg import MSGFile class Contact(MSGFile): @@ -12,6 +12,20 @@ def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = N MSGFile.__init__(self, path, prefix, attachmentClass, filename, overrideEncoding, attachmentErrorBehavior) self.named + @property + def birthday(self): + """ + The birthday of the contact. + """ + return self._ensureSetProperty('_birthday', '3A420040') + + @property + def businessFax(self): + """ + Contains the number of the contact's business fax. 
+ """ + return self._ensureSet('_businessFax', '__substg1.0_3A24') + @property def businessPhone(self): """ @@ -19,6 +33,49 @@ def businessPhone(self): """ return self._ensureSet('_businessPhone', '__substg1.0_3A08') + @property + def businessPhone2(self): + """ + Contains the second number or numbers of the contact's + business. + """ + return self._ensureSetTyped('_businessPhone2', '3A1B') + + @property + def businessUrl(self): + """ + Contains the URL of the homepage of the contact's business. + """ + return self._ensureSet('_businessUrl', '__substg1.0_3A51') + + @property + def callbackPhone(self): + """ + Contains the number of the contact's car phone. + """ + return self._ensureSet('_carPhone', '__substg1.0_3A1E') + + @property + def callbackPhone(self): + """ + Contains the contact's callback phone number. + """ + return self._ensureSet('_callbackPhone', '__substg1.0_3A02') + + @property + def carPhone(self): + """ + Contains the number of the contact's car phone. + """ + return self._ensureSet('_carPhone', '__substg1.0_3A1E') + + @property + def companyMainPhone(self): + """ + Contains the number of the main phone of the contact's company. + """ + return self._ensureSet('_companyMainPhone', '__substg1.0_3A57') + @property def companyName(self): """ @@ -69,6 +126,13 @@ def initials(self): """ return self._ensureSet('_initials', '__substg1.0_3A0A') + @property + def instantMessagingAddress(self): + """ + The instant messaging address of the contact. + """ + return self._ensureSetNamed('_instantMessagingAddress', '8062') + @property def jobTitle(self): """ @@ -111,9 +175,23 @@ def mobilePhone(self): """ return self._ensureSet('_mobilePhone', '__substg1.0_3A1C') + @property + def spouseName(self): + """ + The name of the contact's spouse. + """ + return self._ensureSet('_spouseName', '__substg1.0_3A') + @property def state(self): """ The state or province that the contact lives in. 
""" return self._ensureSet('_state', '__substg1.0_3A28') + + @property + def workAddress(self): + """ + The + """ + return self._ensureSetNamed('_workAddress', '801B') diff --git a/extract_msg/data.py b/extract_msg/data.py index e47050b5..c161f023 100644 --- a/extract_msg/data.py +++ b/extract_msg/data.py @@ -2,7 +2,7 @@ Various small data structures used in msg extractor. """ -from extract_msg import constants +from . import constants class PermanentEntryID(object): diff --git a/extract_msg/dev.py b/extract_msg/dev.py index c5cf52b8..5973fa16 100644 --- a/extract_msg/dev.py +++ b/extract_msg/dev.py @@ -11,18 +11,18 @@ import logging -from extract_msg import dev_classes -from extract_msg import utils -from extract_msg.compat import os_ as os -from extract_msg.message import Message +from . import dev_classes +from . import utils +from .compat import os_ as os +from .message import Message logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) -def setup_dev_logger(default_path=None, logfile = None, env_key='EXTRACT_MSG_LOG_CFG'): - utils.setup_logging(default_path, 5, logfile, True, env_key) +def setupDevLogger(defaultPath=None, logfile = None, envKey='EXTRACT_MSG_LOG_CFG'): + utils.setupLogging(defaultPath, 5, logfile, True, envKey) def main(args, argv): @@ -33,7 +33,7 @@ def main(args, argv): the list of arguments that were the input to the aforementioned function. 
""" - setup_dev_logger(args.config_path, args.log) + setupDevLogger(args.config_path, args.log) currentdir = os.getcwdu() # Store this just in case the paths that have been given are relative if args.out_path: if not os.path.exists(args.out_path): diff --git a/extract_msg/dev_classes/__init__.py b/extract_msg/dev_classes/__init__.py index e1e11c00..d2dc6fd2 100644 --- a/extract_msg/dev_classes/__init__.py +++ b/extract_msg/dev_classes/__init__.py @@ -1,2 +1,2 @@ -from extract_msg.dev_classes.attachment import Attachment -from extract_msg.dev_classes.message import Message +from .attachment import Attachment +from .message import Message diff --git a/extract_msg/dev_classes/attachment.py b/extract_msg/dev_classes/attachment.py index 1ee5005c..92bf984d 100644 --- a/extract_msg/dev_classes/attachment.py +++ b/extract_msg/dev_classes/attachment.py @@ -1,13 +1,13 @@ import logging -from extract_msg import constants -from extract_msg.properties import Properties -from extract_msg.utils import properHex +from .. import constants +from ..properties import Properties +from ..utils import properHex + logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) - class Attachment(object): """ Developer version of the `extract_msg.attachment.Attachment` class. 
@@ -25,10 +25,10 @@ def __init__(self, msg, dir_): constants.TYPE_ATTACHMENT) # Get attachment data - if msg.Exists([dir_, '__substg1.0_37010102']): + if msg.exists([dir_, '__substg1.0_37010102']): self.__type = 'data' self.__data = msg._getStream([dir_, '__substg1.0_37010102']) - elif msg.Exists([dir_, '__substg1.0_3701000D']): + elif msg.exists([dir_, '__substg1.0_3701000D']): if (self.__props['37050003'].value & 0x7) != 0x5: logger.log(5, 'Printing details of NotImplementedError...') logger.log(5, 'dir_ = {}'.format(dir_)) diff --git a/extract_msg/dev_classes/message.py b/extract_msg/dev_classes/message.py index bdf29687..c400ba51 100644 --- a/extract_msg/dev_classes/message.py +++ b/extract_msg/dev_classes/message.py @@ -2,16 +2,16 @@ import logging import olefile -from extract_msg import constants -from extract_msg.dev_classes.attachment import Attachment -from extract_msg.properties import Properties -from extract_msg.recipient import Recipient -from extract_msg.utils import has_len, inputToString, windowsUnicode +from .. import constants +from ..dev_classes.attachment import Attachment +from ..properties import Properties +from ..recipient import Recipient +from ..utils import hasLen, inputToString, windowsUnicode + logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) - class Message(olefile.OleFileIO): """ Developer version of the `extract_msg.message.Message` class. 
@@ -46,7 +46,7 @@ def __init__(self, path, prefix='', filename=None): prefix += '/' self.__prefix = prefix self.__prefixList = prefixl - + if tmp_condition: filename = self._getStringStream(prefixl[:-1] + ['__substg1.0_3001'], prefix=False) if filename is not None: @@ -97,21 +97,21 @@ def _getStringStream(self, filename, prefer='unicode', prefix=True): tmp = self._getStream(filename + '001E', prefix = False) return None if tmp is None else tmp.decode(self.stringEncoding) - def Exists(self, filename): + def exists(self, filename): """ Checks if :param filename: exists in the msg file. """ filename = self.fix_path(filename) return self.exists(filename) - + def sExists(self, filename): """ Checks if string stream :param filename: exists in the msg file. """ filename = self.fix_path(filename) return self.exists(filename + '001F') or self.exists(filename + '001E') - - def fix_path(self, filename, prefix=True): + + def fixPath(self, filename, prefix=True): """ Changes paths so that they have the proper prefix (should :param prefix: be True) and @@ -194,7 +194,7 @@ def date(self): except AttributeError: self._date = self._prop.date return self._date - + @property def mainProperties(self): """ @@ -276,7 +276,7 @@ def stringEncoding(self): # Now, this next line SHOULD work, but it is possible that it might not... self.__stringEncoding = str(enc) return self.__stringEncoding - + @stringEncoding.setter def stringEncoding(self, enc): self.__stringEncoding = enc diff --git a/extract_msg/exceptions.py b/extract_msg/exceptions.py index 05b528dd..a3d26153 100644 --- a/extract_msg/exceptions.py +++ b/extract_msg/exceptions.py @@ -8,7 +8,7 @@ This module contains the set of extract_msg exceptions. """ -# add logger bus +# Add logger bus. 
logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) @@ -65,6 +65,12 @@ class UnknownTypeError(Exception): """ pass +class UnsupportedMSGTypeError(NotImplementedError): + """ + An exception that is raised when an MSG class is recognized but not + supported. + """ + class UnrecognizedMSGTypeError(TypeError): """ An exception that is raised when the module cannot determine how to properly diff --git a/extract_msg/message.py b/extract_msg/message.py index 05b641ab..02fd323a 100644 --- a/extract_msg/message.py +++ b/extract_msg/message.py @@ -1,20 +1,15 @@ -import email.utils import json import logging -import re +import zipfile -import compressed_rtf from imapclient.imapclient import decode_utf7 -from email.parser import Parser as EmailParser -from extract_msg import constants -from extract_msg.attachment import Attachment -from extract_msg.compat import os_ as os -from extract_msg.exceptions import DataNotFoundError, IncompatibleOptionsError -from extract_msg.message_base import MessageBase -from extract_msg.recipient import Recipient -from extract_msg.utils import addNumToDir, inputToBytes, inputToString, prepareFilename - +from . import constants +from .attachment import Attachment +from .compat import os_ as os +from .exceptions import DataNotFoundError, IncompatibleOptionsError +from .message_base import MessageBase +from .utils import addNumToDir, addNumToZipDir, injectHtmlHeader, injectRtfHeader, inputToBytes, inputToString, makeDirs, prepareFilename logger = logging.getLogger(__name__) @@ -24,9 +19,8 @@ class Message(MessageBase): """ Parser for Microsoft Outlook message files. 
""" - - def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = None, delayAttachments = False, overrideEncoding = None, attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW): - MessageBase.__init__(self, path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior) + def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = None, delayAttachments = False, overrideEncoding = None, attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW, recipientSeparator = ';'): + MessageBase.__init__(self, path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior, recipientSeparator) def dump(self): """ @@ -38,180 +32,238 @@ def dump(self): print('Body:') print(self.body) - def save(self, toJson = False, useFileName = False, raw = False, ContentId = False, customPath = None, customFilename = None):#, html = False, rtf = False, allowFallback = False): + def getJson(self): + """ + Returns the JSON representation of the Message. + """ + return json.dumps({ + 'from': inputToString(self.sender, 'utf-8'), + 'to': inputToString(self.to, 'utf-8'), + 'cc': inputToString(self.cc, 'utf-8'), + 'bcc': inputToString(self.bcc, 'utf-8'), + 'subject': inputToString(self.subject, 'utf-8'), + 'date': inputToString(self.date, 'utf-8'), + 'body': decode_utf7(self.body), + }) + + def save(self, **kwargs): """ - Saves the message body and attachments found in the message. The body and - attachments are stored in a folder. Setting useFileName to true will mean that - the filename is used as the name of the folder; otherwise, the message's date - and subject are used as the folder name. - Here is the absolute order of prioity for the name of the folder: - 1. customFilename - 2. self.filename if useFileName - 3. {date} {subject} + Saves the message body and attachments found in the message. 
+ + The body and attachments are stored in a folder in the current running + directory unless :param customPath: has been specified. The name of the + folder will be determined by 3 factors. + * If :param customFilename: has been set, the value provided for that + will be used. + * If :param useMsgFilename: has been set, the name of the file used + to create the Message instance will be used. + * If the file name has not been provided or :param useMsgFilename: + has not been set, the name of the folder will be created using the + `defaultFolderName` property. + * :param maxNameLength: will force all file names to be shortened + to fit in the space (with the extension included in the length). If + a number is added to the directory that will not be included in the + length, so it is recommended to plan for up to 5 characters extra + to be a part of the name. Default is 256. + + It should be noted that regardless of the value for maxNameLength, the + name of the file containing the body will always have the name 'message' + followed by the full extension. + + There are several parameters used to determine how the message will be + saved. By default, the message will be saved as plain text. Setting one + of the following parameters to True will change that: + * :param html: will try to output the message in HTML format. + * :param json: will output the message in JSON format. + * :param raw: will output the message in a raw format. + * :param rtf: will output the message in RTF format. + + Usage of more than one formatting parameter will raise an exception. + + Using HTML or RTF will raise an exception if they could not be retrieved + unless you have :param allowFallback: set to True. Fallback will go in + this order, starting at the top most format that is set: + * HTML + * RTF + * Plain text + + If you want to save the contents into a ZipFile or similar object, + either pass a path to where you want to create one or pass an instance + to :param zip:. 
If :param zip: is an instance, :param customPath: will + refer to a location inside the zip file. + + If you want to save the header, should it be found, set + :param saveHeader: to true. """ - #There are several parameters used to determine how the message will be saved. - #By default, the message will be saved as plain text. Setting one of the - #following parameters to True will change that: - # * :param html: will try to output the message in HTML format. - # * :param json: will output the message in JSON format. - # * :param raw: will output the message in a raw format. - # * :param rtf: will output the message in RTF format. - # - #Usage of more than one formatting parameter will raise an exception. - # - #Using HTML or RTF will raise an exception if they could not be retrieved - #unless you have :param allowFallback: set to True. Fallback will go in this - #order, starting at the top most format that is set: - # * HTML - # * RTF - # * Plain text - #""" - count = 1 if toJson else 0 - #count += 1 if html else 0 - #count += 1 if rtf else 0 - count += 1 if raw else 0 - - if count > 1: + + # Move keyword arguments into variables. + _json = kwargs.get('json', False) + html = kwargs.get('html', False) + rtf = kwargs.get('rtf', False) + raw = kwargs.get('raw', False) + allowFallback = kwargs.get('allowFallback', False) + _zip = kwargs.get('zip') + maxNameLength = kwargs.get('maxNameLength', 256) + + # Variables involved in the save location. + customFilename = kwargs.get('customFilename') + useMsgFilename = kwargs.get('useMsgFilename', False) + #maxPathLength = kwargs.get('maxPathLength', 255) + + # ZipFile handling. + if _zip: + # `raw` and `zip` are incompatible. + if raw: + raise IncompatibleOptionsError('The options `raw` and `zip` are incompatible.') + # If we are doing a zip file, first check that we have been given a path. + if isinstance(_zip, constants.STRING): + # If we have a path then we use the zip file. 
+ _zip = zipfile.ZipFile(_zip, 'a', zipfile.ZIP_DEFLATED) + kwargs['zip'] = _zip + createdZip = True + else: + createdZip = False + # Path needs to be done in a special way if we are in a zip file. + path = kwargs.get('customPath', '').replace('\\', '/') + path += '/' if path and path[-1] != '/' else '' + # Set the open command to be that of the zip file. + _open = _zip.open + # Zip files use w for writing in binary. + mode = 'w' + else: + path = os.path.abspath(kwargs.get('customPath', os.getcwdu())).replace('\\', '/') + # Prepare the path. + path += '/' if path[-1] != '/' else '' + mode = 'wb' + _open = open + + # Reset this for sub save calls. + kwargs['customFilename'] = None + + # Check if incompatible options have been provided in any way. + if _json + html + rtf + raw > 1: raise IncompatibleOptionsError('Only one of the following options may be used at a time: toJson, raw, html, rtf') + # Get the type of line endings. crlf = inputToBytes(self.crlf, 'utf-8') - if customFilename != None and customFilename != '': - dirName = customFilename + # TODO: insert code here that will handle checking all of the msg files to see if the path will overflow. + + if customFilename: + # First we need to validate it. If there are invalid characters, this will detect it. + if constants.RE_INVALID_FILENAME_CHARACTERS.search(customFilename): + raise ValueError('Invalid character found in customFilename. Must not contain any of the following characters: \\/:*?"<>|') + path += customFilename[:maxNameLength] + elif useMsgFilename: + if not self.filename: + raise ValueError(':param useMsgFilename: is only available if you are using an msg file on the disk or have provided a filename.') + # Get the actual name of the file. + filename = os.path.split(self.filename)[1] + # Remove the extensions. + filename = os.path.splitext(filename)[0] + # Prepare the filename by removing any special characters. + filename = prepareFilename(filename) + # Shorten the filename. 
+ filename = filename[:maxNameLength] + # Check to make sure we actually have a filename to use. + if not filename: + raise ValueError('Invalid filename found in self.filename: "{}"'.format(self.filename)) + + # Add the file name to the path. + path += filename[:maxNameLength] else: - if useFileName: - # strip out the extension - if self.filename is not None: - dirName = self.filename.split('/').pop().split('.')[0] - else: - ValueError( - 'Filename must be specified, or path must have been an actual path, to save using filename') - else: - # Create a directory based on the date and subject of the message - d = self.parsedDate - if d is not None: - dirName = '{0:02d}-{1:02d}-{2:02d}_{3:02d}{4:02d}'.format(*d) - else: - dirName = 'UnknownDate' + path += self.defaultFolderName[:maxNameLength] - if self.subject is None: - subject = '[No subject]' + # Create the folders. + if not _zip: + try: + makeDirs(path) + except Exception: + newDirName = addNumToDir(path) + if newDirName: + path = newDirName else: - subject = prepareFilename(self.subject) + raise Exception( + 'Failed to create directory "%s". Does it already exist?' % + path + ) + else: + # In my testing I ended up with multiple files in a zip at the same + # location so let's try to handle that. + if any(x.startswith(path.rstrip('/') + '/') for x in _zip.namelist()): + path = newDirName = addNumToZipDir(path, _zip) - dirName = dirName + ' ' + subject + # Prepare the path one last time. + path += '/' if path[-1] != '/' else '' - if customPath != None and customPath != '': - if customPath[-1] != '/' or customPath[-1] != '\\': - customPath += '/' - dirName = customPath + dirName - try: - os.makedirs(dirName) - except Exception: - newDirName = addNumToDir(dirName) - if newDirName is not None: - dirName = newDirName - else: - raise Exception( - "Failed to create directory '%s'. Does it already exist?" % - dirName - ) + # Update the kwargs. 
+ kwargs['customPath'] = path - oldDir = os.getcwdu() - try: - os.chdir(dirName) - attachmentNames = [] - # Save the attachments - for attachment in self.attachments: - attachmentNames.append(attachment.save(ContentId, toJson, useFileName, raw))#, html = html, rtf = rtf, allowFallback = allowFallback)) + if raw: + self.saveRaw(path) + return self - # Save the message body - fext = 'json' if toJson else 'txt' + try: + # Check whether we should be using HTML or RTF. + fext = 'txt' useHtml = False useRtf = False - #if html: - # if self.htmlBody is not None: - # useHtml = True - # fext = 'html' - #elif not allowFallback: - # raise DataNotFoundError('Could not find the htmlBody') - - #if rtf or (html and not useHtml): - # if self.rtfBody is not None: - # useRtf = True - # fext = 'rtf' - #elif not allowFallback: - # raise DataNotFoundError('Could not find the rtfBody') - - with open('message.' + fext, 'wb') as f: - if toJson: - emailObj = {'from': inputToString(self.sender, 'utf-8'), - 'to': inputToString(self.to, 'utf-8'), - 'cc': inputToString(self.cc, 'utf-8'), - 'subject': inputToString(self.subject, 'utf-8'), - 'date': inputToString(self.date, 'utf-8'), - 'attachments': attachmentNames, - 'body': decode_utf7(self.body)} + if html: + if self.htmlBody: + useHtml = True + fext = 'html' + elif not allowFallback: + raise DataNotFoundError('Could not find the htmlBody') + + if rtf or (html and not useHtml): + if self.rtfBody: + useRtf = True + fext = 'rtf' + elif not allowFallback: + raise DataNotFoundError('Could not find the rtfBody') + + # Save the attachments. + attachmentNames = [attachment.save(**kwargs) for attachment in self.attachments] + + # Determine the extension to use for the body. + fext = 'json' if _json else fext + + with _open(path + 'message.' 
+ fext, mode) as f: + if _json: + emailObj = json.loads(self.getJson()) + emailObj['attachments'] = attachmentNames f.write(inputToBytes(json.dumps(emailObj), 'utf-8')) else: if useHtml: - # Do stuff - pass + # Inject the header into the data and then write it to + # the file. + data = injectHtmlHeader(self) + f.write(data) elif useRtf: - # Do stuff - pass + # Inject the header into the data and then write it to + # the file. + data = injectRtfHeader(self) + f.write(data) else: f.write(b'From: ' + inputToBytes(self.sender, 'utf-8') + crlf) f.write(b'To: ' + inputToBytes(self.to, 'utf-8') + crlf) - f.write(b'CC: ' + inputToBytes(self.cc, 'utf-8') + crlf) + f.write(b'Cc: ' + inputToBytes(self.cc, 'utf-8') + crlf) + f.write(b'Bcc: ' + inputToBytes(self.bcc, 'utf-8') + crlf) f.write(b'Subject: ' + inputToBytes(self.subject, 'utf-8') + crlf) f.write(b'Date: ' + inputToBytes(self.date, 'utf-8') + crlf) f.write(b'-----------------' + crlf + crlf) f.write(inputToBytes(self.body, 'utf-8')) - except Exception as e: - self.saveRaw() + except Exception: + if not _zip: + self.saveRaw(path) raise - finally: - # Return to previous directory - os.chdir(oldDir) + # Close the ZipFile if this function created it. + if _zip and createdZip: + _zip.close() # Return the instance so that functions can easily be chained. 
return self - - def saveRaw(self): - # Create a 'raw' folder - oldDir = os.getcwdu() - try: - rawDir = 'raw' - os.makedirs(rawDir) - os.chdir(rawDir) - sysRawDir = os.getcwdu() - - # Loop through all the directories - for dir_ in self.listdir(): - sysdir = '/'.join(dir_) - code = dir_[-1][-8:] - if code in constants.PROPERTIES: - sysdir = sysdir + ' - ' + constants.PROPERTIES[code] - os.makedirs(sysdir) - os.chdir(sysdir) - - # Generate appropriate filename - if dir_[-1].endswith('001E'): - filename = 'contents.txt' - else: - filename = 'contents' - - # Save contents of directory - with open(filename, 'wb') as f: - f.write(self._getStream(dir_)) - - # Return to base directory - os.chdir(sysRawDir) - - finally: - os.chdir(oldDir) diff --git a/extract_msg/message_base.py b/extract_msg/message_base.py index a79a8123..f05aa8e1 100644 --- a/extract_msg/message_base.py +++ b/extract_msg/message_base.py @@ -1,20 +1,18 @@ import email.utils -import json import logging import re import compressed_rtf -from imapclient.imapclient import decode_utf7 +from . import constants +from .attachment import Attachment, BrokenAttachment, UnsupportedAttachment +from .compat import os_ as os +from .exceptions import UnrecognizedMSGTypeError +from .msg import MSGFile +from .recipient import Recipient +from .utils import addNumToDir, inputToBytes, inputToString, prepareFilename from email.parser import Parser as EmailParser -from extract_msg import constants -from extract_msg.attachment import Attachment, BrokenAttachment, UnsupportedAttachment -from extract_msg.compat import os_ as os -from extract_msg.msg import MSGFile -from extract_msg.recipient import Recipient -from extract_msg.utils import addNumToDir, inputToBytes, inputToString - - +from imapclient.imapclient import decode_utf7 logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) @@ -23,10 +21,9 @@ class MessageBase(MSGFile): """ Base class for Message like msg files. 
""" - def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = None, delayAttachments = False, overrideEncoding = None, - attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW): + attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW, recipientSeparator = ';'): """ :param path: path to the msg file in the system or is the raw msg file. :param prefix: used for extracting embeded msg files @@ -36,16 +33,24 @@ def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = N will use for attachments. You probably should not change this value unless you know what you are doing. - :param filename: optional, the filename to be used by default when saving. - :param delayAttachments: optional, delays the initialization of attachments - until the user attempts to retrieve them. Allows MSG files with bad - attachments to be initialized so the other data can be retrieved. + :param filename: optional, the filename to be used by default when + saving. + :param delayAttachments: optional, delays the initialization of + attachments until the user attempts to retrieve them. Allows MSG + files with bad attachments to be initialized so the other data can + be retrieved. :param overrideEncoding: optional, an encoding to use instead of the one - specified by the msg file. Do not report encoding errors caused by this. + specified by the msg file. Do not report encoding errors caused by + this. + :param attachmentErrorBehavior: Optional, the behaviour to use in the event + of an error when parsing the attachments. + :param recipientSeparator: Optional, Separator string to use between + recipients. """ MSGFile.__init__(self, path, prefix, attachmentClass, filename, overrideEncoding, attachmentErrorBehavior) self.__attachmentsDelayed = delayAttachments self.__attachmentsReady = False + self.__recipientSeparator = recipientSeparator # Initialize properties in the order that is least likely to cause bugs. 
# TODO have each function check for initialization of needed data so these # lines will be unnecessary. @@ -64,35 +69,43 @@ def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = N def _genRecipient(self, recipientType, recipientInt): """ - Returns the specified recipient field + Returns the specified recipient field. """ private = '_' + recipientType try: return getattr(self, private) except AttributeError: - # Check header first value = None + # Check header first. if self.headerInit(): value = self.header[recipientType] - if value is None: + if value: + value = value.replace(',', self.__recipientSeparator) + + # If the header had a blank field or didn't have the field, generate it manually. + if not value: + # Check if the header has initialized. if self.headerInit(): logger.info('Header found, but "{}" is not included. Will be generated from other streams.'.format(recipientType)) - f = [] - for x in self.recipients: - if x.type & 0x0000000f == recipientInt: - f.append(x.formatted) - if len(f) > 0: - st = f[0] - if len(f) > 1: - for x in range(1, len(f)): - st += ', {0}'.format(f[x]) - value = st - if value is not None: + + # Get a list of the recipients of the specified type. + foundRecipients = tuple(recipient.formatted for recipient in self.recipients if recipient.type & 0x0000000f == recipientInt) + + # If we found recipients, join them with the recipient separator and a space. + if len(foundRecipients) > 0: + value = (self.__recipientSeparator + ' ').join(foundRecipients) + + # Code to fix the formatting so it's all a single line. This allows the user to format it themself if they want. + # This should probably be redone to use re or something, but I can do that later. This shouldn't be a huge problem for now. 
+ if value: value = value.replace(' \r\n\t', ' ').replace('\r\n\t ', ' ').replace('\r\n\t', ' ') value = value.replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') while value.find(' ') != -1: value = value.replace(' ', ' ') + + # Set the field in the class. setattr(self, private, value) + return value def _registerNamedProperty(self, entry, _type, name = None): @@ -108,6 +121,7 @@ def _registerNamedProperty(self, entry, _type, name = None): def close(self): try: + # If this throws an AttributeError then we have not loaded the attachments. self._attachments for attachment in self.attachments: if attachment.type == 'msg': @@ -126,12 +140,12 @@ def headerInit(self): except AttributeError: return False - def save_attachments(self, contentId = False, json = False, useFileName = False, raw = False, customPath = None): + def saveAttachments(self, **kwargs): """ Saves only attachments in the same folder. """ for attachment in self.attachments: - attachment.save(contentId, json, useFileName, raw, customPath) + attachment.save(**kwargs) @property def attachments(self): @@ -143,18 +157,18 @@ def attachments(self): except AttributeError: # Get the attachments attachmentDirs = [] - + prefixLen = self.prefixLen for dir_ in self.listDir(False, True): - if dir_[len(self.prefixList)].startswith('__attach') and\ - dir_[len(self.prefixList)] not in attachmentDirs: - attachmentDirs.append(dir_[len(self.prefixList)]) + if dir_[prefixLen].startswith('__attach') and\ + dir_[prefixLen] not in attachmentDirs: + attachmentDirs.append(dir_[prefixLen]) self._attachments = [] for attachmentDir in attachmentDirs: try: self._attachments.append(self.attachmentClass(self, attachmentDir)) - except NotImplementedError as e: + except (NotImplementedError, UnrecognizedMSGTypeError) as e: if self.attachmentErrorBehavior > constants.ATTACHMENT_ERROR_THROW: logger.error('Error processing attachment at {}'.format(attachmentDir)) logger.exception(e) @@ -252,6 +266,22 @@ def date(self): self._date = 
self._prop.date return self._date + @property + def defaultFolderName(self): + """ + Generates the default name of the save folder. + """ + try: + return self._defaultFolderName + except AttributeError: + d = self.parsedDate + + dirName = '{0:02d}-{1:02d}-{2:02d}_{3:02d}{4:02d}'.format(*d) if d else 'UnknownDate' + dirName += ' ' + (prepareFilename(self.subject) if self.subject else '[No subject]') + + self._defaultFolderName = dirName + return dirName + @property def header(self): """ @@ -271,6 +301,7 @@ def header(self): header.add_header('From', self.sender) header.add_header('To', self.to) header.add_header('Cc', self.cc) + header.add_header('Bcc', self.bcc) header.add_header('Message-Id', self.messageId) # TODO find authentication results outside of header header.add_header('Authentication-Results', None) @@ -333,6 +364,10 @@ def messageId(self): def parsedDate(self): return email.utils.parsedate(self.date) + @property + def recipientSeparator(self): + return self.__recipientSeparator + @property def recipients(self): """ @@ -343,11 +378,11 @@ def recipients(self): except AttributeError: # Get the recipients recipientDirs = [] - + prefixLen = self.prefixLen for dir_ in self.listDir(): - if dir_[len(self.prefixList)].startswith('__recip') and\ - dir_[len(self.prefixList)] not in recipientDirs: - recipientDirs.append(dir_[len(self.prefixList)]) + if dir_[prefixLen].startswith('__recip') and\ + dir_[prefixLen] not in recipientDirs: + recipientDirs.append(dir_[prefixLen]) self._recipients = [] @@ -361,7 +396,11 @@ def rtfBody(self): """ Returns the decompressed Rtf body from the message. 
""" - return compressed_rtf.decompress(self.compressedRtf) + try: + return self._rtfBody + except AttributeError: + self._rtfBody = compressed_rtf.decompress(self.compressedRtf) if self.compressedRtf else None + return self._rtfBody @property def sender(self): diff --git a/extract_msg/msg.py b/extract_msg/msg.py index 9a71922e..477d6b95 100644 --- a/extract_msg/msg.py +++ b/extract_msg/msg.py @@ -1,17 +1,19 @@ import codecs import copy import logging +import sys +import zipfile import olefile -from extract_msg import constants -from extract_msg.attachment import Attachment -from extract_msg.named import Named -from extract_msg.prop import FixedLengthProp, VariableLengthProp -from extract_msg.properties import Properties -from extract_msg.utils import divide, getEncodingName, has_len, inputToMsgpath, inputToString, msgpathToString, parseType, properHex, verifyPropertyId, verifyType, windowsUnicode -from extract_msg.exceptions import InvalidFileFormatError, MissingEncodingError - +from . import constants +from .attachment import Attachment +from .compat import os_ as os +from .named import Named +from .prop import FixedLengthProp, VariableLengthProp +from .properties import Properties +from .utils import divide, getEncodingName, hasLen, inputToMsgpath, inputToString, makeDirs, msgpathToString, parseType, properHex, verifyPropertyId, verifyType, windowsUnicode +from .exceptions import InvalidFileFormatError, MissingEncodingError logger = logging.getLogger(__name__) @@ -57,8 +59,7 @@ def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = N raise prefixl = [] - tmp_condition = prefix != '' - if tmp_condition: + if prefix: try: prefix = inputToString(prefix, 'utf-8') except: @@ -76,11 +77,12 @@ def __init__(self, path, prefix = '', attachmentClass = Attachment, filename = N prefix += '/' self.__prefix = prefix self.__prefixList = prefixl - if tmp_condition: + self.__prefixLen = len(prefixl) + if prefix: filename = 
self._getStringStream(prefixl[:-1] + ['__substg1.0_3001'], prefix=False) - if filename is not None: + if filename: self.filename = filename - elif has_len(path): + elif hasLen(path): if len(path) < 1536: self.filename = path else: @@ -131,16 +133,27 @@ def _ensureSetProperty(self, variable, propertyName): setattr(self, variable, value) return value + def _ensureSetTyped(self, variable, _id): + """ + Like the other ensure set functions, but designed for when something could be multiple types (where only one will be present). This way you have no need to set the type, it will be handled for you. + """ + try: + return getattr(self, variable) + except AttributeError: + value = self._getTypedData(_id) + setattr(self, variable, value) + return value + def _getStream(self, filename, prefix = True): """ Gets a binary representation of the requested filename. This should ALWAYS return a bytes object (string in python 2) """ - filename = self.fix_path(filename, prefix) - if self.exists(filename): + filename = self.fixPath(filename, prefix) + if self.exists(filename, False): with self.openstream(filename) as stream: - return stream.read() + return stream.read() or b'' else: logger.info('Stream "{}" was requested but could not be found. Returning `None`.'.format(filename)) return None @@ -157,14 +170,14 @@ def _getStringStream(self, filename, prefix = True): This should ALWAYS return a string (Unicode in python 2) """ - filename = self.fix_path(filename, prefix) + filename = self.fixPath(filename, prefix) if self.areStringsUnicode: return windowsUnicode(self._getStream(filename + '001F', prefix = False)) else: tmp = self._getStream(filename + '001E', prefix = False) return None if tmp is None else tmp.decode(self.stringEncoding) - def _getTypedData(self, id, _type = None, prefix = True): + def _getTypedData(self, _id, _type = None, prefix = True): """ Gets the data for the specified id as the type that it is supposed to be. 
:param id: MUST be a 4 digit hexadecimal @@ -175,13 +188,13 @@ def _getTypedData(self, id, _type = None, prefix = True): constant FIXED_LENGTH_PROPS_STRING or VARIABLE_LENGTH_PROPS_STRING. """ - verifyPropertyId(id) - id = id.upper() - found, result = self._getTypedStream('__substg1.0_' + id, prefix, _type) + verifyPropertyId(_id) + _id = _id.upper() + found, result = self._getTypedStream('__substg1.0_' + _id, prefix, _type) if found: return result else: - found, result = self._getTypedProperty(id, _type) + found, result = self._getTypedProperty(_id, _type) return result if found else None def _getTypedProperty(self, propertyID, _type = None): @@ -226,7 +239,7 @@ def _getTypedStream(self, filename, prefix = True, _type = None): it could not find the stream specified. """ verifyType(_type) - filename = self.fix_path(filename, prefix) + filename = self.fixPath(filename, prefix) for x in (filename + _type,) if _type is not None else self.slistDir(): if x.startswith(filename) and x.find('-') == -1: contents = self._getStream(x, False) @@ -248,9 +261,9 @@ def _getTypedStream(self, filename, prefix = True, _type = None): else: raise NotImplementedError('The stream specified is of type {}. We don\'t currently understand exactly how this type works. 
If it is mandatory that you have the contents of this stream, please create an issue labled "NotImplementedError: _getTypedStream {}".'.format(_type, _type)) if _type in ('101F', '101E', '1102'): - if self.Exists(x + '-00000000', False): + if self.exists(x + '-00000000', False): for y in range(streams): - if self.Exists(x + '-' + properHex(y, 8), False): + if self.exists(x + '-' + properHex(y, 8), False): extras.append(self._getStream(x + '-' + properHex(y, 8), False)) elif _type in ('1002', '1003', '1004', '1005', '1007', '1014', '1040', '1048'): extras = divide(contents, (2 if _type in constants.MULTIPLE_2_BYTES else 4 if _type in constants.MULTIPLE_4_BYTES else 8 if _type in constants.MULTIPLE_8_BYTES else 16)) @@ -272,21 +285,21 @@ def debug(self): print('Directory: ' + str(dir_[:-1])) print('Contents: {}'.format(self._getStream(dir_))) - def Exists(self, inp, prefix = True): + def exists(self, inp, prefix = True): """ Checks if :param inp: exists in the msg file. """ - inp = self.fix_path(inp, prefix) - return self.exists(inp) + inp = self.fixPath(inp, prefix) + return olefile.OleFileIO.exists(self, inp) def sExists(self, inp, prefix = True): """ Checks if string stream :param inp: exists in the msg file. """ - inp = self.fix_path(inp, prefix) + inp = self.fixPath(inp, prefix) return self.exists(inp + '001F') or self.exists(inp + '001E') - def ExistsTypedProperty(self, id, location = None, _type = None, prefix = True, propertiesInstance = None): + def existsTypedProperty(self, _id, location = None, _type = None, prefix = True, propertiesInstance = None): """ Determines if the stream with the provided id exists in the location specified. If no location is specified, the root directory is searched. 
The return of this @@ -296,35 +309,33 @@ def ExistsTypedProperty(self, id, location = None, _type = None, prefix = True, Because of how this function works, any folder that contains it's own "__properties_version1.0" file should have this function called from it's class. """ - verifyPropertyId(id) + verifyPropertyId(_id) verifyType(_type) - id = id.upper() + _id = _id.upper() if propertiesInstance is None: propertiesInstance = self.mainProperties prefixList = self.prefixList if prefix else [] if location is not None: prefixList.append(location) prefixList = inputToMsgpath(prefixList) - usableid = id + _type if _type is not None else id - found_number = 0 - found_streams = [] + usableId = _id + _type if _type else _id + foundNumber = 0 + foundStreams = [] for item in self.listDir(): - if len(item) > len(prefixList): - if item[len(prefixList)].startswith('__substg1.0_' + usableid) and item[len(prefixList)] not in found_streams: - found_number += 1 - found_streams.append(item[len(prefixList)]) + if len(item) > self.__prefixLen: + if item[self.__prefixLen].startswith('__substg1.0_' + usableId) and item[self.__prefixLen] not in foundStreams: + foundNumber += 1 + foundStreams.append(item[self.__prefixLen]) for x in propertiesInstance: - if x.startswith(usableid): - already_found = False - for y in found_streams: + if x.startswith(usableId): + for y in foundStreams: if y.endswith(x): - already_found = True break - if not already_found: - found_number += 1 - return (found_number > 0), found_number + else: + foundNumber += 1 + return (foundNumber > 0), foundNumber - def fix_path(self, inp, prefix = True): + def fixPath(self, inp, prefix = True): """ Changes paths so that they have the proper prefix (should :param prefix: be True) and @@ -339,24 +350,21 @@ def listDir(self, streams = True, storages = False): """ Replacement for OleFileIO.listdir that runs at the current prefix directory. 
""" - temp = self.listdir(streams, storages) - if self.__prefix == '': - return temp - prefix = self.__prefix.split('/') - if prefix[-1] == '': - prefix.pop() - out = [] - for x in temp: - good = True - if len(x) <= len(prefix): - good = False - if good: - for y in range(len(prefix)): - if x[y] != prefix[y]: - good = False - if good: - out.append(x) - return out + # Get the items from OleFileIO. + try: + return self.__listDirRes + except AttributeError: + temp = self.listdir(streams, storages) + if not self.__prefix: + return temp + prefix = self.__prefix.split('/') + if prefix[-1] == '': + prefix.pop() + + prefixLength = self.__prefixLen + self.__listDirRes = [x for x in temp if len(x) > prefixLength and x[:prefixLength] == prefix] + return self.__listDirRes + def slistDir(self, streams = True, storages = False): """ @@ -368,6 +376,46 @@ def slistDir(self, streams = True, storages = False): def save(self, *args, **kwargs): raise NotImplementedError('Saving is not yet supported for the {} class'.format(self.__class__.__name__)) + def saveRaw(self, path): + # Create a 'raw' folder + path = path.replace('\\', '/') + path += '/' if path[-1] != '/' else '' + # Make the location + makeDirs(path, exist_ok = True) + # Create the zipfile + path += 'raw.zip' + if os.path.exists(path): + raise FileExistsError('File "{}" already exists.'.format(path)) + with zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED) as zfile: + # Loop through all the directories + for dir_ in self.listdir(): + sysdir = '/'.join(dir_) + code = dir_[-1][-8:] + if constants.PROPERTIES.get(code): + sysdir += ' - ' + constants.PROPERTIES[code] + + # Generate appropriate filename + if dir_[-1].endswith('001E') or dir_[-1].endswith('001F'): + filename = 'contents.txt' + else: + filename = 'contents.bin' + + # Save contents of directory + if sys.version_info[0] < 3: + # Python 2 zip files don't seem to actually match the docs, and `open` simply opens in read mode, even though it should be able to open in write 
mode. + data = self._getStream(dir_) + if data is not None: + zfile.writestr(sysdir + '/' + filename, data, zipfile.ZIP_DEFLATED) + + else: + with zfile.open(sysdir + '/' + filename, 'w') as f: + data = self._getStream(dir_) + # Specifically check for None. If this is bytes we still want to do this line. + # There was actually this weird issue where for some reason data would be bytes + # but then also simultaneously register as None? + if data is not None: + f.write(data) + @property def areStringsUnicode(self): """ @@ -463,6 +511,13 @@ def prefix(self): """ return self.__prefix + @property + def prefixLen(self): + """ + Returns the number of elements in the prefix. + """ + return self.__prefixLen + @property def prefixList(self): """ diff --git a/extract_msg/named.py b/extract_msg/named.py index 72beb926..7cdf0a2c 100644 --- a/extract_msg/named.py +++ b/extract_msg/named.py @@ -1,12 +1,10 @@ import copy import logging import pprint -import zlib - -from extract_msg import constants -from extract_msg.utils import bytesToGuid, divide, properHex, roundUp - +from . 
import constants +from .utils import bytesToGuid, divide, properHex, roundUp +from compressed_rtf.crc32 import crc32 logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) @@ -16,29 +14,25 @@ class Named(object): def __init__(self, msg): super(Named, self).__init__() self.__msg = msg - guid_stream = self._getStream('__substg1.0_00020102') - entry_stream = self._getStream('__substg1.0_00030102') - names_stream = self._getStream('__substg1.0_00040102') - guid_stream = self._getStream('__substg1.0_00020102', False) if guid_stream is None else guid_stream - entry_stream = self._getStream('__substg1.0_00030102', False) if entry_stream is None else entry_stream - names_stream = self._getStream('__substg1.0_00040102', False) if names_stream is None else names_stream - self.guid_stream = guid_stream - self.entry_stream = entry_stream - self.names_stream = names_stream - guid_stream_length = len(guid_stream) - entry_stream_length = len(entry_stream) - names_stream_length = len(names_stream) - # TODO guid stream parsing - guids = tuple([None, constants.GUID_PS_MAPI, constants.GUID_PS_PUBLIC_STRINGS] + [bytesToGuid(x) for x in divide(guid_stream, 16)]) - # TODO entry_stream parsing + guidStream = self._getStream('__substg1.0_00020102') or self._getStream('__substg1.0_00020102', False) + entryStream = self._getStream('__substg1.0_00030102') or self._getStream('__substg1.0_00030102', False) + namesStream = self._getStream('__substg1.0_00040102') or self._getStream('__substg1.0_00040102', False) + self.guidStream = guidStream + self.entryStream = entryStream + self.namesStream = namesStream + guidStreamLength = len(guidStream) + entryStreamLength = len(entryStream) + namesStreamLength = len(namesStream) + guids = tuple([None, constants.GUID_PS_MAPI, constants.GUID_PS_PUBLIC_STRINGS] + [bytesToGuid(x) for x in divide(guidStream, 16)]) entries = [] - for x in divide(entry_stream, 8): - tmp = constants.STNP_ENT.unpack(x) + for rawStream in divide(entryStream, 
8): + tmp = constants.STNP_ENT.unpack(rawStream) entry = { 'id': tmp[0], 'pid': tmp[2], 'guid_index': tmp[1] >> 1, 'pkind': tmp[1] & 1, # 0 if numerical, 1 if string + 'rawStream': rawStream, } entry['guid'] = guids[entry['guid_index']] entries.append(entry) @@ -46,11 +40,11 @@ def __init__(self, msg): # Parse the names stream. names = {} pos = 0 - while pos < names_stream_length: - name_length = constants.STNP_NAM.unpack(names_stream[pos:pos+4])[0] - pos += 4 # Move to the start of the - names[pos - 4] = names_stream[pos:pos+name_length].decode('utf_16_le') # Names are stored in the dictionary as the position they start at - pos += roundUp(name_length, 4) + while pos < namesStreamLength: + nameLength = constants.STNP_NAM.unpack(namesStream[pos:pos+4])[0] + pos += 4 # Move to the start of the entry. + names[pos - 4] = namesStream[pos:pos+nameLength].decode('utf-16-le') # Names are stored in the dictionary as the position they start at. + pos += roundUp(nameLength, 4) self.entries = entries self.__names = names @@ -59,7 +53,7 @@ def __init__(self, msg): for entry in entries: streamID = properHex(0x8000 + entry['pid']) msg._registerNamedProperty(entry, entry['pkind'], names[entry['id']] if entry['pkind'] == constants.STRING_NAMED else None) - if msg.ExistsTypedProperty(streamID): + if msg.existsTypedProperty(streamID): self.__properties.append(StringNamedProperty(entry, names[entry['id']], msg._getTypedData(streamID)) if entry['pkind'] == constants.STRING_NAMED else NumericalNamedProperty(entry, msg._getTypedData(streamID))) self.__propertiesDict = {} for property in self.__properties: @@ -98,11 +92,11 @@ def getNamedValue(self, propertyName): prop = self.getNamed(propertyName) return prop.data if prop is not None else None - def Exists(self, filename): + def exists(self, filename): """ Checks if stream exists inside the named properties folder. 
""" - return self.__msg.Exists([self.__dir, filename]) + return self.__msg.exists([self.__dir, filename]) def sExists(self, filename): """ @@ -153,7 +147,7 @@ def defineProperty(self, entry, _type, name = None): Informs the class of a named property that needs to be loaded. """ streamID = properHex(0x8000 + entry['pid']).upper() - if self.__attachment.ExistsTypedProperty(streamID)[0]: + if self.__attachment.existsTypedProperty(streamID)[0]: data = self.__attachment._getTypedData(streamID) property = StringNamedProperty(entry, name, data) if _type == constants.STRING_NAMED else NumericalNamedProperty(entry, data) self.__properties.append(property) @@ -183,8 +177,33 @@ def __init__(self, entry, name, data): self.__guidIndex = entry['guid_index'] self.__guid = entry['guid'] self.__namedPropertyID = entry['pid'] - # WARNING From the way the documentation is worded, this SHOULD work, but it doesn't. - self.__streamID = 0x1000 + (zlib.crc32(name.lower().encode('utf-16-le')) ^ (self.__guidIndex << 1 | 1)) % 0x1F + + # Finally got this to be correct after asking about it on a Microsoft + # forum. Apparently it uses the same CRC-32 as the Compressed RTF + # standard does, so we can just use the function defined in the + # compressed-rtf Python module. + # + # First thing to note is that the name should only ever be lowered if it + # is part of the PS_INTERNET_HEADERS property set **AND** it is + # generated by certain versions of Outlook. As such, a little bit of + # additional code will need to run to determine exactly what the stream + # ID should be if it is in that property set. + if self.__guid == constants.GUID_PS_INTERNET_HEADERS: + # To be sure if it needs to be lower the most effective method would + # be to just get the Stream ID and then check if the entry is in + # there. If it isn't, then check the regular case and see. If it is + # not in either... well, we don't use it for anything so it will + # just be a warning, and the Stream ID will be set to 0. 
+ # + # TODO: Unfortunately, doing this will need to be put off until a + # different version, preferably after Python 2 support is removed, + # as this will require restructuring a lot of internal code. For now + # we just assume that it is lowercase. + self.__streamID = 0x1000 + (crc32(name.lower().encode('utf-16-le')) ^ (self.__guidIndex << 1 | 1)) % 0x1F + + else: + # No special logic here to determine what to do. + self.__streamID = 0x1000 + (crc32(name.encode('utf-16-le')) ^ (self.__guidIndex << 1 | 1)) % 0x1F self.__data = data @property @@ -215,6 +234,13 @@ def namedPropertyID(self): """ return self.__namedPropertyID + @property + def rawEntryStream(self): + """ + The raw data used for the entry. + """ + return self.__entry['rawStream'] + @property def streamID(self): """ @@ -237,8 +263,10 @@ def __init__(self, entry, data): self.__propertyID = properHex(entry['id'], 4).upper() self.__guidIndex = entry['guid_index'] self.__namedPropertyID = entry['pid'] + self.__guid = entry['guid'] self.__streamID = 0x1000 + (entry['id'] ^ (self.__guidIndex << 1)) % 0x1F self.__data = data + self.__entry = entry @property def data(self): diff --git a/extract_msg/prop.py b/extract_msg/prop.py index c56fbef8..fdb87f08 100644 --- a/extract_msg/prop.py +++ b/extract_msg/prop.py @@ -1,14 +1,14 @@ import datetime import logging -from extract_msg import constants -from extract_msg.utils import fromTimeStamp, msgEpoch, properHex +from . 
import constants +from .utils import fromTimeStamp, msgEpoch, properHex logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) -def create_prop(string): +def createProp(string): temp = constants.ST2.unpack(string)[0] if temp in constants.FIXED_LENGTH_PROPS: return FixedLengthProp(string) @@ -130,17 +130,21 @@ def parseType(self, _type, stream): # TODO parsing for this pass elif _type == 0x000B: # PtypBoolean - value = bool(constants.ST3.unpack(value)[0]) + value = constants.ST3.unpack(value)[0] == 1 elif _type == 0x0014: # PtypInteger64 value = constants.STI64.unpack(value)[0] elif _type == 0x0040: # PtypTime try: - value = fromTimeStamp(msgEpoch(constants.ST3.unpack(value)[0])) + rawtime = constants.ST3.unpack(value)[0] + if rawtime != 915151392000000000: + value = fromTimeStamp(msgEpoch(rawtime)) + else: + # Temporarily just set to max time to signify a null date. + value = datetime.datetime.max except Exception as e: logger.exception(e) logger.error('Timestamp value of {} caused an exception. This was probably caused by the time stamp being too far in the future.'.format(msgEpoch(constants.ST3.unpack(value)[0]))) logger.error(self.raw) - value = constants.ST3.unpack(value)[0] elif _type == 0x0048: # PtypGuid # TODO parsing for this pass diff --git a/extract_msg/properties.py b/extract_msg/properties.py index 54a97201..e8932e48 100644 --- a/extract_msg/properties.py +++ b/extract_msg/properties.py @@ -2,9 +2,9 @@ import logging import pprint -from extract_msg import constants -from extract_msg.prop import create_prop -from extract_msg.utils import divide, fromTimeStamp, msgEpoch, properHex +from . 
import constants +from .prop import createProp +from .utils import divide, fromTimeStamp, msgEpoch, properHex logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) @@ -51,8 +51,8 @@ def __init__(self, stream, type=None, skip=None): skip = 32 streams = divide(self.__stream[skip:], 16) for st in streams: - a = create_prop(st) - self.__props[a.name] = a + prop = createProp(st) + self.__props[prop.name] = prop self.__pl = len(self.__props) def __contains__(self, key): @@ -114,7 +114,7 @@ def values(self): values.__doc__ = dict.values.__doc__ @property - def attachment_count(self): + def attachmentCount(self): if self.__ac is None: raise TypeError('Properties instance must be intelligent and of type TYPE_MESSAGE to get attachment count.') return self.__ac @@ -151,14 +151,14 @@ def intelligence(self): return self.__intel @property - def next_attachment_id(self): + def nextAttachmentId(self): if self.__naid is None: raise TypeError( 'Properties instance must be intelligent and of type TYPE_MESSAGE to get next attachment id.') return self.__naid @property - def next_recipient_id(self): + def nextRecipientId(self): if self.__nrid is None: raise TypeError( 'Properties instance must be intelligent and of type TYPE_MESSAGE to get next recipient id.') @@ -172,7 +172,7 @@ def props(self): return copy.deepcopy(self.__props) @property - def recipient_count(self): + def recipientCount(self): if self.__rc is None: raise TypeError('Properties instance must be intelligent and of type TYPE_MESSAGE to get recipient count.') return self.__rc diff --git a/extract_msg/recipient.py b/extract_msg/recipient.py index ce2498cf..32b90082 100644 --- a/extract_msg/recipient.py +++ b/extract_msg/recipient.py @@ -1,9 +1,9 @@ import logging -from extract_msg import constants -from extract_msg.data import PermanentEntryID -from extract_msg.properties import Properties -from extract_msg.utils import verifyPropertyId, verifyType +from . 
import constants +from .data import PermanentEntryID +from .properties import Properties +from .utils import verifyPropertyId, verifyType logger = logging.getLogger(__name__) @@ -69,6 +69,17 @@ def _ensureSetProperty(self, variable, propertyName): setattr(self, variable, value) return value + def _ensureSetTyped(self, variable, _id): + """ + Like the other ensure set functions, but designed for when something could be multiple types (where only one will be present). This way you have no need to set the type, it will be handled for you. + """ + try: + return getattr(self, variable) + except AttributeError: + value = self._getTypedData(_id) + setattr(self, variable, value) + return value + def _getStream(self, filename): return self.__msg._getStream([self.__dir, filename]) @@ -145,11 +156,11 @@ def _getTypedStream(self, filename, _type = None): """ self.__msg._getTypedStream(self, [self.__dir, filename], True, _type) - def Exists(self, filename): + def exists(self, filename): """ Checks if stream exists inside the recipient folder. """ - return self.__msg.Exists([self.__dir, filename]) + return self.__msg.exists([self.__dir, filename]) def sExists(self, filename): """ @@ -157,13 +168,13 @@ def sExists(self, filename): """ return self.__msg.sExists([self.__dir, filename]) - def ExistsTypedProperty(self, id, _type = None): + def existsTypedProperty(self, id, _type = None): """ Determines if the stream with the provided id exists. The return of this function is 2 values, the first being a boolean for if anything was found, and the second being how many were found. 
""" - return self.__msg.ExistsTypedProperty(id, self.__dir, _type, True, self.__props) + return self.__msg.existsTypedProperty(id, self.__dir, _type, True, self.__props) @property def account(self): diff --git a/extract_msg/utils.py b/extract_msg/utils.py index 07e264a9..6e325937 100644 --- a/extract_msg/utils.py +++ b/extract_msg/utils.py @@ -14,16 +14,18 @@ import tzlocal -from extract_msg import constants -from extract_msg.compat import os_ as os -from extract_msg.exceptions import ConversionError, IncompatibleOptionsError, InvaildPropertyIdError, UnknownCodepageError, UnknownTypeError, UnrecognizedMSGTypeError +from . import constants +from .compat import os_ as os +from .exceptions import ConversionError, IncompatibleOptionsError, InvaildPropertyIdError, UnknownCodepageError, UnknownTypeError, UnrecognizedMSGTypeError, UnsupportedMSGTypeError logger = logging.getLogger(__name__) logger.addHandler(logging.NullHandler()) logging.addLevelName(5, 'DEVELOPER') if sys.version_info[0] >= 3: # Python 3 - get_input = input + getInput = input + + makeDirs = os.makedirs def properHex(inp, length = 0): """ @@ -41,10 +43,20 @@ def properHex(inp, length = 0): return a.rjust(length, '0').upper() def windowsUnicode(string): - return str(string, 'utf_16_le') if string is not None else None + return str(string, 'utf-16-le') if string is not None else None + + from html import escape as htmlEscape else: # Python 2 - get_input = raw_input + getInput = raw_input + + def makeDirs(name, mode = 0o0777, exist_ok = False): + try: + os.makedirs(name, mode) + except WindowsError as e: + if exist_ok and e.winerror == 183: # Path exists. 
+ return + raise def properHex(inp, length = 0): """ @@ -65,7 +77,9 @@ def properHex(inp, length = 0): return a.rjust(length, '0').upper() def windowsUnicode(string): - return unicode(string, 'utf_16_le') if string is not None else None + return unicode(string, 'utf-16-le') if string is not None else None + + from cgi import escape as htmlEscape def addNumToDir(dirName): """ @@ -74,12 +88,22 @@ def addNumToDir(dirName): for i in range(2, 100): try: newDirName = dirName + ' (' + str(i) + ')' - os.makedirs(newDirName) + makeDirs(newDirName) return newDirName except Exception as e: pass return None +def addNumToZipDir(dirName, _zip): + """ + Attempt to create the directory with a '(n)' appended. + """ + for i in range(2, 100): + newDirName = dirName + ' (' + str(i) + ')' + if not any(x.startswith(newDirName.rstrip('/') + '/') for x in _zip.namelist()): + return newDirName + return None + def bitwiseAdjust(inp, mask): """ Uses a given mask to adjust the location of bits after an operation like @@ -143,18 +167,10 @@ def divide(string, length): """ return [string[length * x:length * (x + 1)] for x in range(int(ceilDiv(len(string), length)))] -def prepareFilename(filename): - """ - Adjusts :param filename: so that it can succesfully be used as an actual - file name. - """ - # I would use re here, but it tested to be slightly slower than this. - return ''.join(i for i in filename if i not in r'\/:*?"<>|' + '\x00') - def fromTimeStamp(stamp): return datetime.datetime.fromtimestamp(stamp, tzlocal.get_localzone()) -def get_command_args(args): +def getCommandArgs(args): """ Parse command-line arguments """ @@ -193,32 +209,28 @@ def get_command_args(args): parser.add_argument('--dump-stdout', dest='dump_stdout', action='store_true', help='Tells the program to dump the message body (plain text) to stdout. Overrides saving arguments.') # --html - #parser.add_argument('--html', dest='html', action='store_true', - # help='Sets whether the output should be html. 
If this is not possible, will error.') + parser.add_argument('--html', dest='html', action='store_true', + help='Sets whether the output should be html. If this is not possible, will error.') + # --raw + parser.add_argument('--raw', dest='raw', action='store_true', + help='Sets whether the output should be html. If this is not possible, will error.') # --rtf - #parser.add_argument('--rtf', dest='rtf', action='store_true', - # help='Sets whether the output should be rtf. If this is not possible, will error.') + parser.add_argument('--rtf', dest='rtf', action='store_true', + help='Sets whether the output should be rtf. If this is not possible, will error.') # --allow-fallback - #parser.add_argument('--allow-fallback', dest='allowFallbac', action='store_true', - # help='Tells the program to fallback to a different save type if the selected one is not possible.') + parser.add_argument('--allow-fallback', dest='allowFallbac', action='store_true', + help='Tells the program to fallback to a different save type if the selected one is not possible.') # --out-name NAME - # parser.add_argument('--out-name', dest = 'out_name', - # help = 'Name to be used with saving the file output. Should come immediately after the file name.') + parser.add_argument('--out-name', dest = 'out_name', + help = 'Name to be used with saving the file output. 
Should come immediately after the file name.') # [msg files] parser.add_argument('msgs', metavar='msg', nargs='+', help='An msg file to be parsed') options = parser.parse_args(args) # Check if more than one of the following arguments has been specified - #valid = 0 - #if options.html: - # valid += 1 - #if options.rtf: - # valid += 1 - #if options.json: - # valid += 1 - #if valid > 1: - # raise IncompatibleOptionsError('Only one of these options may be selected at a time: --html, --rtf, --json') + if options.html + options.rtf + options.json > 1: + raise IncompatibleOptionsError('Only one of these options may be selected at a time: --html, --json, --raw, --rtf') if options.dev or options.file_logging: options.verbose = True @@ -229,8 +241,8 @@ def get_command_args(args): if options.dump_stdout: options.out_path = None options.json = False - #options.rtf = False - #options.html = False + options.rtf = False + options.html = False options.use_filename = False options.cid = False @@ -274,25 +286,165 @@ def getEncodingName(codepage): except LookupError: raise UnsupportedEncodingError('The codepage {} ({}) is not currently supported by your version of Python.'.format(codepage, constants.CODE_PAGES[codepage])) -def get_full_class_name(inp): +def getFullClassName(inp): return inp.__class__.__module__ + '.' + inp.__class__.__name__ -def has_len(obj): +def hasLen(obj): """ Checks if :param obj: has a __len__ attribute. """ - try: - obj.__len__ - return True - except AttributeError: - return False + return hasattr(obj, '__len__') + +def injectHtmlHeader(msgFile): + """ + Returns the HTML body from the MSG file (will check that it has one) with + the HTML header injected into it. 
+ """ + if not hasattr(msgFile, 'htmlBody') or not msgFile.htmlBody: + raise AttributeError('Cannot inject the HTML header without an HTML body attribute.') -def inputToBytes(string_input_var, encoding): - if isinstance(string_input_var, constants.BYTES): - return string_input_var - elif isinstance(string_input_var, constants.STRING): - return string_input_var.encode(encoding) - elif string_input_var is None: + def replace(bodyMarker): + """ + Internal function to replace the body tag with itself plus the header. + """ + return bodyMarker.group() + constants.HTML_INJECTABLE_HEADER.format( + **{ + 'sender': inputToString(htmlEscape(msgFile.sender) if msgFile.sender else '', 'utf-8'), + 'to': inputToString(htmlEscape(msgFile.to) if msgFile.to else '', 'utf-8'), + 'cc': inputToString(htmlEscape(msgFile.cc) if msgFile.cc else '', 'utf-8'), + 'bcc': inputToString(htmlEscape(msgFile.bcc) if msgFile.bcc else '', 'utf-8'), + 'date': inputToString(msgFile.date, 'utf-8'), + 'subject': inputToString(htmlEscape(msgFile.subject), 'utf-8'), + }).encode('utf-8') + + # Use the previously defined function to inject the HTML header. + return constants.RE_HTML_BODY_START.sub(replace, msgFile.htmlBody, 1) + +def injectRtfHeader(msgFile): + """ + Returns the RTF body from the MSG file (will check that it has one) with the + RTF header injected into it. + """ + if not hasattr(msgFile, 'rtfBody') or not msgFile.rtfBody: + raise AttributeError('Cannot inject the RTF header without an RTF body attribute.') + + # Try to determine which header to use. Also determines how to sanitize the + # rtf. + if isEncapsulatedRtf(msgFile.rtfBody): + injectableHeader = constants.RTF_ENC_INJECTABLE_HEADER + def rtfSanitize(inp): + if not inp: + return '' + output = '' + for char in inp: + # Check if it is in the right range to be printed directly. 
+ if 32 <= ord(char) < 128: + if char in ('\\', '{', '}'): + output += '\\' + output += char + elif ord(char) < 32 or 128 <= ord(char) <= 255: + # Otherwise, see if it is just a small escape. + output += "\\'" + properHex(char, 2) + else: + # Handle Unicode characters. + output += '\\u' + str(ord(char)) + '?' + + return output + else: + injectableHeader = constants.RTF_PLAIN_INJECTABLE_HEADER + def rtfSanitize(inp): + if not inp: + return '' + output = '' + for char in inp: + # Check if it is in the right range to be printed directly. + if 32 <= ord(char) < 128: + # Quick check for handling the HTML escapes. Will eventually + # upgrade this code to actually handle all the HTML escapes + # but this will do for now. + if char == '<': + output += r'{\*\htmltag84 <}\htmlrtf <\htmlrtf0 ' + elif char == '>': + output += r'{\*\htmltag84 >}\htmlrtf >\htmlrtf0' + else: + if char in ('\\', '{', '}'): + output += '\\' + output += char + elif ord(char) < 32 or 128 <= ord(char) <= 255: + # Otherwise, see if it is just a small escape. + output += "\\'" + properHex(char, 2) + else: + # Handle Unicode characters. + output += '\\u' + str(ord(char)) + '?' + + return output + + def replace(bodyMarker): + """ + Internal function to replace the body tag with itself plus the header. + """ + return bodyMarker.group() + injectableHeader.format( + **{ + 'sender': inputToString(rtfSanitize(msgFile.sender) if msgFile.sender else '', 'utf-8'), + 'to': inputToString(rtfSanitize(msgFile.to) if msgFile.to else '', 'utf-8'), + 'cc': inputToString(rtfSanitize(msgFile.cc) if msgFile.cc else '', 'utf-8'), + 'bcc': inputToString(rtfSanitize(msgFile.bcc) if msgFile.bcc else '', 'utf-8'), + 'date': inputToString(msgFile.date, 'utf-8'), + 'subject': inputToString(rtfSanitize(msgFile.subject), 'utf-8'), + }).encode('utf-8') + + # Use the previously defined function to inject the RTF header. We are + # trying a few different methods to determine where to place the header. 
+ data = constants.RE_RTF_BODY_START.sub(replace, msgFile.rtfBody, 1) + # If after any method the data does not match the RTF body, then we have + # succeeded. + if data != msgFile.rtfBody: + logger.debug('Successfully injected RTF header using first method.') + return data + + # This second method only applies to encapsulated HTML, so we need to check + # for that first. + if isEncapsulatedRtf(msgFile.rtfBody): + data = constants.RE_RTF_ENC_BODY_START_1.sub(replace, msgFile.rtfBody, 1) + if data != msgFile.rtfBody: + logger.debug('Successfully injected RTF header using second method.') + return data + + # This third method is a lot less reliable, and actually would just + # simply violate the encapsulated html, so for this one we don't even + # try to worry about what the html will think about it. If it injects, + # we swap to basic and then inject again, more worried about it working + # than looking nice inside. + if constants.RE_RTF_ENC_BODY_UGLY.sub(replace, msgFile.rtfBody, 1) != msgFile.rtfBody: + injectableHeader = constants.RTF_PLAIN_INJECTABLE_HEADER + data = constants.RE_RTF_ENC_BODY_UGLY.sub(replace, msgFile.rtfBody, 1) + logger.debug('Successfully injected RTF header using third method.') + return data + + # Severe fallback attempts. 
+ data = constants.RE_RTF_BODY_FALLBACK_FS.sub(replace, msgFile.rtfBody, 1) + if data != msgFile.rtfBody: + logger.debug('Successfully injected RTF header using fourth method.') + return data + + data = constants.RE_RTF_BODY_FALLBACK_F.sub(replace, msgFile.rtfBody, 1) + if data != msgFile.rtfBody: + logger.debug('Successfully injected RTF header using fifth method.') + return data + + data = constants.RE_RTF_BODY_FALLBACK_PLAIN.sub(replace, msgFile.rtfBody, 1) + if data != msgFile.rtfBody: + logger.debug('Successfully injected RTF header using sixth method.') + return data + + raise Exception('All injection attempts failed.') + +def inputToBytes(stringInputVar, encoding): + if isinstance(stringInputVar, constants.BYTES): + return stringInputVar + elif isinstance(stringInputVar, constants.STRING): + return stringInputVar.encode(encoding) + elif stringInputVar is None: return b'' else: raise ConversionError('Cannot convert to BYTES type') @@ -306,22 +458,45 @@ def inputToMsgpath(inp): ret = inputToString(inp, 'utf-8').replace('\\', '/').split('/') return ret if ret[0] != '' else [] -def inputToString(bytes_input_var, encoding): - if isinstance(bytes_input_var, constants.STRING): - return bytes_input_var - elif isinstance(bytes_input_var, constants.BYTES): - return bytes_input_var.decode(encoding) - elif bytes_input_var is None: +def inputToString(bytesInputVar, encoding): + if isinstance(bytesInputVar, constants.STRING): + return bytesInputVar + elif isinstance(bytesInputVar, constants.BYTES): + return bytesInputVar.decode(encoding) + elif bytesInputVar is None: return '' else: raise ConversionError('Cannot convert to STRING type') +def isEncapsulatedRtf(inp): + """ + Currently the detection is *extremely* basic, but this will work + for now. In the future this will be fixed so that literal text in the body + of a message won't cause false detection. 
+ """ + return b'\\fromhtml' in inp + def isEmptyString(inp): """ Returns true if the input is None or is an Empty string. """ return (inp == '' or inp is None) +def knownMsgClass(classType): + """ + Checks if the specified class type is recognized by the module. Usually used + for checking if a type is simply unsupported rather than unknown. + """ + classType = classType.lower() + if classType == 'ipm': + return True + + for item in constants.KNOWN_CLASS_TYPES: + if classType.startswith(item): + return True + + return False + def msgEpoch(inp): """ Taken (with permission) from https://github.com/TheElementalOfDestruction/creatorUtils @@ -339,51 +514,65 @@ def msgpathToString(inp): inp.replace('\\', '/') return inp -def openMsg(path, prefix = '', attachmentClass = None, filename = None, delayAttachments = False, overrideEncoding = None, attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW, strict = True): +def openMsg(path, prefix = '', attachmentClass = None, filename = None, delayAttachments = False, overrideEncoding = None, attachmentErrorBehavior = constants.ATTACHMENT_ERROR_THROW, recipientSeparator = ';', strict = True): """ Function to automatically open an MSG file and detect what type it is. - :param path: path to the msg file in the system or is the raw msg file. - :param prefix: used for extracting embeded msg files + :param path: Path to the msg file in the system or is the raw msg file. + :param prefix: Used for extracting embedded msg files inside the main one. Do not set manually unless you know what you are doing. - :param attachmentClass: optional, the class the Message object + :param attachmentClass: Optional, the class the Message object will use for attachments. You probably should not change this value unless you know what you are doing. - :param filename: optional, the filename to be used by default when saving. 
- :param delayAttachments: optional, delays the initialization of attachments + :param filename: Optional, the filename to be used by default when saving. + :param delayAttachments: Optional, delays the initialization of attachments until the user attempts to retrieve them. Allows MSG files with bad attachments to be initialized so the other data can be retrieved. + :param overrideEncoding: Optional, overrides the specified encoding of the + MSG file. + :param attachmentErrorBehavior: Optional, the behaviour to use in the event + of an error when parsing the attachments. + :param recipientSeparator: Optional, Separator string to use between + recipients. If :param strict: is set to `True`, this function will raise an exception when it cannot identify what MSGFile derivitive to use. Otherwise, it will log the error and return a basic MSGFile instance. + + Raises UnsupportedMSGTypeError and UnrecognizedMSGTypeError. """ - from extract_msg.appointment import Appointment - from extract_msg.attachment import Attachment - from extract_msg.contact import Contact - from extract_msg.message import Message - from extract_msg.msg import MSGFile + from .appointment import Appointment + from .attachment import Attachment + from .contact import Contact + from .message import Message + from .msg import MSGFile attachmentClass = Attachment if attachmentClass is None else attachmentClass msg = MSGFile(path, prefix, attachmentClass, filename, overrideEncoding, attachmentErrorBehavior) - classtype = msg.classType - if classtype.startswith('IPM.Contact') or classtype.startswith('IPM.DistList'): + # After rechecking the docs, all comparisons should be case-insensitive, not case-sensitive. My reading ability is great. 
+ classType = msg.classType.lower() + if classType.startswith('ipm.contact') or classType.startswith('ipm.distlist'): msg.close() return Contact(path, prefix, attachmentClass, filename, overrideEncoding, attachmentErrorBehavior) - elif classtype.startswith('IPM.Note') or classtype.startswith('REPORT'): + elif classType.startswith('ipm.note') or classType.startswith('report'): msg.close() - return Message(path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior) - elif classtype.startswith('IPM.Appointment') or classtype.startswith('IPM.Schedule'): + return Message(path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior, recipientSeparator) + elif classType.startswith('ipm.appointment') or classType.startswith('ipm.schedule'): msg.close() - return Appointment(path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior) + return Appointment(path, prefix, attachmentClass, filename, delayAttachments, overrideEncoding, attachmentErrorBehavior, recipientSeparator) + elif classType == 'ipm': # Unspecified format. It should be equal to this and not just start with it. + return msg elif strict: + ct = msg.classType msg.close() - raise UnrecognizedMSGTypeError('Could not recognize msg class type "{}". It is recommended you report this to the developers.'.format(msg.classType)) + if knownMsgClass(classType): + raise UnsupportedMSGTypeError('MSG type "{}" currently is not supported by the module. If you would like support, please make a feature request.'.format(ct)) + raise UnrecognizedMSGTypeError('Could not recognize msg class type "{}".'.format(ct)) else: - logger.error('Could not recognize msg class type "{}". It is recommended you report this to the developers.'.format(msg.classType)) + logger.error('Could not recognize msg class type "{}". 
This most likely means it hasn\'t been implemented yet, and you should ask the developers to add support for it.'.format(msg.classType)) return msg def parseType(_type, stream, encoding, extras): @@ -398,9 +587,9 @@ def parseType(_type, stream, encoding, extras): WARNING: Not done. Do not try to implement anywhere where it is not already implemented """ - # WARNING Not done. Do not try to implement anywhere where it is not already implemented + # WARNING Not done. Do not try to implement anywhere where it is not already implemented. value = stream - length_extras = len(extras) + lengthExtras = len(extras) if _type == 0x0000: # PtypUnspecified pass elif _type == 0x0001: # PtypNull @@ -423,11 +612,11 @@ def parseType(_type, stream, encoding, extras): return constants.PYTPFLOATINGTIME_START + datetime.timedelta(days = value) elif _type == 0x000A: # PtypErrorCode value = constants.STUI32.unpack(value)[0] - # TODO parsing for this + # TODO parsing for this. # I can't actually find any msg properties that use this, so it should be okay to release this function without support for it. raise NotImplementedError('Parsing for type 0x000A has not yet been implmented. If you need this type, please create a new issue labeled "NotImplementedError: parseType 0x000A"') elif _type == 0x000B: # PtypBoolean - return bool(constants.ST3.unpack(value)[0]) + return constants.ST3.unpack(value)[0] == 1 elif _type == 0x000D: # PtypObject/PtypEmbeddedTable # TODO parsing for this # Wait, that's the extension for an attachment folder, so parsing this might not be as easy as we would hope. The function may be released without support for this. 
@@ -437,9 +626,15 @@ def parseType(_type, stream, encoding, extras): elif _type == 0x001E: # PtypString8 return value.decode(encoding) elif _type == 0x001F: # PtypString - return value.decode('utf_16_le') + return value.decode('utf-16-le') elif _type == 0x0040: # PtypTime - return fromTimeStamp(msgEpoch(constants.ST3.unpack(value)[0])).__format__('%a, %d %b %Y %H:%M:%S %z') + rawtime = constants.ST3.unpack(value)[0] + if rawtime != 915151392000000000: + value = fromTimeStamp(msgEpoch(rawtime)) + else: + # Temporarily just set to max time to signify a null date. + value = datetime.datetime.max + return value elif _type == 0x0048: # PtypGuid return bytesToGuid(value) elif _type == 0x00FB: # PtypServerId @@ -458,9 +653,9 @@ def parseType(_type, stream, encoding, extras): if _type in (0x101F, 0x101E): ret = [x.decode(encoding) for x in extras] lengths = struct.unpack('<{}i'.format(len(ret)), stream) - length_lengths = len(lengths) - if length_lengths > length_extras: - logger.warning('Error while parsing multiple type. Expected {} stream{}, got {}. Ignoring.'.format(length_lengths, 's' if length_lengths > 1 or length_lengths == 0 else '', length_extras)) + lengthLengths = len(lengths) + if lengthLengths > lengthExtras: + logger.warning('Error while parsing multiple type. Expected {} stream{}, got {}. Ignoring.'.format(lengthLengths, 's' if lengthLengths != 1 else '', lengthExtras)) for x, y in enumerate(extras): if lengths[x] != len(y): logger.warning('Error while parsing multiple type. Expected length {}, got {}. Ignoring.'.format(lengths[x], len(y))) @@ -468,9 +663,9 @@ def parseType(_type, stream, encoding, extras): elif _type == 0x1102: ret = copy.deepcopy(extras) lengths = tuple(constants.STUI32.unpack(stream[pos*8:(pos+1)*8])[0] for pos in range(len(stream) // 8)) - length_lengths = len(lengths) - if length_lengths > length_extras: - logger.warning('Error while parsing multiple type. Expected {} stream{}, got {}. 
Ignoring.'.format(length_lengths, 's' if length_lengths > 1 or length_lengths == 0 else '', length_extras)) + lengthLengths = len(lengths) + if lengthLengths > lengthExtras: + logger.warning('Error while parsing multiple type. Expected {} stream{}, got {}. Ignoring.'.format(lengthLengths, 's' if lengthLengths != 1 else '', lengthExtras)) for x, y in enumerate(extras): if lengths[x] != len(y): logger.warning('Error while parsing multiple type. Expected length {}, got {}. Ignoring.'.format(lengths[x], len(y))) @@ -499,49 +694,57 @@ def parseType(_type, stream, encoding, extras): raise NotImplementedError('Parsing for type {} has not yet been implmented. If you need this type, please create a new issue labeled "NotImplementedError: parseType {}"'.format(_type, _type)) return value +def prepareFilename(filename): + """ + Adjusts :param filename: so that it can successfully be used as an actual + file name. + """ + # I would use re here, but it tested to be slightly slower than this. + return ''.join(i for i in filename if i not in r'\/:*?"<>|' + '\x00') + def roundUp(inp, mult): """ Rounds :param inp: up to the nearest multiple of :param mult:. 
""" return inp + (mult - inp) % mult -def setup_logging(default_path=None, default_level=logging.WARN, logfile=None, enable_file_logging=False, +def setupLogging(defaultPath=None, defaultLevel=logging.WARN, logfile=None, enableFileLogging=False, env_key='EXTRACT_MSG_LOG_CFG'): """ Setup logging configuration Args: - default_path (str): Default path to use for the logging configuration file - default_level (int): Default logging level + defaultPath (str): Default path to use for the logging configuration file + defaultLevel (int): Default logging level env_key (str): Environment variable name to search for, for setting logfile path Returns: bool: True if the configuration file was found and applied, False otherwise """ - shipped_config = getContFileDir(__file__) + '/logging-config/' + shippedConfig = getContFileDir(__file__) + '/logging-config/' if os.name == 'nt': null = 'NUL' - shipped_config += 'logging-nt.json' + shippedConfig += 'logging-nt.json' elif os.name == 'posix': null = '/dev/null' - shipped_config += 'logging-posix.json' + shippedConfig += 'logging-posix.json' # Find logging.json if not provided - if not default_path: - default_path = shipped_config + if not defaultPath: + defaultPath = shippedConfig paths = [ - default_path, + defaultPath, 'logging.json', '../logging.json', '../../logging.json', - shipped_config, + shippedConfig, ] path = None - for config_path in paths: - if os.path.exists(config_path): - path = config_path + for configPath in paths: + if os.path.exists(configPath): + path = configPath break value = os.getenv(env_key, None) @@ -550,11 +753,11 @@ def setup_logging(default_path=None, default_level=logging.WARN, logfile=None, e if path is None: print('Unable to find logging.json configuration file') - print('Make sure a valid logging configuration file is referenced in the default_path' + print('Make sure a valid logging configuration file is referenced in the defaultPath' ' argument, is inside the extract_msg install location, or is 
available at one ' 'of the following file-paths:') print(str(paths[1:])) - logging.basicConfig(level=default_level) + logging.basicConfig(level=defaultLevel) logging.warning('The extract_msg logging configuration was not found - using a basic configuration.' 'Please check the extract_msg installation directory for "logging-{}.json".'.format(os.name)) return False @@ -564,12 +767,12 @@ def setup_logging(default_path=None, default_level=logging.WARN, logfile=None, e for x in config['handlers']: if 'filename' in config['handlers'][x]: - if enable_file_logging: + if enableFileLogging: config['handlers'][x]['filename'] = tmp = os.path.expanduser( os.path.expandvars(logfile if logfile else config['handlers'][x]['filename'])) tmp = getContFileDir(tmp) if not os.path.exists(tmp): - os.makedirs(tmp) + makeDirs(tmp) else: config['handlers'][x]['filename'] = null @@ -579,7 +782,7 @@ def setup_logging(default_path=None, default_level=logging.WARN, logfile=None, e print('Failed to configure the logger. 
Did your installation get messed up?') print(e) - logging.getLogger().setLevel(default_level) + logging.getLogger().setLevel(defaultLevel) return True def verifyPropertyId(id): diff --git a/extract_msg/validation.py b/extract_msg/validation.py index 2e472198..05b73e24 100644 --- a/extract_msg/validation.py +++ b/extract_msg/validation.py @@ -1,29 +1,29 @@ import olefile -from extract_msg.message import Message -from extract_msg.utils import get_full_class_name, has_len +from .message import Message +from .utils import getFullClassName, hasLen -def get_email_details(instance, stream): +def getEmailDetails(instance, stream): return { 'exists': instance.sExists(stream), 'not empty': False if not instance.sExists(stream) else len(instance._getStringStream(stream)) > 0, 'valid email address': False if not instance.sExists(stream) else u'@' in instance._getStringStream(stream), } -def get_stream_details(instance, stream): +def getStreamDetails(instance, stream): return { - 'exists': instance.Exists(stream), - 'not empty': False if not instance.Exists(stream) else len(instance._getStream(stream)) > 0, + 'exists': instance.exists(stream), + 'not empty': False if not instance.exists(stream) else len(instance._getStream(stream)) > 0, } -def get_string_details(instance, stream): +def getStringDetails(instance, stream): return { 'exists': instance.sExists(stream), 'not empty': False if not instance.sExists(stream) else len(instance._getStringStream(stream)) > 0, } -def string_FE(instance): +def stringFE(instance): temp = '001E' if instance.mainProperties.has_key('340D0003'): temp = '001F' if instance.mainProperties['340D0003'].value & 0x40000 else '001E' @@ -35,58 +35,58 @@ def string_FE(instance): def validate(msg): - validation_dict = { + validationDict = { 'input': { - 'class': get_full_class_name(msg), # Get the full name of the class - 'has_len': has_len(msg), # Does the input have a __len__ attribute? 
- 'len': len(msg) if has_len(msg) else None, # If input has __len__, put the value here + 'class': getFullClassName(msg), # Get the full name of the class + 'hasLen': hasLen(msg), # Does the input have a __len__ attribute? + 'len': len(msg) if hasLen(msg) else None, # If input has __len__, put the value here }, 'olefile': { 'valid': olefile.isOleFile(msg), }, } - if validation_dict['olefile']['valid']: - validation_dict['message'] = { + if validationDict['olefile']['valid']: + validationDict['message'] = { 'initializes': False, } try: - msg_instance = Message(msg) + msgInstance = Message(msg) except NotImplementedError: # Should we have a special procedure for handling it if we get "not implemented"? pass except: pass else: - validation_dict['message']['initializes'] = True - validation_dict['message']['msg'] = validate_msg(msg_instance) - return validation_dict + validationDict['message']['initializes'] = True + validationDict['message']['msg'] = validateMsg(msgInstance) + return validationDict -def validate_attachment(instance): +def validateAttachment(instance): temp = { - 'long filename': get_string_details(instance, '__substg1.0_3707'), - 'short filename': get_string_details(instance, '__substg1.0_3704'), - 'content id': get_string_details(instance, '__substg1.0_3712'), + 'long filename': getStringDetails(instance, '__substg1.0_3707'), + 'short filename': getStringDetails(instance, '__substg1.0_3704'), + 'content id': getStringDetails(instance, '__substg1.0_3712'), 'type': instance.type, } if temp['type'] == 'msg': - temp['msg'] = validate_msg(instance.data) + temp['msg'] = validateMsg(instance.data) return temp -def validate_msg(instance): +def validateMsg(instance): return { - '001F/001E': string_FE(instance), - 'header': get_string_details(instance, '__substg1.0_007D'), - 'body': get_string_details(instance, '__substg1.0_1000'), - 'html body': get_stream_details(instance, '__substg1.0_10130102'), - 'rtf body': get_stream_details(instance, 
'__substg1.0_10090102'), + '001F/001E': stringFE(instance), + 'header': getStringDetails(instance, '__substg1.0_007D'), + 'body': getStringDetails(instance, '__substg1.0_1000'), + 'html body': getStreamDetails(instance, '__substg1.0_10130102'), + 'rtf body': getStreamDetails(instance, '__substg1.0_10090102'), 'date': instance.date, - 'attachments': {x: validate_attachment(y) for x, y in enumerate(instance.attachments)}, - 'recipients': {x: validate_recipient(y) for x, y in enumerate(instance.recipients)}, + 'attachments': {x: validateAttachment(y) for x, y in enumerate(instance.attachments)}, + 'recipients': {x: validateRecipient(y) for x, y in enumerate(instance.recipients)}, } -def validate_recipient(instance): +def validateRecipient(instance): return { 'type': instance.type, - 'stream 3003': get_email_details(instance, '__substg1.0_3003'), - 'stream 39FE': get_email_details(instance, '__substg1.0_39FE'), + 'stream 3003': getEmailDetails(instance, '__substg1.0_3003'), + 'stream 39FE': getEmailDetails(instance, '__substg1.0_39FE'), } diff --git a/tests.py b/tests.py index a9d14394..a2db26fd 100644 --- a/tests.py +++ b/tests.py @@ -18,7 +18,7 @@ def setUp(self): tearDown = setUp def test_message(self): - msg = base.Message(TEST_FILE) + msg = extract_msg.Message(TEST_FILE) self.assertEqual(msg.subject, u'Test for TIF files') self.assertEqual( msg.body, @@ -35,7 +35,7 @@ def test_message(self): msg.debug() def test_save(self): - msg = base.Message(TEST_FILE) + msg = extract_msg.Message(TEST_FILE) msg.save() self.assertEqual( sorted(os.listdir('2013-11-18_1026 Test for TIF files')), @@ -44,7 +44,7 @@ def test_save(self): msg.saveRaw() def test_saveRaw(self): - msg = base.Message(TEST_FILE) + msg = extract_msg.Message(TEST_FILE) msg.saveRaw() assert os.listdir('raw')