Merge pull request #40 from Syncurity/75/fix-string-function
75/fix string function
punkrokk authored Apr 18, 2019
2 parents 559889a + c406611 commit 579a02a
Showing 5 changed files with 108 additions and 60 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,9 @@
**v0.23.0**
* [[mattgwwalker #75](https://github.com/Syncurity/msg-extractor/issues/75)] & [[Syncurity #39](https://github.com/Syncurity/msg-extractor/issues/39)] Completely rewrote the function `Message._getStringStream`. This was done for two reasons. The first was to make it actually work with msg files that have their strings encoded in a non-Unicode encoding. The second was to make it better reflect the msg specification, which says that ALL strings in a file will be either Unicode or non-Unicode, but not both. Because of the second reason, the `prefer` option has been removed.
* As part of fixing the two issues in the previous change, we have added two new properties:
1. A boolean `Message.areStringsUnicode`, which tells whether the strings are Unicode encoded.
2. A string `Message.stringEncoding`, which tells what the encoding is. `Message._getStringStream` uses it to determine how to decode the data into a string.

**v0.22.1**
* [[mattgwwalker #69](https://github.com/Syncurity/msg-extractor/issues/69)] Fixed date format not being up to standard.
* Fixed a minor spelling error in the code.
87 changes: 55 additions & 32 deletions README.rst
@@ -22,8 +22,7 @@ Usage
This will produce a new folder named according to the date, time and
subject of the message (for example “2013-07-24_0915 Example”). The
email itself can be found inside the new folder along with the
attachments. As of version 0.2, it is capable of extracting both ASCII
and Unicode data.
attachments.

The script uses Philippe Lagadec’s Python module that reads Microsoft
OLE2 files (also called Structured Storage, Compound File Binary Format
@@ -36,37 +35,41 @@ The script was built using Peter Fiskerstrand’s documentation of the
used within Extended MAPI was also useful. For future reference, I note
that Microsoft have opened up their documentation of the file format.

If you are having difficulty with a specific file, or would like to
extract more than is currently automated, then the –raw flag may be
useful:

#########REWRITE COMMAND LINE USAGE#############
Currently, the README is in the process of being redone. For now, please
refer to the usage information provided from the program's help dialog:
::

python extract_msg --raw example.msg

Further, a –json flag has been added by Joel Kaufman to specify JSON
output:

::

python extract_msg --json example.msg

Joel also added a –use-file-name flag, which allows you to specify that
the script writes the emails’ contents to the names of the .msg files,
rather than using the subject and date to name the folder:

::

python extract_msg --use-file-name example.msg

Creation also added a –use-content-id flag, which allows you to specify
that attachments should be saved under the name of their content id,
should they have one. This can be useful for matching attachments to the
names used in the HTML body, and can be done like so:

::

python extract_msg --use-content-id example.msg
usage: extract_msg [-h] [--use-content-id] [--dev] [--validate] [--json]
[--file-logging] [--verbose] [--log LOG]
[--config CONFIG_PATH] [--out OUT_PATH] [--use-filename]
msg [msg ...]

extract_msg: Extracts emails and attachments saved in Microsoft Outlook's .msg
files. https://github.com/mattgwwalker/msg-extractor

positional arguments:
msg An msg file to be parsed

optional arguments:
-h, --help show this help message and exit
--use-content-id, --cid
Save attachments by their Content ID, if they have
one. Useful when working with the HTML body.
--dev Changes to use developer mode. Automatically enables
the --verbose flag. Takes precedence over the
--validate flag.
--validate Turns on file validation mode. Turns off regular file
output.
--json Changes to write output files as json.
--file-logging Enables file logging. Implies --verbose
--verbose Turns on console logging.
--log LOG Set the path to write the file log to.
--config CONFIG_PATH Set the path to load the logging config from.
--out OUT_PATH Set the folder to use for the program output.
(Default: Current directory)
--use-filename Sets whether the name of each output is based on the
msg filename.

**To use this in your own script**, start by using:

@@ -85,7 +88,7 @@ to the ExtractMsg.Message Method:

::

msg_raw = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x0 ... \x00x00x00'
msg_raw = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00 ... \x00\x00\x00'
msg = extract_msg.Message(msg_raw)

If you want to override the default attachment class and use one of your
@@ -155,8 +158,23 @@ Here is a list of things that are currently on our todo list:
* Provide a way to save attachments and messages into a custom location under a custom name
* Implement better property handling that will convert each type into a python equivalent if possible
* Implement handling of named properties
* Improve README
* Create a wiki for advanced usage information

Credits
-------

`Matthew Walker`_ - Original developer and owner

`Ken Peterson (The Elemental of Creation)`_ - Principal programmer, manager, and msg file "expert"

`JP Bourget`_ - Senior programmer, readability and organization expert, secondary manager

`Philippe Lagadec`_ - Python OleFile module developer

Joel Kaufman - First implementations of the json and filename flags

`Dean Malmgren`_ - First implementation of the setup.py script

.. |License: GPL v3| image:: https://img.shields.io/badge/License-GPLv3-blue.svg
:target: LICENSE.txt
@@ -166,3 +184,8 @@ Here is a list of things that are currently on our todo list:
:target: https://www.python.org/downloads/release/python-2715/
.. |PyPI2| image:: https://img.shields.io/badge/python-3.6+-brightgreen.svg
:target: https://www.python.org/downloads/release/python-367/
.. _Matthew Walker: https://github.com/mattgwwalker
.. _Ken Peterson (The Elemental of Creation): https://github.com/TheElementalOfCreation
.. _JP Bourget: https://github.com/punkrokk
.. _Philippe Lagadec: https://github.com/decalage2
.. _Dean Malmgren: https://github.com/deanmalmgren
4 changes: 2 additions & 2 deletions extract_msg/__init__.py
@@ -27,8 +27,8 @@
# along with this program. If not, see <http://www.gnu.org/licenses/>.

__author__ = 'Matthew Walker & The Elemental of Creation'
__date__ = '2018-12-05'
__version__ = '0.22.1'
__date__ = '2019-04-18'
__version__ = '0.23.0'

from extract_msg import constants
from extract_msg.attachment import Attachment
59 changes: 41 additions & 18 deletions extract_msg/message.py
@@ -150,30 +150,18 @@ def _getStream(self, filename, prefix=True):
logger.info('Stream "{}" was requested but could not be found. Returning `None`.'.format(filename))
return None

def _getStringStream(self, filename, prefer='unicode', prefix=True):
def _getStringStream(self, filename, prefix=True):
"""
Gets a string representation of the requested filename.
Checks for both ASCII and Unicode representations and returns
a value if possible. If there are both ASCII and Unicode
versions, then :param prefer: specifies which will be
returned.
This should ALWAYS return a string (Unicode in python 2)
"""

filename = self.fix_path(filename, prefix)

asciiVersion = self._getStream(filename + '001E', prefix = False)
unicodeVersion = windowsUnicode(self._getStream(filename + '001F', prefix = False))
logger.debug('_getStringStream called for {}. Ascii version found: {}. Unicode version found: {}.'.format(
filename, asciiVersion is not None, unicodeVersion is not None))
if asciiVersion is None:
return unicodeVersion
elif unicodeVersion is None:
return asciiVersion
if self.areStringsUnicode:
return windowsUnicode(self._getStream(filename + '001F', prefix = False))
else:
if prefer == 'unicode':
return unicodeVersion
else:
return asciiVersion
tmp = self._getStream(filename + '001E', prefix = False)
return None if tmp is None else tmp.decode(self.stringEncoding)

@property
def path(self):
@@ -286,6 +274,41 @@ def date(self):
def parsedDate(self):
return email.utils.parsedate(self.date)

@property
def stringEncoding(self):
try:
return self.__stringEncoding
except AttributeError:
# We need to calculate the encoding
# Let's first check if the encoding will be unicode:
if self.areStringsUnicode:
self.__stringEncoding = "utf-16-le"
return self.__stringEncoding
else:
# Well, it's not unicode. Now we have to figure out what it IS.
if not self.mainProperties.has_key('3FFD0003'):
raise Exception('Encoding property not found')
enc = self.mainProperties['3FFD0003'].value
# Now we just need to translate that value
# Now, this next line SHOULD work, but it is possible that it might not...
self.__stringEncoding = str(enc)
return self.__stringEncoding

@property
def areStringsUnicode(self):
"""
Returns a boolean telling if the strings are unicode encoded.
"""
try:
return self.__bStringsUnicode
except AttributeError:
if self.mainProperties.has_key('340D0003'):
if (self.mainProperties['340D0003'].value & 0x40000) != 0:
self.__bStringsUnicode = True
return self.__bStringsUnicode
self.__bStringsUnicode = False
return self.__bStringsUnicode

@property
def sender(self):
"""
12 changes: 4 additions & 8 deletions extract_msg/utils.py
@@ -45,9 +45,7 @@ def properHex(inp):


def windowsUnicode(string):
if string is None:
return None
return str(string, 'utf_16_le')
return str(string, 'utf_16_le') if string is not None else None


def xstr(s):
@@ -80,9 +78,7 @@ def properHex(inp):


def windowsUnicode(string):
if string is None:
return None
return unicode(string, 'utf_16_le')
return unicode(string, 'utf_16_le') if string is not None else None


def xstr(s):
@@ -146,10 +142,10 @@ def get_command_args(args):
help='Changes to write output files as json.')
# --file-logging
parser.add_argument('--file-logging', dest='file_logging', action='store_true',
help='Enables file logging.')
help='Enables file logging. Implies --verbose')
# --verbose
parser.add_argument('--verbose', dest='verbose', action='store_true',
help='Turns on console logging. Implies --verbose')
help='Turns on console logging.')
# --log PATH
parser.add_argument('--log', dest='log',
help='Set the path to write the file log to.')
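
The `stringEncoding` property in the message.py diff turns the value of property `3FFD0003` into an encoding name with `str(enc)`, and its own comment notes that this might not always work. Assuming that value is a Windows code page number (an assumption; the diff does not say so), one possible more defensive translation is sketched below. The helper name and the override table are hypothetical and not part of extract_msg.

```python
import codecs

# Hypothetical overrides for code pages whose Python codec is not named "cp<N>".
_CODEPAGE_OVERRIDES = {
    20127: 'ascii',
    28591: 'latin-1',
    65001: 'utf-8',
}


def codepage_to_encoding(codepage):
    """Map a numeric code page to a Python codec name.

    Raises LookupError if Python has no codec for the code page.
    """
    name = _CODEPAGE_OVERRIDES.get(codepage, 'cp{}'.format(codepage))
    # codecs.lookup validates the name and returns its canonical form.
    return codecs.lookup(name).name
```

Validating the name up front means an unsupported code page fails with a clear `LookupError` at lookup time instead of later, when a stream is decoded.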