Bug: Incorrect parsing of Unicode smart quotes from .docx files
When using MarkItDown to convert .docx files created by Microsoft Word (default settings, smart quotes enabled), Unicode characters such as:
- Apostrophes (’ U+2019)
- Left double quotes (“ U+201C)
- Right double quotes (” U+201D)
are incorrectly parsed and appear in the Markdown output as corrupted characters like Æ, ô, ö.
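One possible explanation, not confirmed against MarkItDown's code: the three corrupted characters are exactly what results when the smart quotes are written as Windows-1252 bytes and those bytes are then read back as code page 437 (the legacy OEM console code page on many Windows installs). The sketch below only demonstrates that mapping. The helper name `mojibake` is hypothetical, and the choice of code pages is an assumption about this environment.

```python
# Hypothetical diagnostic, not part of MarkItDown.
# Assumption: the output was written as cp1252 and displayed as cp437.
SMART = "\u2019\u201c\u201d"  # right single quote, left and right double quotes


def mojibake(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one code page and decode the same bytes with another."""
    return text.encode(written_as).decode(read_as)


garbled = mojibake(SMART, "cp1252", "cp437")

# Compare against U+00C6 (Æ), U+00F4 (ô), U+00F6 (ö) from the report.
print(garbled == "\u00c6\u00f4\u00f6")
```

If this is what is happening, the text extracted from the .docx may be intact and the corruption may be introduced when the output is written to the console or redirected to a file.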
Steps to Reproduce:
- Create a new .docx in Word with smart quotes enabled (default setting).
- Add text such as:
It’s important to “quote” text properly.
- Run MarkItDown to convert the .docx to .md.
- Observe corrupted characters in the output.
Expected Behavior:
Smart punctuation should either:
- Be preserved correctly as Unicode characters, or
- Be flattened gracefully to ASCII equivalents (' and ").
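As a rough sketch of the second option, flattening could be done with a translation table applied to the converted text. The function name `flatten_smart_quotes` is hypothetical and not an existing MarkItDown API. The left single quote (U+2018) is included as an assumption, since Word produces it alongside the three characters listed above.

```python
# Hypothetical post-processing helper, not an existing MarkItDown function.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'",  # left single quote (assumed to need the same treatment)
    "\u2019": "'",  # right single quote / apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
})


def flatten_smart_quotes(text: str) -> str:
    """Replace smart quotes with their plain ASCII equivalents."""
    return text.translate(ASCII_EQUIVALENTS)


sample = "It\u2019s important to \u201cquote\u201d text properly."
flattened = flatten_smart_quotes(sample)
print(flattened)
```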
Actual Behavior:
Corrupted non-ASCII characters appear in Markdown.
Workarounds:
- Disabling smart quotes in Word avoids the issue.
- Alternative tools like Pandoc handle .docx smart punctuation correctly.
Environment:
- MarkItDown version: 0.1.1
- Python version: 3.12
- OS: Windows 11