Incorrect parsing of Unicode smart quotes from `.docx` files #1219

## Bug: Incorrect parsing of Unicode smart quotes from `.docx` files

When using MarkItDown to convert `.docx` files created by Microsoft Word (default settings, smart quotes enabled), Unicode characters such as:

- Apostrophes (`’` U+2019)
- Left double quotes (`“` U+201C)
- Right double quotes (`”` U+201D)

are incorrectly parsed and appear in the Markdown output as corrupted characters like `Æ`, `ô`, `ö`.
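
One possible explanation, offered as an assumption rather than a confirmed diagnosis: these three characters are exactly what results when the smart quotes are encoded as Windows-1252 bytes and those bytes are then read as code page 437, the legacy OEM code page used by many Windows consoles. The sketch below reproduces that mapping in plain Python without MarkItDown; the helper name `mojibake` is hypothetical.

```python
# Hypothesis only: simulate text written as cp1252 but read back as cp437.
# Nothing here calls MarkItDown; it only shows that the observed characters
# are consistent with this particular encoding mismatch.

def mojibake(text: str, written_as: str = "cp1252", read_as: str = "cp437") -> str:
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)

# The sample sentence from the reproduction steps, using escapes so the
# source file's own encoding cannot interfere.
sample = "It\u2019s important to \u201cquote\u201d text properly."
corrupted = mojibake(sample)
```

Under this assumption `corrupted` is `ItÆs important to ôquoteö text properly.`, which matches the characters reported above.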

**Steps to Reproduce:**
1. Create a new `.docx` in Word with smart quotes enabled (default setting).
2. Add text such as: `It’s important to “quote” text properly.`
3. Run MarkItDown to convert the `.docx` to `.md`.
4. Observe corrupted characters in the output.

**Expected Behavior:**
Smart punctuation should either:
- Be preserved correctly as Unicode characters, or
- Be flattened gracefully to ASCII equivalents (`'` and `"`).
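
The second option could look roughly like the following. This is a minimal sketch, not MarkItDown's implementation; the table `SMART_TO_ASCII` and the function `flatten_smart_quotes` are hypothetical names, and the left single quote (U+2018) is included as an assumption because it usually accompanies the characters listed in this report.

```python
# Illustrative, non-exhaustive mapping from smart punctuation to ASCII.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quote (assumed; not listed in the report)
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

def flatten_smart_quotes(text: str) -> str:
    """Replace smart quotes with their plain ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)

flattened = flatten_smart_quotes("It\u2019s important to \u201cquote\u201d text properly.")
```

Here `flattened` is `It's important to "quote" text properly.`; every other character passes through unchanged.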

**Actual Behavior:**
The smart quotes are replaced by corrupted non-ASCII characters (for example `Æ`, `ô`, `ö`) in the Markdown output.

**Workarounds:**
- Disabling smart quotes in Word avoids the issue.
- Alternative tools like Pandoc handle `.docx` smart punctuation correctly.
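
A further workaround that may help, assuming the corruption is introduced when the output is written or displayed with a platform default encoding (the report does not establish this): write the converted text to disk with an explicit UTF-8 encoding. The helper name `save_markdown_utf8` is hypothetical.

```python
from pathlib import Path

def save_markdown_utf8(markdown_text: str, out_path: str) -> None:
    """Write Markdown text as UTF-8 regardless of the platform default encoding."""
    Path(out_path).write_text(markdown_text, encoding="utf-8")
```

In this sketch `markdown_text` would be the converted document, for example the `text_content` attribute of the result returned by `MarkItDown().convert(...)` in the Python API.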

---

**Environment:**
- MarkItDown version: 0.1.1
- Python version: 3.12
- OS: Windows 11

