Incorrect parsing of Unicode smart quotes from `.docx` files #1219

MacroPythonista · 2025-04-27T05:42:16Z

Bug: Incorrect parsing of Unicode smart quotes from `.docx` files

When using MarkItDown to convert .docx files created by Microsoft Word (default settings, smart quotes enabled), Unicode characters such as:

- Apostrophes (’ U+2019)
- Left double quotes (“ U+201C)
- Right double quotes (” U+201D)

are incorrectly parsed and appear in the Markdown output as corrupted characters like Æ, ô, ö.
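A possible explanation, offered as a hypothesis rather than a confirmed diagnosis: the three corrupted characters are exactly what results when the smart quotes are encoded as Windows-1252 and those bytes are then interpreted as code page 437 (the legacy Windows console code page). The sketch below only shows that the byte mapping matches the symptoms; it does not show where in MarkItDown or the terminal this would happen.

```python
# Hypothesis only: smart quotes stored as Windows-1252 bytes,
# then decoded or displayed as code page 437.
smart = "\u2019\u201c\u201d"      # ’ “ ”
raw = smart.encode("cp1252")      # b'\x92\x93\x94'
garbled = raw.decode("cp437")     # reinterpret the same bytes as CP437
print(garbled)                    # Æôö
```

If this is what is happening, the Markdown text itself may be intact and only the output or console encoding is wrong, so writing the result to a file and opening it as UTF-8 would be worth checking.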

Steps to Reproduce:

1. Create a new .docx in Word with smart quotes enabled (default setting).
2. Add text such as: It’s important to “quote” text properly.
3. Run MarkItDown to convert the .docx to .md.
4. Observe corrupted characters in the output.

Expected Behavior:
Smart punctuation should either:

- Be preserved correctly as Unicode characters, or
- Be flattened gracefully to ASCII equivalents (' and ").
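As an illustration of the second option, here is a minimal sketch of flattening smart punctuation to ASCII as a post-processing step on already-converted text. The helper name `flatten_smart_quotes` and the mapping table are hypothetical and not part of MarkItDown.

```python
# Hypothetical post-processing helper; not a MarkItDown API.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

def flatten_smart_quotes(text: str) -> str:
    """Replace curly quotes with their straight ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)

print(flatten_smart_quotes("It\u2019s important to \u201cquote\u201d text properly."))
# It's important to "quote" text properly.
```

This only helps if the text was decoded correctly in the first place; it cannot repair characters that are already corrupted.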

Actual Behavior:
Corrupted non-ASCII characters appear in Markdown.

Workarounds:

- Disabling smart quotes in Word avoids the issue.
- Alternative tools like Pandoc handle .docx smart punctuation correctly.

Environment:

- MarkItDown version: 0.1.1
- Python version: 3.12
- OS: Windows 11

ThioJoe · 2025-05-14T23:07:41Z

Edit: OK, I was able to solve it, in my case at least. I was using Docker, and since the data is passed in via stdin instead of as a file, MarkItDown didn't have a file extension to help determine the data type. Including the --extension parameter, like --extension html, made it work.

Strangely I couldn't reproduce the issue with docx, so not sure what the issue is there.

Also got this from an HTML file, specifically this one:
https://github.com/rainmeter/rainmeter-docs/blob/a9e9c49dc6276ede2c21f4ed5f31703ef0d50c75/source/manual/lua-scripting/inline-lua.html

It has a single right quote in there that causes this error, then outputs a totally empty file.

Traceback (most recent call last):
  File "/usr/local/bin/markitdown", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/usr/local/lib/python3.13/site-packages/markitdown/__main__.py", line 191, in main
    result = markitdown.convert_stream(
        sys.stdin.buffer,
        stream_info=stream_info,
        keep_data_uris=args.keep_data_uris,
    )
  File "/usr/local/lib/python3.13/site-packages/markitdown/_markitdown.py", line 374, in convert_stream
    return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/markitdown/_markitdown.py", line 613, in _convert
    raise FileConversionException(attempts=failed_attempts)
markitdown._exceptions.FileConversionException: File conversion failed after 1 attempts:
 - PlainTextConverter threw UnicodeDecodeError with message: 'ascii' codec can't decode byte 0xe2 in position 14544: ordinal not in range(128)

What's weird is if I copy the word into its own text file like:

It’s

It doesn't seem to cause the error 🤔. Perhaps it's incorrectly determining the data type.

When I use the -c UTF-8 argument, it doesn't throw the error, but it outputs the original file unchanged. When I use -m text/html, it converts correctly.
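For context on the traceback: 0xe2 is the first byte of the UTF-8 encoding of ’ (U+2019), so the error is what you would expect when UTF-8 input is decoded as ASCII. The sketch below reproduces that with the standard library alone. It says nothing about how MarkItDown chooses a charset for stdin input, which is still an open question in this thread.

```python
# UTF-8 encodes U+2019 as the three bytes E2 80 99.
data = "It\u2019s".encode("utf-8")
print(data)  # b'It\xe2\x80\x99s'

error = None
try:
    data.decode("ascii")          # ASCII only covers bytes 0x00-0x7F
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

print(data.decode("utf-8"))       # decodes cleanly as UTF-8
```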
