Incorrect parsing of Unicode smart quotes from .docx files #1219

Open
MacroPythonista opened this issue Apr 27, 2025 · 1 comment

Bug: Incorrect parsing of Unicode smart quotes from .docx files

When using MarkItDown to convert .docx files created by Microsoft Word (default settings, smart quotes enabled), Unicode characters such as:

  • Apostrophes (’ U+2019)
  • Left double quotes (“ U+201C)
  • Right double quotes (” U+201D)

are incorrectly parsed and appear in the Markdown output as corrupted characters like Æ, ô, ö.
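
One possible explanation for these exact characters, offered as an assumption and not confirmed anywhere in this thread: each smart quote is a single byte in Windows-1252, and the same byte read as code page 437 (a legacy Windows console code page) yields Æ, ô, and ö. The helper name `misread` is hypothetical and only illustrates the round trip; it says nothing about where in the pipeline such a mismatch might occur.

```python
# Assumption: the smart quotes end up as Windows-1252 bytes and are then
# displayed by something that interprets those bytes as code page 437.
smart_punctuation = {
    "apostrophe": "\u2019",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
}


def misread(char: str) -> str:
    """Encode one character as Windows-1252, then decode that byte as CP437."""
    return char.encode("cp1252").decode("cp437")


for name, char in smart_punctuation.items():
    print(f"{name}: U+{ord(char):04X} -> {misread(char)}")
```

If this is what is happening, the Markdown file itself may be fine and only the display or redirection step is at fault, which would be worth checking by opening the output in a UTF-8-aware editor.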

Steps to Reproduce:

  1. Create a new .docx in Word with smart quotes enabled (default setting).
  2. Add text such as: It’s important to “quote” text properly.
  3. Run MarkItDown to convert the .docx to .md.
  4. Observe corrupted characters in the output.

Expected Behavior:
Smart punctuation should either:

  • Be preserved correctly as Unicode characters, or
  • Be flattened gracefully to ASCII equivalents (' and ").
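
As a rough sketch of the second option, smart punctuation could be flattened in a post-processing step over the Markdown text. This is not part of MarkItDown; the names `SMART_TO_ASCII` and `flatten_smart_quotes` are made up for illustration, and the mapping also covers the left single quote (U+2018), which the report does not mention.

```python
# Hypothetical post-processing helper, not a MarkItDown API.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",  # left single quote (added here as an assumption)
    "\u2019": "'",  # right single quote / apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
})


def flatten_smart_quotes(text: str) -> str:
    """Replace smart quotes with their plain ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)


sample = "It\u2019s important to \u201cquote\u201d text properly."
print(flatten_smart_quotes(sample))
```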

Actual Behavior:
Corrupted non-ASCII characters appear in Markdown.

Workarounds:

  • Disabling smart quotes in Word avoids the issue.
  • Alternative tools like Pandoc handle .docx smart punctuation correctly.

Environment:

  • MarkItDown version: 0.1.1
  • Python version: 3.12
  • OS: Windows 11

ThioJoe commented May 14, 2025

Edit: OK, I was able to solve it in my case at least. I was using Docker, and since the data is passed in via stdin instead of as a file, MarkItDown didn't have the file extension to help determine the data type. Including the --extension parameter (--extension html) made it work.

Strangely, I couldn't reproduce the issue with .docx, so I'm not sure what the issue is there.


I also got this from an HTML file, specifically this one:
https://github.com/rainmeter/rainmeter-docs/blob/a9e9c49dc6276ede2c21f4ed5f31703ef0d50c75/source/manual/lua-scripting/inline-lua.html

It contains a right single quote (’) that causes the error below, and the output is a completely empty file.

Traceback (most recent call last):
  File "/usr/local/bin/markitdown", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/usr/local/lib/python3.13/site-packages/markitdown/__main__.py", line 191, in main
    result = markitdown.convert_stream(
        sys.stdin.buffer,
        stream_info=stream_info,
        keep_data_uris=args.keep_data_uris,
    )
  File "/usr/local/lib/python3.13/site-packages/markitdown/_markitdown.py", line 374, in convert_stream
    return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/markitdown/_markitdown.py", line 613, in _convert
    raise FileConversionException(attempts=failed_attempts)
markitdown._exceptions.FileConversionException: File conversion failed after 1 attempts:
 - PlainTextConverter threw UnicodeDecodeError with message: 'ascii' codec can't decode byte 0xe2 in position 14544: ordinal not in range(128)
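
The decode failure in the last line can be reproduced without MarkItDown. In UTF-8, the right single quote U+2019 is the three bytes e2 80 99, so an ASCII decoder stops at the first of them. The sketch below uses a made-up stand-in for the file (14544 ASCII bytes followed by one quote) and a hypothetical helper `try_decode`; it shows only the codec behavior, not how MarkItDown picked the charset.

```python
# Hypothetical stand-in for the HTML file: ASCII-only content, then one U+2019.
data = b"a" * 14544 + "\u2019".encode("utf-8")  # U+2019 is b"\xe2\x80\x99" in UTF-8


def try_decode(raw: bytes, charset: str) -> str:
    """Return 'ok' if raw decodes with charset, otherwise the error message."""
    try:
        raw.decode(charset)
        return "ok"
    except UnicodeDecodeError as exc:
        return str(exc)


ascii_result = try_decode(data, "ascii")
utf8_result = try_decode(data, "utf-8")
print(ascii_result)
print(utf8_result)
```

The same bytes decode cleanly as UTF-8, which fits the observation that the error goes away once a charset or type hint is supplied.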

What's weird is that if I copy the word into its own text file, like this:

It’s

It doesn't seem to cause the error 🤔. Perhaps it's incorrectly determining the data type.

When I use the -c UTF-8 argument, it doesn't throw the error, but it outputs the original file unchanged. When I use -m text/html, it converts the file correctly.
