Edit: OK, I was able to solve it, in my case at least. I was using Docker, and because the data is passed in via stdin rather than as a file, there was no file extension to help determine the data type. Adding the --extension parameter (for example, --extension html) made it work.
Strangely, I couldn't reproduce the issue with docx, so I'm not sure what the problem is there.
The file has a single right quote in it that causes this error, and the output is then a totally empty file.
Traceback (most recent call last):
  File "/usr/local/bin/markitdown", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/usr/local/lib/python3.13/site-packages/markitdown/__main__.py", line 191, in main
    result = markitdown.convert_stream(
        sys.stdin.buffer,
        stream_info=stream_info,
        keep_data_uris=args.keep_data_uris,
    )
  File "/usr/local/lib/python3.13/site-packages/markitdown/_markitdown.py", line 374, in convert_stream
    return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/markitdown/_markitdown.py", line 613, in _convert
    raise FileConversionException(attempts=failed_attempts)
markitdown._exceptions.FileConversionException: File conversion failed after 1 attempts:
 - PlainTextConverter threw UnicodeDecodeError with message: 'ascii' codec can't decode byte 0xe2 in position 14544: ordinal not in range(128)
What's weird is that if I copy the word into its own text file, like:
It’s
it doesn't seem to cause the error 🤔. Perhaps it's incorrectly determining the data type.
When I use the -c UTF-8 argument, it no longer throws the error, but it outputs the original file unchanged. When I use -m text/html, it converts the file correctly.
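For anyone wondering where byte 0xe2 comes from: the right single quote (U+2019) is encoded in UTF-8 as the three bytes e2 80 99, and the ASCII codec rejects any byte above 0x7f. The sketch below reproduces the same kind of error using only Python's built-in codecs. It does not call markitdown, and it only illustrates the origin of the byte, not why the data was treated as ASCII in the first place.

```python
# Minimal illustration (standard library only, no markitdown involved).
text = "It’s"  # contains U+2019, the right single quotation mark

# UTF-8 encodes U+2019 as three bytes: 0xe2 0x80 0x99.
data = text.encode("utf-8")
print(data)

# Decoding those bytes as ASCII fails on the first byte above 0x7f.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# Decoding with the correct charset round-trips cleanly.
decoded = data.decode("utf-8")
print(decoded)
```

The position reported in the error is the offset of the first non-ASCII byte, which would be consistent with the quote sitting at byte 14544 of the input in the traceback above.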
Bug: Incorrect parsing of Unicode smart quotes from .docx files

When using MarkItDown to convert .docx files created by Microsoft Word (default settings, smart quotes enabled), Unicode characters such as:
- ’ (U+2019)
- “ (U+201C)
- ” (U+201D)
are incorrectly parsed and appear in the Markdown output as corrupted characters like Æ, ô, ö.

Steps to Reproduce:
1. Create a .docx in Word with smart quotes enabled (default setting).
2. Enter text such as: It’s important to “quote” text properly.
3. Convert the .docx to .md.

Expected Behavior:
Smart punctuation should either:
- […]
- […] (' and ").

Actual Behavior:
Corrupted non-ASCII characters appear in Markdown.
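One possible explanation for these particular characters, offered as a hypothesis rather than a confirmed diagnosis: in Windows-1252 the three smart quotes are the single bytes 0x92, 0x93, and 0x94, and in code page 437 (a legacy console code page) those same bytes display as Æ, ô, and ö. The sketch below reproduces that mapping with Python's standard codecs only. It does not call MarkItDown, and whether this encoding mismatch is what actually happens during the reported conversion is an assumption.

```python
# Hypothetical reproduction of the garbling: encode the smart quotes as
# Windows-1252, then misread the resulting bytes as code page 437.
smart_quotes = "’“”"  # U+2019, U+201C, U+201D

raw = smart_quotes.encode("cp1252")
garbled = raw.decode("cp437")

print(raw)      # b'\x92\x93\x94'
print(garbled)  # Æôö
```

If this is the mechanism, the quotes would be read from the .docx correctly and corrupted only when the output is written or displayed, so checking the encoding of the output file or terminal may be worth trying.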
Workarounds:
[…] .docx […] smart punctuation correctly.

Environment: