Could not convert stream / PDF to Markdown #1134

Open
TorgeStahl opened this issue Mar 16, 2025 · 3 comments

@TorgeStahl

Hey there,

I wanted to generate Markdown from a really long PDF document (roughly 100 pages). A simple print works, but as soon as the document should be converted to Markdown, it raises the error below. Is there now a limitation on the length of a document?

Traceback (most recent call last):
  File "/Users/user/Desktop/Repositories/markitdown/script/markdown.py", line 73, in <module>
    main()
    ~~~~^^
  File "/Users/user/Desktop/Repositories/markitdown/script/markdown.py", line 34, in main
    text = process_file(file_path)
  File "/Users/user/Desktop/Repositories/markitdown/script/markdown.py", line 19, in process_file
    result = md.convert(file_path)
  File "/Users/user/Desktop/Repositories/markitdown/packages/markitdown/src/markitdown/_markitdown.py", line 259, in convert
    return self.convert_local(source, stream_info=stream_info, **kwargs)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/Repositories/markitdown/packages/markitdown/src/markitdown/_markitdown.py", line 310, in convert_local
    return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/Repositories/markitdown/packages/markitdown/src/markitdown/_markitdown.py", line 541, in _convert
    raise UnsupportedFormatException(
        f"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
    )
markitdown._exceptions.UnsupportedFormatException: Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported

@afourney
Member

Thanks for the report. Let's get to the bottom of this.

What version of the library are you using? Did you install it with [all] or at least [pdf]?
Is this a problem with all (e.g., smaller) PDFs? Or just this one?
Are you using the Python library or the command line?

Adding a debug option and more Python logging is on my plate, to better support debugging these types of scenarios.
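For anyone gathering answers to the questions above, a minimal sketch along these lines may help. It uses only the Python standard library. The function name `markitdown_install_report` is made up for illustration, and the check assumes that the [pdf] extra provides the `pdfminer` module (from pdfminer.six), which may differ between markitdown versions.

```python
import importlib.metadata
import importlib.util


def markitdown_install_report():
    """Report the installed markitdown version and whether a PDF backend is importable."""
    try:
        version = importlib.metadata.version("markitdown")
    except importlib.metadata.PackageNotFoundError:
        version = None  # markitdown is not installed in this environment

    # Assumption: the [pdf] extra pulls in pdfminer.six, importable as `pdfminer`.
    pdf_backend = importlib.util.find_spec("pdfminer") is not None

    return {"markitdown_version": version, "pdf_backend_importable": pdf_backend}


if __name__ == "__main__":
    print(markitdown_install_report())
```

If `pdf_backend_importable` comes back `False`, the package was probably installed without the [pdf] or [all] extra.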

@bdnguyen-ds

bdnguyen-ds commented Apr 15, 2025

Seeing the same. Installed markitdown version 0.1.1.

using: "pip install -e packages/markitdown[all]"
returns: "zsh: no matches found: packages/markitdown[all]"
and similarly for [pdf] and other options.

The only install command that didn't fail was this (below), but it leads to something like OP's reported error above when used:

pip install -e packages/markitdown
Obtaining file:///users/name/localpath/somedir/markitdown/packages/markitdown
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Installing backend dependencies ... done
Preparing editable metadata (pyproject.toml) ... done

====================
Traceback:

/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
---------------------------------------------------------------------------
FileConversionException                   Traceback (most recent call last)
Cell In[2], line 2
      1 md = MarkItDown()
----> 2 result = md.convert('../test_report.pdf')

File ~/some-path-to-here/markitdown/packages/markitdown/src/markitdown/_markitdown.py:273, in MarkItDown.convert(self, source, stream_info, **kwargs)
    271         return self.convert_uri(source, stream_info=stream_info, **_kwargs)
    272     else:
--> 273         return self.convert_local(source, stream_info=stream_info, **kwargs)
    274 # Path object
    275 elif isinstance(source, Path):

File ~/some-path-to-here/markitdown/packages/markitdown/src/markitdown/_markitdown.py:327, in MarkItDown.convert_local(self, path, stream_info, file_extension, url, **kwargs)
    323 with open(path, "rb") as fh:
    324     guesses = self._get_stream_info_guesses(
    325         file_stream=fh, base_guess=base_guess
    326     )
--> 327     return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)

File ~/some-path-to-here/markitdown/packages/markitdown/src/markitdown/_markitdown.py:613, in MarkItDown._convert(self, file_stream, stream_info_guesses, **kwargs)
    611 # If we got this far without success, report any exceptions
    612 if len(failed_attempts) > 0:
--> 613     raise FileConversionException(attempts=failed_attempts)
    615 # Nothing can handle it!
    616 raise UnsupportedFormatException(
    617     f"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
    618 )

FileConversionException: File conversion failed after 1 attempts:
 - PdfConverter threw MissingDependencyException with message: PdfConverter recognized the input as a potential .pdf file, but the dependencies needed to read .pdf files have not been installed. To resolve this error, include the optional dependency [pdf] or [all] when installing MarkItDown. For example:

* pip install markitdown[pdf]
* pip install markitdown[all]
* pip install markitdown[pdf, ...]
* etc.

@bdnguyen-ds

Ignore my previous comment; it was a "me" issue. Noting it here in case anyone runs into the same thing: adding quotation marks around the target ('markitdown[all]') allowed a proper install.
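A rough illustration of why the quotes matter: zsh treats the unquoted square brackets as a glob character class and aborts with "no matches found" when no file matches. The sketch below uses Python's `fnmatch` as a stand-in for shell globbing (an approximation, not zsh's actual matcher) and `shlex.quote` to produce a form that a POSIX-style shell passes through literally.

```python
import fnmatch
import shlex

requirement = "packages/markitdown[all]"

# Unquoted, the brackets form a character class: the pattern matches
# "packages/markitdown" followed by exactly one of the characters a or l,
# so it does not match its own literal text.
matches_literal = fnmatch.fnmatchcase(requirement, requirement)
matches_one_char = fnmatch.fnmatchcase("packages/markitdowna", requirement)

# shlex.quote wraps the string in single quotes because "[" is shell-special.
quoted = shlex.quote(requirement)

print(matches_literal)   # False
print(matches_one_char)  # True
print(f"pip install -e {quoted}")  # pip install -e 'packages/markitdown[all]'
```

The same quoting applies to the [pdf] extra and to installs from PyPI, for example `pip install 'markitdown[pdf]'`.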
