PDF is not converted to Markdown as expected #1117
Comments
Thanks for the report. Unfortunately, PDF conversion is pretty rudimentary right now (using pdfminer.six under the hood). It would be good to upgrade this. I'm looking into ways to accomplish this locally. |
FYI |
Thank you, but that project is not the same: it does not return Markdown. |
Yes... probably because PDF is a very difficult format to handle. So I developed my own tool that supports all formats, including PDFs and images, using a vision language model on the backend. Feel free to let me know your experience: https://github.com/RoffyS/MarkEverythingDown |
Thanks. I agree that we need better PDF handling support and that better options exist. The MarkItDown project is originally an offshoot of the FileSurferAgent in Magentic-One (part of AutoGen), so PDF support was originally developed only as far as was needed to expediently get the bulk of a PDF's text into the LLM context window. But the project has grown since then, and we should clearly revisit this. In the meantime, if you still want to use MarkItDown for file format detection, or for other formats, it may well be worth wrapping a more sophisticated PDF parser in a MarkItDown plugin, which will take precedence over the built-in converter. As we think about better PDF parsing, I also want to strike a balance between fidelity and the size of the dependencies, which is one reason why we don't currently include easyocr etc. (though we were using it before). I may revisit this, now that we've moved to a model of having many optional extras. Also, newer models like Phi-4 multimodal can handle many tasks like transcription and image captioning, and so might just be generally very useful. |
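To illustrate the "plugin takes precedence" idea from the comment above, here is a minimal, self-contained sketch of priority-based converter dispatch. It does not use MarkItDown's actual plugin API: `ConverterRegistry`, `basic_pdf_converter`, and `rich_pdf_converter` are hypothetical names invented for this example, and the "rich" converter returns canned output where a real plugin would call a more capable PDF parser.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical converter signature (an assumption for this sketch):
# take raw bytes, return Markdown, or return None to decline the input.
Converter = Callable[[bytes], Optional[str]]


@dataclass(order=True)
class Registration:
    priority: float
    # Excluded from comparisons so sorting only looks at the priority.
    converter: Converter = field(compare=False)


class ConverterRegistry:
    """Tries converters in ascending priority order (lower value first)."""

    def __init__(self) -> None:
        self._registrations: List[Registration] = []

    def register(self, converter: Converter, priority: float = 0.0) -> None:
        self._registrations.append(Registration(priority, converter))
        # list.sort is stable, so equal priorities keep registration order.
        self._registrations.sort()

    def convert(self, data: bytes) -> str:
        for registration in self._registrations:
            result = registration.converter(data)
            if result is not None:
                return result
        raise ValueError("no converter accepted the input")


def basic_pdf_converter(data: bytes) -> Optional[str]:
    # Stand-in for a plain text-extraction fallback.
    if not data.startswith(b"%PDF"):
        return None
    return "plain text fallback"


def rich_pdf_converter(data: bytes) -> Optional[str]:
    # Stand-in for a plugin wrapping a more sophisticated parser.
    if not data.startswith(b"%PDF"):
        return None
    return "# Heading\n\n| a | b |\n|---|---|\n| 1 | 2 |"


registry = ConverterRegistry()
registry.register(basic_pdf_converter, priority=10.0)
registry.register(rich_pdf_converter, priority=0.0)  # tried first
```

With both converters registered, `registry.convert(b"%PDF-1.7")` returns the output of `rich_pdf_converter`, while the basic converter remains available as a fallback for inputs the richer one declines.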
I will check whether I can write a vision model API wrapper as a MarkItDown plugin. I was playing around with OCR models, which most PDF-to-Markdown solutions use. It turns out that even the latest OCR models fail to fully capture the positional relationships between elements in a document, let alone caption images. IMO integrating a multimodal LLM is definitely the right path 👍 |
Were you able to find or use a plugin/tool for converting PDFs into Markdown format? I'm currently in a similar situation and looking for a way to efficiently convert PDF content into Markdown, with a particular focus on preserving table formatting. |
Maybe docling? - which I haven't tried out yet. |
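On the table-formatting concern raised above: whichever parser extracts the cells, the last step is rendering them as a Markdown pipe table. The sketch below shows only that step, assuming the rows have already been extracted as lists of strings. `rows_to_markdown_table` is a hypothetical helper written for this example and is not part of MarkItDown, docling, or any other tool mentioned in this thread.

```python
from typing import List, Sequence


def rows_to_markdown_table(header: Sequence[str],
                           rows: Sequence[Sequence[str]]) -> str:
    """Render a header and data rows as a Markdown pipe table."""

    def format_row(cells: Sequence[str]) -> str:
        # Escape literal pipes and flatten newlines so a cell cannot
        # break the table structure.
        cleaned = [
            str(cell).replace("|", "\\|").replace("\n", " ").strip()
            for cell in cells
        ]
        return "| " + " | ".join(cleaned) + " |"

    width = len(header)
    lines: List[str] = [
        format_row(header),
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows:
        # Truncate long rows and pad short ones to the header width.
        padded = list(row)[:width] + [""] * (width - len(row))
        lines.append(format_row(padded))
    return "\n".join(lines)


table = rows_to_markdown_table(
    ["Name", "Qty"],
    [["Apple", "3"], ["Pear"]],
)
```

Short rows are padded with empty cells and long rows are truncated so that every line has the same number of columns, which most Markdown renderers expect.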
To reproduce:
Install 'markitdown[all]~=0.1.0a1'
Test with this file Untitled 1.pdf
Use this code:
And it returns
Not as Expected: