It not convert pdf to markdown as expected #1117

phamxtien opened this issue Mar 11, 2025 · 8 comments

@phamxtien
phamxtien commented Mar 11, 2025

Steps to reproduce:

Install 'markitdown[all]~=0.1.0a1'.

Test with this file: Untitled 1.pdf

Use this code:

from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)
result = md.convert('Untitled 1.pdf')
print(result.text_content)

It returns:

TEST

Hello, how are you?

Table 1: ABC

TT

1 Nội dung abc

2 Nội dung cde

Nội dung

Ghi chú

Ghi chú 1

Ghi chú 2

Trang 1/1

This is not what I expected. Expected output:

# TEST

Hello, how are you?

**Table 1: ABC**

| TT | Nội dung | Ghi chú |
| --- | --- | --- |
| 1 | Nội dung abc | Ghi chú 1 |
| 2 | Nội dung cde | Ghi chú 2 |

@afourney
Member

Thanks for the report. Unfortunately, PDF conversion is pretty rudimentary right now (using pdfminer.six under the hood). It would be good to upgrade this. I'm looking into ways to accomplish this locally.
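
To illustrate why plain text extraction loses the table in the report: rebuilding rows needs the position of each text fragment, not just its text. Below is a minimal sketch, assuming a hypothetical list of `(x, y, text)` fragments such as a layout-aware parser might provide. The function name, the coordinates, and the `y_tolerance` parameter are made up for illustration and are not part of MarkItDown or pdfminer.six.

```python
from typing import List, Tuple

# Hypothetical fragment format: (x, y, text), with y growing downward.
Fragment = Tuple[float, float, str]


def fragments_to_markdown_table(fragments: List[Fragment], y_tolerance: float = 2.0) -> str:
    """Group fragments into rows by y, order cells by x, and emit a Markdown table."""
    rows: List[Tuple[float, List[Tuple[float, str]]]] = []
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        # A fragment joins the current row if its y is close to the row's first y.
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))

    lines: List[str] = []
    for index, (_, cells) in enumerate(rows):
        ordered = [text for _, text in sorted(cells)]
        lines.append("| " + " | ".join(ordered) + " |")
        if index == 0:
            # Markdown tables need a delimiter row right after the header.
            lines.append("|" + " --- |" * len(ordered))
    return "\n".join(lines)


# Made-up coordinates, listed column by column to mimic the scrambled reading order.
fragments = [
    (50, 100, "TT"), (50, 120, "1"), (50, 140, "2"),
    (100, 100, "Nội dung"), (100, 120, "Nội dung abc"), (100, 140, "Nội dung cde"),
    (250, 100, "Ghi chú"), (250, 120, "Ghi chú 1"), (250, 140, "Ghi chú 2"),
]
print(fragments_to_markdown_table(fragments))
```

This only handles a clean grid with one fragment per cell. Real PDFs also have merged cells, wrapped lines, and text outside the table, which is part of what makes the problem hard.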

@carlkl
carlkl commented Mar 14, 2025

FYI
The kreuzberg project has the same objective. PDF conversion in this project is performed with the help of pdfium2 or OCR (tesseract).

@phamxtien
Author

> FYI
> The kreuzberg project has the same objective. PDF conversion in this project is performed with the help of pdfium2 or OCR (tesseract).

Thank you, but that project is not the same: it does not return Markdown.

@RoffyS
RoffyS commented Mar 16, 2025

Yes... probably because PDF is a very difficult format to handle. So I developed my own tool that supports all formats, including PDFs and images, using a vision language model as the backend.

Feel free to let me know your experience

https://github.com/RoffyS/MarkEverythingDown

@afourney
Member

Thanks. I agree that we need better PDF handling support and that better options exist.

The MarkItDown project was originally an offshoot of the FileSurferAgent in Magentic-One (part of AutoGen), so PDF support was developed only as far as was needed to expediently get the bulk of a PDF's text into the LLM context window. But the project has grown since then, and we should clearly revisit this.

In the meantime, if you still want to use MarkItDown for file format detection, or for other formats, it may well be worth wrapping a more sophisticated PDF parser in a MarkItDown plugin, which will take precedence.
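
The precedence idea can be sketched roughly as follows. This is a standalone illustration and NOT MarkItDown's actual plugin API: `ConverterRegistry`, `builtin_pdf_converter`, and `plugin_pdf_converter` are hypothetical names, and the assumption that the most recently registered converter is tried first is only there to show how a plugin could shadow a built-in converter.

```python
from typing import Callable, List, Optional

# Hypothetical converter type: returns Markdown, or None if it does not handle the file.
Converter = Callable[[str], Optional[str]]


class ConverterRegistry:
    """Toy registry (not MarkItDown's API): later registrations are tried first."""

    def __init__(self) -> None:
        self._converters: List[Converter] = []

    def register(self, converter: Converter) -> None:
        # Insert at the front so newer converters take precedence.
        self._converters.insert(0, converter)

    def convert(self, path: str) -> str:
        for converter in self._converters:
            result = converter(path)
            if result is not None:
                return result
        raise ValueError(f"No converter accepted {path!r}")


def builtin_pdf_converter(path: str) -> Optional[str]:
    # Stand-in for a rudimentary text-only PDF converter.
    return "plain text only" if path.lower().endswith(".pdf") else None


def plugin_pdf_converter(path: str) -> Optional[str]:
    # Stand-in for a plugin wrapping a more sophisticated PDF parser.
    return "| structured | table |" if path.lower().endswith(".pdf") else None


registry = ConverterRegistry()
registry.register(builtin_pdf_converter)
registry.register(plugin_pdf_converter)  # registered later, so it is tried first
print(registry.convert("Untitled 1.pdf"))
```

For the real plugin interface, check the MarkItDown repository's plugin documentation, since the exact entry points and signatures may differ between releases.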

As we think about better PDF parsing, I also want to strike a balance between fidelity and the size of the dependencies, which is one reason why we don't currently include easyocr etc. (though we were using it before). I may revisit this now that we've moved to a model of having many optional extras. Also, newer models like Phi-4 multimodal can handle many tasks like transcription and image captioning, and so might be generally very useful.

@RoffyS
RoffyS commented Mar 16, 2025

I will check whether I can write a vision model API wrapper as a MarkItDown plugin. I was playing around with OCR models, which most PDF-to-Markdown solutions use. It turns out that even the latest OCR models fail to fully capture the positional relationships of elements in the document, let alone captioning. IMO integrating a multimodal LLM is definitely the right path 👍

@aradhanachaturvedi

Were you able to find or use a plugin or tool for converting PDFs into Markdown? I'm currently in a similar situation and looking for a way to efficiently convert PDF content into Markdown, with a particular focus on preserving table formatting.

@carlkl
carlkl commented Apr 25, 2025

Maybe docling? I haven't tried it out yet.
Or this link: https://docling-project.github.io/docling/
