It does not convert PDF to Markdown as expected #1117

phamxtien · 2025-03-11T11:39:24Z

To reproduce:

Install 'markitdown[all]~=0.1.0a1'.

Test with this file: Untitled 1.pdf

Use this code:

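from markitdown import MarkItDown

# Plugins are disabled, so only the built-in converters are used
md = MarkItDown(enable_plugins=False)
# Path to the attached test file
result = md.convert('Untitled 1.pdf')
print(result.text_content)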

It returns:

TEST

Hello, how are you?

Table 1: ABC

TT

1 Nội dung abc

2 Nội dung cde

Nội dung

Ghi chú

Ghi chú 1

Ghi chú 2

Trang 1/1

This is not what I expected. Expected output:

# TEST

Hello, how are you?

**Table 1: ABC**

|TT|Nội dung|Ghi chú|
|---|---|---|
|1|Nội dung abc|Ghi chú 1|
|2|Nội dung cde|Ghi chú 2|
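
For reference, a minimal sketch of how rows like the ones above map onto a Markdown pipe table, including the separator row that GitHub-flavored Markdown requires after the header. The helper name `to_markdown_table` is hypothetical and is not part of MarkItDown. It assumes the table cells have already been recovered from the PDF, which is the hard part this issue is about.

```python
def to_markdown_table(rows):
    """Render rows (first row is the header) as a Markdown pipe table.

    Hypothetical helper for illustration only; not a MarkItDown API.
    """
    def render(cells):
        # Escape literal pipes so they do not split a cell in two
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    lines = [render(header), render(["---"] * len(header))]
    lines += [render(row) for row in body]
    return "\n".join(lines)


rows = [
    ["TT", "Nội dung", "Ghi chú"],
    ["1", "Nội dung abc", "Ghi chú 1"],
    ["2", "Nội dung cde", "Ghi chú 2"],
]
print(to_markdown_table(rows))
```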


afourney · 2025-03-11T14:47:59Z

Thanks for the report. Unfortunately, PDF conversion is pretty rudimentary right now (using pdfminer.six under the hood). It would be good to upgrade this. I'm looking into ways to accomplish this locally.

carlkl · 2025-03-14T09:17:17Z

FYI
The kreuzberg project has the same objective. PDF conversion in this project is performed with the help of pdfium2 or OCR (tesseract).

phamxtien · 2025-03-14T10:45:38Z

> FYI
> The kreuzberg project has the same objective. PDF conversion in this project is performed with the help of pdfium2 or OCR (tesseract).

Thank you, but that project is not the same: it does not return Markdown.

RoffyS · 2025-03-16T07:16:40Z

Yes... probably because PDF is a very difficult format to handle. So I developed my own tool that supports all formats, including PDFs and images, using a vision language model at the backend.

Feel free to let me know about your experience.

https://github.com/RoffyS/MarkEverythingDown

afourney · 2025-03-16T16:04:37Z

Thanks. I agree that we need better PDF handling support and that better options exist.

The MarkItDown project was originally an offshoot of the FileSurferAgent in Magentic-One (part of AutoGen), so PDF support was developed only as far as was needed to expediently get the bulk of a PDF's text into the LLM context window. But the project has grown since then, and we should clearly revisit this.

In the meantime, if you still want to use MarkItDown for file format detection, or for other formats, it may well be worth wrapping a more sophisticated PDF parser in a MarkItDown plugin, which will take precedence over the built-in PDF converter.

As we think about better PDF parsing, I also want to strike a balance between fidelity and the size of the dependencies, which is one reason why we don't currently include easyocr etc. (though we were using it before). I may revisit this now that we've moved to a model of having many optional extras. Also, newer models like Phi-4 multimodal can handle many tasks like transcription and image captioning, and so might be very useful in general.

RoffyS · 2025-03-16T21:24:07Z

I will check whether I can write a vision-model API wrapper as a MarkItDown plugin. I was playing around with OCR models, which most PDF-to-Markdown solutions use. It turns out that even the latest OCR models fail to fully capture the positional relationships between elements in the document, not to mention captioning. IMO integrating a multimodal LLM is definitely the right path 👍
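
To make "positional relationship" concrete, here is a rough sketch of the simplest geometric heuristic: group text fragments into table rows by their vertical position, then order each row left to right. Everything here is an assumption for illustration. The `(x, y, text)` fragment format, the `group_into_rows` helper, and the coordinates are invented and do not come from any particular OCR or PDF library. Real documents with merged cells, wrapped lines, or skewed scans break this quickly, which is the limitation described above.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of text, top to bottom.

    Hypothetical helper: fragments whose y differs from the row's first
    fragment by at most y_tolerance are treated as the same row.
    Assumes PDF-style coordinates where y grows upward on the page.
    """
    rows = []  # each entry: (anchor_y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order cells left to right by x
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Invented coordinates, listed column by column
fragments = [
    (50.0, 700.0, "TT"),
    (50.0, 680.0, "1"),
    (50.0, 660.0, "2"),
    (120.0, 700.5, "Nội dung"),
    (120.0, 680.5, "Nội dung abc"),
    (120.0, 660.5, "Nội dung cde"),
    (300.0, 699.5, "Ghi chú"),
    (300.0, 679.5, "Ghi chú 1"),
    (300.0, 659.5, "Ghi chú 2"),
]
for row in group_into_rows(fragments):
    print(row)
```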

aradhanachaturvedi · 2025-04-25T17:36:34Z

Were you able to find or use a plugin or tool for converting PDFs into Markdown? I'm currently in a similar situation and am looking for a way to efficiently convert PDF content into Markdown, with a particular focus on preserving table formatting.

carlkl · 2025-04-25T17:40:39Z

Maybe docling? I haven't tried it out yet.
See: https://docling-project.github.io/docling/
