Can not create markdownfile from bengali pdf #1185

amitabha81 · 2025-04-12T17:43:48Z

I downloaded a book from archive.org in PDF format and wrote the code below:

from markitdown import MarkItDown

#md = MarkItDown(use_azure=False)  # Important: set this flag
md = MarkItDown(use_azure=False, ocr_mode=True)

file_path = "E:/NLP/Bengali LLM all/DATA/Bharater-Shilpa-sanskritir.pdf"
result = md.convert(file_path)

# Save to .md file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)

print("Markdown saved as output.md")

The output file is 0 KB.
The input file link is: https://archive.org/details/in.ernet.dli.2015.266550/page/n135/mode/2up
I downloaded the PDF version.

afourney · 2025-04-14T15:59:58Z

Thanks. Let me take a look.

MarkItDown uses pdfminer to extract any existing text or OCR layer. Do you know whether this document has one? (I will check.)
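For anyone who wants to check for a text layer themselves, here is a minimal sketch that calls pdfminer.six directly, outside of MarkItDown. The helper names `has_text_layer` and `pdf_has_text_layer`, the 20-character threshold, and the 5-page sample are assumptions made for illustration. They are not part of either library.

```python
def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the extracted text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption; it only filters out stray
    form feeds and whitespace that image-only PDFs tend to produce.
    """
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def pdf_has_text_layer(pdf_path: str, max_pages: int = 5) -> bool:
    """Sample the first pages of a PDF and report whether any text comes out."""
    # Imported lazily so has_text_layer() works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path, maxpages=max_pages)
    return has_text_layer(text)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        path = sys.argv[1]
        if pdf_has_text_layer(path):
            print("Text layer found; text extraction should return content.")
        else:
            print("No text layer found; the PDF is probably scanned images only.")
```

If this reports no text layer, an empty `output.md` is the expected result of text extraction alone, and the file would need OCR first.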

You can also try the Azure Document Intelligence integration, which handles PDFs better.
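A rough sketch of that route, assuming the `docintel_endpoint` constructor argument described in the MarkItDown README and an Azure Document Intelligence resource you have already provisioned. The wrapper `convert_with_docintel` and its endpoint check are hypothetical, and Azure credentials are assumed to be configured in your environment.

```python
def convert_with_docintel(pdf_path: str, endpoint: str) -> str:
    """Convert a PDF to Markdown text via Azure Document Intelligence.

    `endpoint` is the URL of your own Document Intelligence resource
    (an assumption here; nothing in this thread supplies one).
    """
    if not endpoint or not endpoint.startswith("https://"):
        raise ValueError("expected an https:// Document Intelligence endpoint")

    # Imported lazily so the argument check above runs without markitdown installed.
    from markitdown import MarkItDown

    md = MarkItDown(docintel_endpoint=endpoint)
    result = md.convert(pdf_path)
    return result.text_content
```

It is worth checking that the returned string is non-empty before writing it to `output.md`, so that an empty conversion is caught early instead of producing a 0 KB file.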
