Add OCR fallback for scanned/non-searchable PDFs (#1156) #1268

Sghosh1999 · 2025-05-25T15:01:19Z

Description

Added OCR support to the PDF converter to handle scanned and non-searchable PDF files. When a PDF does not contain extractable text, the converter now falls back to OCR: pdf2image renders the pages to images and pytesseract extracts the text from them.

Changes

Updated PdfConverter to first attempt text extraction with pdfminer as before.
If no text is found, the converter falls back to OCR using pytesseract and pdf2image.
Added clear error messages if OCR dependencies are missing.
Updated documentation/comments to include installation instructions for new dependencies.
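
For illustration, the control flow described above can be sketched roughly as follows. This is a minimal sketch and not the code in this PR: the function names (`extract_pdf_text`, `pdfminer_text`, `tesseract_pages`) are hypothetical, and it assumes the converter works from a file path. The two extraction steps are passed in as callables so the fallback decision can be exercised without the optional dependencies installed.

```python
from typing import Callable, List


def extract_pdf_text(
    path: str,
    extract_text: Callable[[str], str],
    ocr_pages: Callable[[str], List[str]],
) -> str:
    """Return embedded text if there is any, otherwise fall back to OCR.

    Hypothetical helper: `extract_text` and `ocr_pages` are injected so the
    fallback logic does not depend on any particular library.
    """
    text = extract_text(path) or ""
    if text.strip():
        # The PDF is searchable; OCR is not needed.
        return text
    # No extractable text: OCR each page and join the results.
    return "\n\n".join(page.strip() for page in ocr_pages(path))


def pdfminer_text(path: str) -> str:
    """Text extraction via pdfminer (imported lazily)."""
    from pdfminer.high_level import extract_text

    return extract_text(path)


def tesseract_pages(path: str) -> List[str]:
    """OCR via pdf2image + pytesseract (imported lazily, both optional)."""
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path renders each PDF page to a PIL image.
    return [pytesseract.image_to_string(img) for img in convert_from_path(path)]
```

With the optional packages installed, a call such as `extract_pdf_text("scanned-document.pdf", pdfminer_text, tesseract_pages)` would try pdfminer first and run OCR only when the result is empty or whitespace.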

Example Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("scanned-document.pdf")
print(result.text_content)  # Will show OCR-extracted text if the PDF was not searchable

Related Issues

Closes #1156 — Pdf file conversion not working when pdf file is non scanable

Sghosh1999 · 2025-05-25T15:02:31Z

@Sghosh1999 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

@microsoft-github-policy-service agree

afourney · 2025-05-28T16:59:26Z

Thanks for the contribution. This looks promising. Let me do some testing.

NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
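
One possible way to address this concern, sketched under assumptions rather than taken from the PR: check whether the OCR packages are importable, and if they are not, return empty text instead of raising. The helper names `missing_modules` and `ocr_or_empty` are hypothetical.

```python
import importlib.util
from typing import Callable, Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ocr_or_empty(
    path: str,
    ocr: Callable[[str], str],
    required: Iterable[str] = ("pytesseract", "pdf2image"),
) -> str:
    """Run OCR only if its dependencies are present; otherwise return ''.

    Hypothetical helper: a PDF with no text layer is treated as empty
    rather than as an error when OCR is unavailable.
    """
    if missing_modules(required):
        return ""
    return ocr(path)
```

Whether to stay silent, log a warning, or raise in this case is a design choice for the maintainers; the sketch only shows the non-raising variant.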

Sghosh1999 added 2 commits May 25, 2025 19:43

Add OCR fallback for non-searchable PDFs (fixes microsoft#1156)

35e32d5

Added pytesseract and pdf2image dependency.

523f796
