
Add OCR fallback for scanned/non-searchable PDFs (#1156) #1268


Open: wants to merge 2 commits into main


Sghosh1999

Description

Added OCR support to the PDF converter so it can handle scanned and non-searchable PDF files. When a PDF contains no extractable text, the converter now falls back to OCR (via pytesseract and pdf2image): the pages are rendered as images and the text is recognized from those images.

Changes

  • Updated PdfConverter to first attempt text extraction with pdfminer as before.
  • If no text is found, the converter falls back to OCR using pytesseract and pdf2image.
  • Added clear error messages if OCR dependencies are missing.
  • Updated documentation/comments to include installation instructions for new dependencies.
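
The fallback described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `pdf_to_text`, `_pdfminer_text`, and `_ocr_text` are hypothetical, and the real PdfConverter may be structured differently. The sketch assumes `pytesseract` and `pdf2image` are installed with pip, and that the Tesseract and Poppler system packages they wrap are available.

```python
from typing import Callable


def _pdfminer_text(path: str) -> str:
    """Extract the embedded text layer with pdfminer (imported lazily)."""
    from pdfminer.high_level import extract_text

    return extract_text(path)


def _ocr_text(path: str) -> str:
    """Render each page to an image and run Tesseract OCR on it."""
    try:
        import pytesseract
        from pdf2image import convert_from_path
    except ImportError as exc:
        # Hypothetical error message; the PR's actual wording may differ.
        raise RuntimeError(
            "OCR fallback needs the optional packages 'pytesseract' and "
            "'pdf2image' (plus the Tesseract and Poppler system packages)."
        ) from exc

    pages = convert_from_path(path)  # one PIL image per page
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)


def pdf_to_text(
    path: str,
    extract: Callable[[str], str] = _pdfminer_text,
    ocr: Callable[[str], str] = _ocr_text,
) -> str:
    """Try normal text extraction first; use OCR only if nothing was found."""
    text = extract(path) or ""
    if text.strip():
        return text  # searchable PDF: keep the existing behaviour
    return ocr(path)  # scanned or image-only PDF: fall back to OCR
```

Passing the extractor and the OCR step in as parameters is only a convenience for this sketch; it lets the fallback decision be exercised without real PDF files or OCR tooling.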

Example Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("scanned-document.pdf")
print(result.text_content)  # Will show OCR-extracted text if the PDF was not searchable

Related Issues

Closes #1156 — Pdf file conversion not working when pdf file is non scanable

@Sghosh1999
Author

@Sghosh1999 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

@afourney
Member

Thanks for the contribution. This looks promising. Let me do some testing.

NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
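
One possible way to handle that case, sketched under assumptions: treat missing OCR dependencies as a warning and return empty text, so a PDF that simply has no text does not fail conversion. The helper name `ocr_or_empty` and the warning wording are hypothetical and not part of this PR.

```python
import warnings
from typing import Callable


def ocr_or_empty(path: str, ocr: Callable[[str], str]) -> str:
    """Run the OCR step, but degrade to empty text if OCR is unavailable."""
    try:
        return ocr(path)
    except ImportError:
        # The PDF may legitimately contain no text, so warn instead of failing.
        warnings.warn(
            f"No text layer found in {path!r} and OCR dependencies are not "
            "installed; returning empty text.",
            RuntimeWarning,
            stacklevel=2,
        )
        return ""
```

With this shape, callers who need OCR still see a clear hint about what to install, while callers who do not need it get an empty result rather than an exception.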

Development

Successfully merging this pull request may close these issues.

Pdf file conversion not wokring when pdf file is non scanable
2 participants