
Support for Parallel Processing of files #135


Open
rudrakshkarpe opened this issue Dec 19, 2024 · 8 comments
Labels
enhancement New feature or request open for contribution Invites open-source developers to contribute to the project.

Comments

@rudrakshkarpe

rudrakshkarpe commented Dec 19, 2024

Description

Currently, PDF conversion with markitdown processes pages sequentially, leading to long rendering times. By adding the ability to process PDF pages in parallel, we could significantly improve performance and reduce overall conversion time. Please consider implementing parallel page processing support.

Current PDF Conversion Performance

| File | Size | Pages | Time (seconds) |
| --- | --- | --- | --- |
| Test_parsing_50MB.pdf | 50 MB | 1113 | 34.64 |
| Test_parsing_20MB.pdf | 20 MB | 598 | 9.23 |
| Test_parsing_2MB.pdf | 2 MB | 23 | 0.27 |
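The requested behaviour can be sketched with the standard library alone. Everything below is a hypothetical illustration, not markitdown's actual code: `extract_page` is a stand-in for a real per-page extractor (pdfminer.six's `extract_text(path, page_numbers={n})` could play that role), and `convert_parallel` shows one way to fan pages out to worker processes and reassemble the output in page order:

```python
# Minimal sketch of parallel per-page PDF conversion (stdlib only).
# extract_page is a hypothetical stand-in for a real per-page extractor.
from concurrent.futures import ProcessPoolExecutor


def extract_page(args):
    # Placeholder: a real implementation would parse one PDF page here.
    path, page_no = args
    return f"text of {path} page {page_no}"


def convert_parallel(path, num_pages, workers=4):
    """Fan the pages out to worker processes, reassemble in page order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so pages come back in sequence.
        pages = pool.map(extract_page, [(path, n) for n in range(num_pages)])
    return "\n\n".join(pages)


if __name__ == "__main__":
    print(convert_parallel("Test_parsing_2MB.pdf", 3, workers=2))
```

Using processes rather than threads sidesteps the GIL for CPU-bound parsing; the trade-off is per-page process overhead, which only pays off on large documents.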
@AliHaider20

AliHaider20 commented Dec 19, 2024

Were you using the LLM client? Even if not, that is not much time. As far as I know, parallel processing is not feasible in Python because of deadlock issues.

@rudrakshkarpe

rudrakshkarpe commented Dec 19, 2024

Nope, I used it without the LLM client. I agree on the deadlock issue, but is there still a way we can process the pages in parallel, so that markitdown can be preferred over PyMuPDF or other similarly fast processing libraries?

@AliHaider20

AliHaider20 commented Dec 19, 2024

I see that PyMuPDF is faster, but it falls short on quality. That said, I just looked at their code, and there is a possibility of parallel-processing multiple documents, but not individual pages. For processing individual pages, pdfminer.six should support it.

Reference: https://medium.com/social-impact-analytics/comparing-4-methods-for-pdf-text-extraction-in-python-fd34531034f
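The document-vs-page distinction above is worth spelling out: document-level parallelism needs no per-page support from the PDF library at all, since each worker converts a whole file. A minimal sketch, where `convert_file` is a hypothetical stand-in for the real markitdown call:

```python
# Sketch of document-level parallelism: one worker process per whole PDF.
# convert_file is a hypothetical placeholder for the real conversion call,
# e.g. MarkItDown().convert(path).text_content.
from concurrent.futures import ProcessPoolExecutor


def convert_file(path):
    # Placeholder: a real implementation would convert the whole file here.
    return f"# {path} converted"


def convert_many(paths, workers=4):
    """Convert several documents concurrently, one process per document."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(convert_file, paths)))


if __name__ == "__main__":
    results = convert_many(["a.pdf", "b.pdf", "c.pdf"], workers=2)
```

For a batch digitization pipeline this is often the simpler win; per-page parallelism only matters when single huge documents dominate the workload.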

@gagb gagb added enhancement New feature or request open for contribution Invites open-source developers to contribute to the project. labels Dec 19, 2024
@sqrt676

sqrt676 commented Dec 19, 2024

I agree with @rudrakshkarpe on this. It would help digitization pipelines with heavy document loads. We should add a flag and support this as a service layer.

@prateekralhan

@rudrakshkarpe, @sqrt676, why don't you use PySpark for this if you are working with a large document corpus?
Just wrap your Python function for the MD conversion as a PySpark UDF and use it.

Here is a very basic example: https://www.geeksforgeeks.org/convert-python-functions-into-pyspark-udf/
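The suggestion above can be sketched as follows. This is a hypothetical illustration, not markitdown code: `convert_to_markdown` stands in for the real conversion call, and `run_spark_job` assumes pyspark is installed (the import is kept lazy so the plain-Python part works without it):

```python
# Hedged sketch of wrapping a conversion function as a PySpark UDF.
# convert_to_markdown is a hypothetical placeholder for the real call,
# e.g. MarkItDown().convert(path).text_content.


def convert_to_markdown(path: str) -> str:
    # Placeholder: a real implementation would convert the file here.
    return f"# converted {path}"


def run_spark_job(paths):
    # Imported lazily so the pure-Python part runs without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("md-conversion").getOrCreate()
    convert_udf = udf(convert_to_markdown, StringType())
    df = spark.createDataFrame([(p,) for p in paths], ["path"])
    rows = df.withColumn("markdown", convert_udf(col("path"))).collect()
    spark.stop()
    return {r["path"]: r["markdown"] for r in rows}


if __name__ == "__main__":
    print(convert_to_markdown("sample.pdf"))
```

Spark then distributes the UDF calls across executors, which is where the parallelism for a large corpus comes from.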

@rudrakshkarpe

> @rudrakshkarpe, @sqrt676, why don't you use PySpark for this if you are working with a large document corpus? Just wrap your Python function for the MD conversion as a PySpark UDF and use it.
>
> Here is a very basic example: https://www.geeksforgeeks.org/convert-python-functions-into-pyspark-udf/

@prateekralhan I am afraid the PySpark operation will add more latency to the process?

@prateekralhan

@rudrakshkarpe, compared with what Python offers natively, this would still be far more performant.

If you still feel you need more optimization, you can use pandas_udf()-based functionality; just make sure your Spark executors have enough memory for these operations, sized to the document corpus.

@rudrakshkarpe

Thanks @prateekralhan, I'll try this approach.
