Support for Parallel Processing of files #135
Comments
Were you using the LLM client? Even if not, that is not that much time. As far as I know, parallel processing is not really feasible here in Python due to deadlock issues.
Nope, I used it without the LLM client. I agree on the deadlock issue, but is there still a way we can process the pages in parallel, so that markitdown can be considered over PyMuPDF or other similarly fast processing libraries?
I see that PyMuPDF is faster, but its conversion quality is lacking. That said, I just looked at their code: there is a possibility of processing multiple documents in parallel, but not individual pages. For processing specific pages, pdfminer.six should support it.
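For what it's worth, a minimal sketch of page-level parallelism with pdfminer.six could look like the following. The file path and worker count are placeholders; it relies on `extract_text()`'s `page_numbers` argument to pull one page per worker:

```python
# Sketch: extract PDF pages in parallel with pdfminer.six.
# "sample.pdf" and the worker count are placeholders.
from concurrent.futures import ProcessPoolExecutor

from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage


def extract_page(path: str, page_index: int) -> str:
    # pdfminer.six lets us restrict extraction to specific (zero-based) pages.
    return extract_text(path, page_numbers=[page_index])


def extract_parallel(path: str, workers: int = 4) -> list[str]:
    # Count pages up front, then fan out one task per page.
    with open(path, "rb") as fh:
        num_pages = sum(1 for _ in PDFPage.get_pages(fh))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page, [path] * num_pages, range(num_pages)))


if __name__ == "__main__":
    pages = extract_parallel("sample.pdf")
    print(f"Extracted {len(pages)} pages")
```

Each worker re-opens the file, so there is some per-page overhead; whether that pays off depends on page count and layout complexity.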
I agree with @rudrakshkarpe on this. It would help digitization pipelines with heavy document loads. We need to add a flag for it and support this at the service layer.
@rudrakshkarpe, @sqrt676 why don't you use PySpark for this if you are working with a large document corpus? Here is a very basic example: https://www.geeksforgeeks.org/convert-python-functions-into-pyspark-udf/
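Roughly along the lines of that example, wrapping the conversion in a plain PySpark UDF over a DataFrame of file paths might look like this (a sketch; the `MarkItDown().convert(path).text_content` call and the paths are assumptions, and executors need markitdown installed plus access to the files):

```python
# Sketch: distribute per-file conversion across Spark executors with a UDF.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def convert_file(path: str) -> str:
    # Import inside the function so each executor builds its own converter.
    from markitdown import MarkItDown  # assumed converter API
    return MarkItDown().convert(path).text_content


convert_udf = udf(convert_file, StringType())

spark = SparkSession.builder.appName("markitdown-batch").getOrCreate()
paths = ["docs/a.pdf", "docs/b.pdf", "docs/c.pdf"]  # placeholder paths
df = spark.createDataFrame([(p,) for p in paths], ["path"])
df.withColumn("markdown", convert_udf("path")).show(truncate=False)
```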
@prateekralhan I am afraid the PySpark overhead will add more latency to the process?
@rudrakshkarpe, compared to what Python offers natively, this would still be far more performant. If you feel you need further optimization, you can use pandas_udf()-based functionality, and you need to ensure your Spark executors have enough memory for these operations given the size of the document corpus.
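A pandas_udf() variant of the same idea might look roughly like this (same assumption about the converter call; it needs pyarrow on the cluster, and batch size and executor memory have to be tuned to the corpus):

```python
# Sketch: a vectorized pandas_udf that converts one batch of paths per call.
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType


@pandas_udf(StringType())
def convert_batch(paths: pd.Series) -> pd.Series:
    from markitdown import MarkItDown  # assumed converter API, built once per batch
    md = MarkItDown()
    return paths.map(lambda p: md.convert(p).text_content)


# Usage on the DataFrame from the previous sketch:
# df.withColumn("markdown", convert_batch("path"))
```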
Thanks @prateekralhan, I'll try this approach.
Description
Currently, PDF conversion with markitdown processes pages sequentially, leading to long conversion times. Adding the ability to process PDF pages in parallel would significantly improve performance and reduce overall conversion time. Please consider implementing parallel page processing support.
[Screenshot: current PDF conversion performance]
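In the meantime, a possible workaround is to parallelize across whole files rather than pages, for example with a process pool. This is only a sketch, assuming the `MarkItDown().convert(path).text_content` API and placeholder paths:

```python
# Sketch: convert several PDFs in parallel, one process per file.
# Paths and worker count are placeholders.
from concurrent.futures import ProcessPoolExecutor

from markitdown import MarkItDown  # assumed converter API


def convert_one(path: str) -> str:
    # Build the converter inside the worker so it is not pickled across processes.
    return MarkItDown().convert(path).text_content


if __name__ == "__main__":
    paths = ["docs/report1.pdf", "docs/report2.pdf", "docs/report3.pdf"]
    with ProcessPoolExecutor(max_workers=4) as pool:
        for path, markdown in zip(paths, pool.map(convert_one, paths)):
            print(path, len(markdown), "characters")
```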