
Support for Parallel Processing of files #135


Open
rudrakshkarpe opened this issue Dec 19, 2024 · 8 comments
Labels
enhancement New feature or request open for contribution Invites open-source developers to contribute to the project.

Comments

@rudrakshkarpe

rudrakshkarpe commented Dec 19, 2024

Description

Currently, PDF conversion with markitdown processes pages sequentially, leading to long rendering times. By adding the ability to process PDF pages in parallel, we could significantly improve performance and reduce overall conversion time. Please consider implementing parallel page processing support.

Current PDF Conversion Performance

| File | Size | Pages | Time (seconds) |
| --- | --- | --- | --- |
| Test_parsing_50MB.pdf | 50 MB | 1113 | 34.64 |
| Test_parsing_20MB.pdf | 20 MB | 598 | 9.23 |
| Test_parsing_2MB.pdf | 2 MB | 23 | 0.27 |
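The requested behaviour can be sketched with the standard library alone. Everything below is a hypothetical illustration, not markitdown's actual code: `extract_page` is a stand-in for a real per-page extractor (pdfminer.six's `extract_text(path, page_numbers={n})` could play that role), and `convert_parallel` shows one way to fan pages out to worker processes and reassemble the output in page order:

```python
# Minimal sketch of parallel per-page PDF conversion (stdlib only).
# extract_page is a hypothetical stand-in for a real per-page extractor.
from concurrent.futures import ProcessPoolExecutor


def extract_page(args):
    # Placeholder: a real implementation would parse one PDF page here.
    path, page_no = args
    return f"text of {path} page {page_no}"


def convert_parallel(path, num_pages, workers=4):
    """Fan the pages out to worker processes, reassemble in page order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so pages come back in sequence.
        pages = pool.map(extract_page, [(path, n) for n in range(num_pages)])
    return "\n\n".join(pages)


if __name__ == "__main__":
    print(convert_parallel("Test_parsing_2MB.pdf", 3, workers=2))
```

Using processes rather than threads sidesteps the GIL for CPU-bound parsing; the trade-off is per-page process overhead, which only pays off on large documents.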
@AliHaider20

AliHaider20 commented Dec 19, 2024

Were you using the LLM client? Even if not, that is not much time. As far as I know, parallel processing is not feasible in Python because of deadlock issues.

@rudrakshkarpe

rudrakshkarpe commented Dec 19, 2024

Nope, I used it without the LLM client. I agree on the deadlock issue, but is there still a way we can process the pages in parallel, so that markitdown can be preferred over PyMuPDF or other similarly fast processing libraries?

@AliHaider20

AliHaider20 commented Dec 19, 2024

I see that PyMuPDF is faster, but it falls short on quality. That said, I just looked at their code, and there is a possibility of parallel-processing multiple documents, but not individual pages. For processing individual pages, pdfminer.six should support it.

Reference: https://medium.com/social-impact-analytics/comparing-4-methods-for-pdf-text-extraction-in-python-fd34531034f
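The document-vs-page distinction above is worth spelling out: document-level parallelism needs no per-page support from the PDF library at all, since each worker converts a whole file. A minimal sketch, where `convert_file` is a hypothetical stand-in for the real markitdown call:

```python
# Sketch of document-level parallelism: one worker process per whole PDF.
# convert_file is a hypothetical placeholder for the real conversion call,
# e.g. MarkItDown().convert(path).text_content.
from concurrent.futures import ProcessPoolExecutor


def convert_file(path):
    # Placeholder: a real implementation would convert the whole file here.
    return f"# {path} converted"


def convert_many(paths, workers=4):
    """Convert several documents concurrently, one process per document."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(convert_file, paths)))


if __name__ == "__main__":
    results = convert_many(["a.pdf", "b.pdf", "c.pdf"], workers=2)
```

For a batch digitization pipeline this is often the simpler win; per-page parallelism only matters when single huge documents dominate the workload.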

@gagb gagb added enhancement New feature or request open for contribution Invites open-source developers to contribute to the project. labels Dec 19, 2024
@sqrt676

sqrt676 commented Dec 19, 2024

I agree with @rudrakshkarpe on this. It would help digitization pipelines with heavy document loads. We should add a flag and support this as a service layer.

@prateekralhan

@rudrakshkarpe, @sqrt676, why don't you use PySpark for this if you are working with a large document corpus?
Just wrap your Python function for the MD conversion as a PySpark UDF and use it.

Here is a very basic example: https://www.geeksforgeeks.org/convert-python-functions-into-pyspark-udf/
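The suggestion above can be sketched as follows. This is a hypothetical illustration, not markitdown code: `convert_to_markdown` stands in for the real conversion call, and `run_spark_job` assumes pyspark is installed (the import is kept lazy so the plain-Python part works without it):

```python
# Hedged sketch of wrapping a conversion function as a PySpark UDF.
# convert_to_markdown is a hypothetical placeholder for the real call,
# e.g. MarkItDown().convert(path).text_content.


def convert_to_markdown(path: str) -> str:
    # Placeholder: a real implementation would convert the file here.
    return f"# converted {path}"


def run_spark_job(paths):
    # Imported lazily so the pure-Python part runs without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("md-conversion").getOrCreate()
    convert_udf = udf(convert_to_markdown, StringType())
    df = spark.createDataFrame([(p,) for p in paths], ["path"])
    rows = df.withColumn("markdown", convert_udf(col("path"))).collect()
    spark.stop()
    return {r["path"]: r["markdown"] for r in rows}


if __name__ == "__main__":
    print(convert_to_markdown("sample.pdf"))
```

Spark then distributes the UDF calls across executors, which is where the parallelism for a large corpus comes from.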

@rudrakshkarpe

> @rudrakshkarpe, @sqrt676, why don't you use PySpark for this if you are working with a large document corpus? Just wrap your Python function for the MD conversion as a PySpark UDF and use it.
>
> Here is a very basic example: https://www.geeksforgeeks.org/convert-python-functions-into-pyspark-udf/

@prateekralhan I am afraid the PySpark operation will add more latency to the process?

@prateekralhan

@rudrakshkarpe, compared with what Python offers natively, this would still be far more performant.

If you still feel you need more optimization, you can use pandas_udf()-based functionality; just make sure your Spark executors have enough memory for these operations, sized to the document corpus.

@rudrakshkarpe

Thanks @prateekralhan, I'll try this approach.
