PDF to Markdown doesn't preserve text relationship or indentation #83
Labels: enhancement, open for contribution
Here's a sample PDF - https://www.bnm.gov.my/documents/20124/963937/Risk+Management+in+Technology+(RMiT).pdf/810b088e-6f4f-aa35-b603-1208ace33619?t=1592866162078
However, there are several parsing errors, which I will try to highlight below:
Line elements aren't preserved
Output
Indentations are ignored and text ordering is altered
List elements are ignored and presented in entirely new lines
PDF parsing will always be a PITA, but I think these issues can be addressed by tracking the locations of the elements. Right now it feels like the converter simply loops over the textual elements and uses simple algorithms to merge them together.
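To make the suggestion concrete, here is a rough sketch of what position-aware merging could look like. It is not based on this project's code: the `Span` type, the `spans_to_markdown` function, and the `line_tol` and `indent_step` parameters are all hypothetical, and it assumes the extractor can report a left edge `x` and a vertical position `y` (measured from the top of the page) for each text fragment.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text fragment with its position on the page."""
    text: str
    x: float  # left edge of the fragment
    y: float  # vertical position, measured from the top of the page


def spans_to_markdown(spans, line_tol=2.0, indent_step=18.0):
    """Group spans into lines by y, order them by x, and indent by x offset."""
    if not spans:
        return ""

    # Reading order: top to bottom, then left to right.
    ordered = sorted(spans, key=lambda s: (s.y, s.x))

    # Fragments whose y is within line_tol of the line's first fragment
    # are treated as belonging to the same visual line.
    lines = []
    for span in ordered:
        if lines and abs(span.y - lines[-1][0].y) <= line_tol:
            lines[-1].append(span)
        else:
            lines.append([span])

    # Indentation is measured relative to the leftmost fragment on the page.
    left_margin = min(s.x for s in spans)

    out = []
    for line in lines:
        line.sort(key=lambda s: s.x)
        level = round((line[0].x - left_margin) / indent_step)
        out.append("    " * level + " ".join(s.text for s in line))
    return "\n".join(out)


# Fragments deliberately given out of order, as an extractor might emit them.
example = [
    Span("identify risks;", 126, 120),
    Span("The institution must", 108, 100.5),
    Span("(a)", 108, 120),
    Span("10.1", 72, 100),
]
print(spans_to_markdown(example, indent_step=36.0))
```

With this approach a list marker such as `(a)` stays on the same line as its text, and the sub-item is indented under its parent clause instead of being flattened. A real implementation would also need to handle multiple columns, varying font sizes, and page breaks, which this sketch ignores.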