PDF to Markdown doesn't preserve text relationship or indentation #83

Open
NikhilVerma opened this issue Dec 17, 2024 · 6 comments
Labels: enhancement (New feature or request), open for contribution (Invites open-source developers to contribute to the project.)

Comments

@NikhilVerma

Here's a sample PDF: https://www.bnm.gov.my/documents/20124/963937/Risk+Management+in+Technology+(RMiT).pdf/810b088e-6f4f-aa35-b603-1208ace33619?t=1592866162078

However, there are several parsing errors, which I will try to highlight below.

Line elements aren't preserved

Image

Output

G  8.5

To  promote  effective  technology  discussions  at  the  board  level,  the
composition  of  the  board  and  the  designated  board-level  committee  should
include at least a member with technology experience and competencies.

Indentations are ignored and text ordering is altered

Image
The TRMF must include the following:
(a)  clear definition of technology risk;
(b)  clear responsibilities assigned for the management of technology risk at
different  levels  and  across  functions,  with  appropriate  governance  and
reporting arrangements;
the  identification  of  technology  risks  to  which  the  financial  institution  is
exposed,  including  risks  from  the  adoption  of  new  or  emerging
technology;

(c)

(d)  risk classification of all information assets/systems based on its criticality;
(e)  risk measurement and assessment approaches and methodologies;
(f)
(g)  continuous monitoring to timely detect and address any material risks.

List elements are ignored and presented in entirely new lines

Image
1.

2.

3.

4.

5.

6.

The assurance shall be conducted by an independent external service provider
(ESP) engaged by the financial institution.

The independent ESP must understand the proposed services, the data flows,
system architecture, connectivity as well as its dependencies.

The  independent  ESP  shall  review  the  comprehensiveness  of  the  risk
assessment performed by the financial institution and validate the adequacy of
the control measures implemented or to be implemented.

The Risk Assessment Report (as per Part D in Appendix 7) shall state among
others, the scope of review, risk assessment methodology, summary of findings
and remedial actions (if any).

PDF parsing will always be a PITA, but I think these issues can be addressed by tracking the locations of the elements. Right now it seems to simply loop over the text elements and use simple algorithms to merge them together.
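To make the position-tracking idea concrete, here is a minimal sketch, not the project's actual implementation. It assumes the PDF extractor can report a left edge and a top coordinate for each piece of text; `Span`, `group_into_lines`, `render`, `y_tol`, and `indent_step` are hypothetical names and values chosen for illustration. Spans that sit at nearly the same height are merged into one line and ordered left to right, which reattaches a list label such as `(c)` to its text, and the left offset of each line is turned into indentation.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one extracted text element."""
    text: str
    x0: float   # left edge, in points
    top: float  # distance from the top of the page, in points


def group_into_lines(spans, y_tol=3.0):
    """Group spans whose top coordinates differ by at most y_tol."""
    lines = []
    for span in sorted(spans, key=lambda s: s.top):
        if lines and abs(span.top - lines[-1][0].top) <= y_tol:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each visual line, read left to right.
    return [sorted(line, key=lambda s: s.x0) for line in lines]


def render(spans, y_tol=3.0, indent_step=36.0):
    """Emit one text line per visual line, indented by horizontal offset."""
    lines = group_into_lines(spans, y_tol)
    if not lines:
        return ""
    left_margin = min(line[0].x0 for line in lines)
    out = []
    for line in lines:
        level = round((line[0].x0 - left_margin) / indent_step)
        out.append("    " * level + " ".join(s.text for s in line))
    return "\n".join(out)


# Made-up coordinates: the "(c)" label arrives after its text, as in the report.
spans = [
    Span("The TRMF must include the following:", 72, 100),
    Span("the identification of technology risks", 130, 130.5),
    Span("(c)", 108, 131),
    Span("(d)", 108, 145),
    Span("risk classification of all information assets/systems", 130, 145),
]
print(render(spans))
```

With these made-up coordinates, `(c)` ends up on the same line as its text and both list items are indented one level under the heading line. Real pages would need tolerances derived from font size rather than fixed constants, plus handling for wrapped paragraphs and multiple columns.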

@NikhilVerma changed the title from "Markdown doesn't preserve text relationship" to "Markdown doesn't preserve text relationship or indentation" on Dec 17, 2024
@NikhilVerma changed the title from "Markdown doesn't preserve text relationship or indentation" to "PDF to Markdown doesn't preserve text relationship or indentation" on Dec 17, 2024
@gagb added the "enhancement" and "open for contribution" labels on Dec 17, 2024
@Utsav-Mehta

Is anyone working on this?

@Utsav-Mehta

@gagb is anyone working on this? Feel free to assign this to me!

@gagb gagb assigned gagb and Utsav-Mehta and unassigned gagb Feb 19, 2025
@gagb
Contributor

gagb commented Feb 19, 2025

Assigned! Thanks for taking a look! We are a small team and this is an OSS project, so contributions are very welcome. Thanks again!

@Utsav-Mehta

Utsav-Mehta commented Feb 19, 2025

Thanks for assigning, @gagb. It's a pleasure. I am considering integrating vision models to handle relationships and indentation. Let me know if this sounds like a good starting point and if you have any suggestions!

@gagb
Contributor

gagb commented Feb 19, 2025

Sounds like a good experiment. Creating a plugin would be a good first step. Recommend using open source vision models.

@Utsav-Mehta

Thanks for the insights, will be creating a plugin first as you mentioned.
