Titles and subtitles not recognized on docx documents #329

Open
Vrobin0101 opened this issue Feb 12, 2025 · 2 comments
Vrobin0101 commented Feb 12, 2025

I was trying to convert this file using the Python API:

Image

and I get this output:

Default paragraph style

title

subtitle

# heading 1

## heading 2

### heading 3

#### heading 4

block quotation

preformated text

body text

normal

Is there a way to preserve at least the title and subtitle information?

Document: test.docx

@adamdavidconn

This is related to an issue with mammoth not converting headings correctly: mwilliamson/python-mammoth#153

One workaround is a custom plugin that calls mammoth's convert_to_markdown method directly, instead of converting to HTML first as this package does. Otherwise, wait until there is a fix in mammoth.
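For illustration, a minimal sketch of the direct mammoth call suggested above. It assumes the mammoth package is installed. `docx_to_markdown` is a hypothetical helper name, not part of mammoth or markitdown, and the plugin wiring around it is not shown. Whether this output treats Title and Subtitle as wanted is not confirmed in this thread.

```python
def docx_to_markdown(path, style_map=None):
    """Convert a .docx file to Markdown by calling mammoth directly.

    Hypothetical helper for illustration. Returns (markdown_text, messages).
    """
    # Imported here so the module still loads where mammoth is not installed.
    import mammoth  # third-party: pip install mammoth

    # Only forward style_map when the caller actually supplied one.
    options = {} if style_map is None else {"style_map": style_map}

    # mammoth expects a binary file object, not a path string.
    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_markdown(docx_file, **options)

    # result.value holds the converted text; result.messages holds warnings.
    return result.value, result.messages
```

Usage would look like `text, messages = docx_to_markdown("test.docx")`, with `test.docx` being the file attached to this issue.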

@RichardAffolter

I fixed this by passing a style map to the convert() function:

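from markitdown import MarkItDown

# Map the Word "Title" and "Subtitle" paragraph styles to h1 and h2.
style_map = """
p[style-name='Title'] => h1:fresh
p[style-name='Subtitle'] => h2:fresh
"""

md = MarkItDown()
# Pass the style map along with the document to convert.
result = md.convert("test.docx", style_map=style_map)
print(result.text_content)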

which gives me the desired output:

Default paragraph style

# title

## subtitle

# heading 1

## heading 2

### heading 3

#### heading 4

block quotation

preformated text

body text

normal
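If more styles need mapping, the style map string can be generated instead of typed by hand. The sketch below is only an illustration: `build_style_map` is a hypothetical helper, not part of markitdown or mammoth. Each generated line follows the same pattern as the style map above, where `p[style-name='...']` matches paragraphs with that Word style name and `=> hN:fresh` maps them to a new heading element of level N.

```python
def build_style_map(heading_styles):
    """Build a style map string from {Word style name: heading level}.

    Hypothetical helper for illustration only.
    """
    lines = []
    for style_name, level in heading_styles.items():
        if not 1 <= level <= 6:
            raise ValueError(f"heading level must be 1-6, got {level}")
        if "'" in style_name:
            # Quoting rules for apostrophes are not handled in this sketch.
            raise ValueError(f"unsupported style name: {style_name!r}")
        lines.append(f"p[style-name='{style_name}'] => h{level}:fresh")
    return "\n".join(lines)


style_map = build_style_map({"Title": 1, "Subtitle": 2})
print(style_map)
# p[style-name='Title'] => h1:fresh
# p[style-name='Subtitle'] => h2:fresh
```

The resulting string can be passed as `style_map=` to `md.convert(...)` in the same way as the hand-written one.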
