Nested tables in DOCX are lost when converting to Markdown #1248

Open
Wuhall opened this issue May 14, 2025 · 1 comment
Wuhall commented May 14, 2025

Description:
When converting a DOCX file containing nested tables to Markdown using markitdown, the inner table content is discarded in the output. This occurs consistently with specific document structures.

Steps to Reproduce:

  1. Environment:
    • Device: MacBook Pro with M3 chip

    • Installation:

    pip install -e 'packages/markitdown[all]'
  2. Test File:
    • [Attach a minimal DOCX file with nested tables (e.g., outer table → inner table → text)].

  3. Command:

    markitdown path-to-file.docx > document.md
  4. Observed Result:
    • Outer table structure is preserved, but inner table content is missing in document.md.

  5. Expected Result:
    • Both outer and inner tables should be rendered in Markdown (e.g., as nested HTML tables or flattened Markdown).
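For anyone who needs a test file for step 2, here is a rough sketch that writes a small DOCX with one table nested inside another, using only the Python standard library. It is not the file from the original report. The helper names (`para`, `table`, `write_nested_table_docx`) and the output file name are made up for illustration, and the package is stripped to the bare parts (no styles, no section properties), so some consumers may be stricter about it than others.

```python
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def para(text):
    """Hypothetical helper: a paragraph containing one run of plain text."""
    return f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"


def table(*cells):
    """Hypothetical helper: a one-row table; each cell is block-level XML."""
    grid = "<w:gridCol/>" * len(cells)
    tcs = "".join(f"<w:tc>{cell}</w:tc>" for cell in cells)
    return (
        f"<w:tbl><w:tblPr/><w:tblGrid>{grid}</w:tblGrid>"
        f"<w:tr>{tcs}</w:tr></w:tbl>"
    )


def build_document_xml():
    inner = table(para("inner A"), para("inner B"))
    # A table cell should end with a paragraph, so an empty one follows
    # the nested table inside the right-hand outer cell.
    outer = table(para("outer left"), inner + "<w:p/>")
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{outer}<w:p/></w:body>'
        '</w:document>'
    )


def write_nested_table_docx(path):
    """Write the three parts a minimal DOCX package needs."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", build_document_xml())


if __name__ == "__main__":
    write_nested_table_docx("nested-tables.docx")
```

Running the script and then passing `nested-tables.docx` to the command in step 3 should show whether the "inner A" / "inner B" cells survive the conversion.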

Wuhall commented May 14, 2025

I've reviewed the conversion pipeline and identified an issue with nested table handling:

Current Behavior:

  1. DOCX → HTML conversion works correctly (preserves nested tables)
  2. HTML → Markdown conversion using markdownify fails to properly handle nested table structures

Problem:
markdownify flattens nested tables into single-level Markdown tables. This causes:

• Loss of table hierarchy
• Misaligned columns
• Broken formatting in complex documents
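To confirm which stage loses the nesting, you can check the intermediate HTML for nested `<table>` elements before it reaches markdownify. A minimal sketch with the standard library, assuming you already have the HTML as a string (how you capture it from the pipeline is up to you; `TableDepthParser` and `max_table_depth` are names made up for this example):

```python
from html.parser import HTMLParser


class TableDepthParser(HTMLParser):
    """Tracks how deeply <table> elements are nested in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open <table> tags
        self.max_depth = 0  # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def max_table_depth(html):
    """Return 0 for no tables, 1 for flat tables, 2 or more for nested ones."""
    parser = TableDepthParser()
    parser.feed(html)
    parser.close()
    return parser.max_depth
```

A result of 2 or more on the intermediate HTML means the nested table is still there at that point, which would point at the HTML → Markdown step rather than the DOCX → HTML step.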

I have submitted a PR to markdownify. If you run into a similar problem, you can add the following code to markdownify/__init__.py:

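def process_tag(self, node, parent_tags=None):
    # Handle nested tables. parent_tags defaults to None, so check it
    # before the membership test to avoid a TypeError.
    if node.name == 'table' and parent_tags and 'table' in parent_tags:
        # This table sits inside another table: return its HTML
        # representation instead of flattening it.
        return str(node)
    # ... the rest of the existing process_tag body stays unchanged ...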
