
Fix Nested Table Conversion in HTML-to-Markdown Process #219


Closed
wants to merge 252 commits into from


@Wuhall Wuhall commented May 14, 2025

Type: Bug Fix

Problem
The current process_tag function fails to properly handle nested tables during HTML-to-Markdown conversion, causing:

  1. Complete loss of nested table structures
  2. Flattened output that breaks document hierarchy
  3. Unreadable Markdown table syntax

Root Cause
• Standard Markdown doesn't support nested tables natively

• Existing implementation forced flattening of all table structures

• Child tables were being parsed as regular table cells

Solution
Modified the table conversion logic to:

  1. Detect nested table structures during HTML parsing
  2. Preserve inner tables as raw HTML blocks
  3. Convert outer tables to standard Markdown syntax
  4. Maintain proper whitespace and formatting
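Step 1 can be illustrated independently of markdownify. The sketch below assumes nothing beyond Python's standard-library html.parser and simply counts tables that open inside another table. TableDepthTracker and count_nested_tables are hypothetical names used for illustration; they are not part of this PR.

```python
from html.parser import HTMLParser


class TableDepthTracker(HTMLParser):
    """Hypothetical helper: track how deeply <table> tags are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of currently open <table> tags
        self.nested_count = 0   # tables opened while another table was open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.nested_count += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def count_nested_tables(html):
    """Return how many tables in `html` sit inside another table."""
    tracker = TableDepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.nested_count
```

A converter could run a check like this up front to decide whether any special handling is needed at all.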

Implementation Details
Key changes in process_tag:

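def process_tag(self, node, parent_tags=None):
    # Handle nested tables: parent_tags holds the names of the enclosing tags.
    if node.name == 'table' and parent_tags and 'table' in parent_tags:
        # This table is nested within another table, so return its raw HTML
        # instead of converting it to Markdown.
        return str(node)
    # ... rest of process_tag unchanged ...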
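
The excerpt depends on the set of ancestor tag names being passed down during recursion. As a rough, self-contained illustration of that idea (not markdownify's actual code), the toy converter below uses the standard-library xml.etree.ElementTree on a well-formed snippet. It assumes rows are direct children of the table and emits simplified pipe rows without a header separator; the function body is a hypothetical stand-in for the real process_tag.

```python
import copy
import xml.etree.ElementTree as ET


def process_tag(node, parent_tags=frozenset()):
    """Toy converter: outer tables become pipe rows, nested tables stay raw HTML."""
    if node.tag == "table" and "table" in parent_tags:
        # Serialize a tail-less copy so text following </table> is not duplicated.
        clone = copy.copy(node)
        clone.tail = None
        return ET.tostring(clone, encoding="unicode", method="html")

    child_tags = parent_tags | {node.tag}  # ancestors as seen by the children
    converted = [process_tag(child, child_tags) for child in node]

    if node.tag == "table":
        return "\n".join(converted)       # one row per <tr>
    if node.tag == "tr":
        return "|" + "".join(converted)   # each cell already ends with " |"

    # Cells and other tags: interleave own text with converted children.
    pieces = [node.text or ""]
    for child, out in zip(node, converted):
        pieces += [out, child.tail or ""]
    text = " ".join(" ".join(pieces).split())  # collapse whitespace runs
    return f" {text} |" if node.tag == "td" else text


html = (
    "<table><tr><td>outer</td></tr>"
    "<tr><td><p>team</p><table><tr><td>no</td><td>name</td></tr></table></td></tr></table>"
)
markdown = process_tag(ET.fromstring(html))
```

Here the outer table yields two pipe rows, and the second row carries the inner table verbatim as HTML.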

Test Case
Input HTML:

<p> </p>
<p> Test</p>
<table>
    <tr>
        <td>黄河</td>
    </tr>
    <tr>
        <td>长江</td>
    </tr>
    <tr>
        <td>
            <p>七、team</p>
            <table>
                <tr>
                    <td>
                        <p>no</p>
                    </td>
                    <td>
                        <p>name</p>
                    </td>
                    <td>
                        <p>sex</p>
                    </td>
                    <td>
                        <p>pos</p>
                    </td>
                    <td>
                        <p>pos</p>
                    </td>
                    <td>
                        <p>1</p>
                    </td>
                    <td>
                        <p>1</p>
                    </td>
                    <td>
                        <p>1</p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p>1</p>
                    </td>
                    <td>
                        <p>**</p>
                    </td>
                    <td>
                        <p>-</p>
                    </td>
                    <td>
                        <p>--</p>
                    </td>
                    <td></td>
                    <td>
                        <p>--</p>
                    </td>
                    <td>
                        <p>ITa</p>
                    </td>
                    <td>
                        <p>11</p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p>2</p>
                    </td>
                    <td>
                        <p>**</p>
                    </td>
                    <td>
                        <p>-</p>
                    </td>
                    <td>
                        <p>--</p>
                    </td>
                    <td></td>
                    <td>
                        <p>---</p>
                    </td>
                    <td>
                        <p>ITb</p>
                    </td>
                    <td>
                        <p>1</p>
                    </td>
                </tr>
            </table>
        </td>
    </tr>
    <tr>
        <td>
            <p>11:</p>
            <p>22: </p>
            <p> </p>
            <p> date:</p>
        </td>
    </tr>
</table>

Previous Output (Broken):

Test

黄河
长江
七、team
11: 22: date:

New Output (Correct):

Test

黄河
长江
七、team

| no | name | sex | pos | pos | 1 | 1 | 1 |
| 1 | | - | -- | | -- | ITa | 11 |
| 2 | | - | -- | | --- | ITb | 1 |

11: 22: date:

skoczen and others added 30 commits November 28, 2017 12:07
  • Fixes matthewwithanm#5 as well as an issue where `<p>foo<p><ul><li>bar</li></ul>` gets converted to `foo * bar`, which is not correct
  • Add newline before and after a markdown list
  • Remove prefixed and suffixed spaces from inline tags
  • Nested lists did not work: after a nested list ended, a new line was inserted, leaving a large gap before the rest of the parent list. Lists are now prefixed and suffixed with a single newline, and the tests reflect this.
  • Add workflow for publishing to PyPI.
chrispy-snps and others added 22 commits January 27, 2025 11:55
  • Signed-off-by: chrispy
  • add beautiful_soup_parser option; add Beautiful Soup parser argument to command line (Co-authored-by: Vincent Kelleher, AlexVonB)
@chrispy-snps
Collaborator

@Wuhall - thank you for your contribution! However, I am not sure I agree with this change for the following reasons:

  • Many (most?) users want to convert the HTML input entirely to Markdown, with no HTML remaining.
  • Table cells can contain a variety of structured content beyond tables: lists, definition lists, preformatted blocks, blockquotes.
  • It is easy to subclass Markdownify to get the specific behavior you need for your use case.

@AlexVonB, what do you think?

@Wuhall
Author

Wuhall commented May 17, 2025

@chrispy-snps thanks for your feedback! I completely understand the concern about keeping Markdown output clean and portable. Here’s a compromise that might address both sides:

  1. User-configurable behavior:
    • Add an optional parameter (e.g., nested_tables=False) that lets users explicitly opt in to nested table conversion.

    • Default behavior remains unchanged (strip nested tables), ensuring backward compatibility.

  2. Why this helps:
    • Many users do need nested structures (e.g., Confluence/Notion exports, technical docs), and currently, they lose data silently.

    • A subclass workaround is possible, but it’s less discoverable than a built-in option.

  3. Implementation flexibility:
    • The conversion could either:

    ◦ Preserve nested tables as HTML (if nested_tables=True), or

    ◦ Use indented Markdown pseudo-tables (if feasible).

I’d love to hear your thoughts on this approach! If acceptable, I can update the PR with these changes.
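
To make the proposed opt-in concrete, here is a small standard-library sketch of how a nested_tables flag could gate the behavior for a single cell. It is an illustration under stated assumptions, not markdownify code: cell_content is a hypothetical function, it only looks at tables that are direct children of the cell, and the default branch drops the inner table to mirror the current behavior described in this PR.

```python
import copy
import xml.etree.ElementTree as ET


def cell_content(td, nested_tables=False):
    """Flatten a <td> to one line; keep directly nested tables only when opted in."""
    parts = [td.text or ""]
    for child in td:
        if child.tag == "table":
            if nested_tables:
                # Opt-in path: keep the inner table verbatim as HTML.
                clone = copy.copy(child)
                clone.tail = None  # the tail is appended separately below
                parts.append(ET.tostring(clone, encoding="unicode", method="html"))
            # Default path: the inner table is dropped.
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return " ".join(" ".join(parts).split())


td = ET.fromstring("<td><p>team</p><table><tr><td>no</td></tr></table></td>")
default_text = cell_content(td)
opt_in_text = cell_content(td, nested_tables=True)
```

With the flag off, existing output stays the same; with it on, the inner table survives as an HTML block inside the cell.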

@ayushxx7

It is important to note that this issue directly affects MarkItDown, the Microsoft app built on markdownify. That makes it important to resolve quickly, especially since the author has already solved most of the problem. Ref: microsoft/markitdown#1248 (comment)

@Wuhall
Author

Wuhall commented May 21, 2025

There is an unexpected conflict in my branch. I will fix it and open a new PR. Ref: #220
