Fix Nested Table Conversion in HTML-to-Markdown Process #219
Fixes matthewwithanm#5 as well as an issue where `<p>foo<p><ul><li>bar</li></ul>` gets converted to `foo * bar` which is not correct
Add newline before and after a markdown list
Remove prefixed and suffixed spaces from inline tags
Nested lists did not work: after a nested list ended, a new line was inserted, which left a large gap before the rest of the parent list. Lists are now prefixed and suffixed with a single newline, and the tests reflect this.
Add workflow for publishing to PyPI.
Create python-publish.yml
…ithanm#182) Signed-off-by: chrispy <[email protected]>
…atthewwithanm#183) Signed-off-by: chrispy <[email protected]>
…handling (matthewwithanm#184) Signed-off-by: chrispy <[email protected]>
Signed-off-by: chrispy <[email protected]>
…ng (matthewwithanm#202) Signed-off-by: chrispy <[email protected]>
…but header row is missing (matthewwithanm#203)
Signed-off-by: chrispy <[email protected]>
* add beautiful_soup_parser option * add Beautiful Soup parser argument to command line --------- Co-authored-by: Vincent Kelleher <[email protected]> Co-authored-by: AlexVonB <[email protected]>
Signed-off-by: chrispy <[email protected]>
@Wuhall - thank you for your contribution! However, I am not sure I agree with this change for the following reasons:
@AlexVonB, what do you think?
@chrispy-snps thanks for your feedback! I completely understand the concern about keeping Markdown output clean and portable. Here’s a compromise that might address both sides:
I’d love to hear your thoughts on this approach! If acceptable, I can update the PR with these changes.
It is important to note that this issue would directly impact the markdownify app launched by Microsoft. Kinda makes it an important issue to resolve quickly, especially since the user has solved most of the problem. Ref: microsoft/markitdown#1248 (comment)
There is an unexpected conflict in my branch. I will fix this conflict and re-PR. Ref: 220
Type: Bug Fix
Problem
The current `process_tag` function fails to properly handle nested tables during HTML-to-Markdown conversion, causing:
Root Cause
• Standard Markdown doesn't support nested tables natively
• Existing implementation forced flattening of all table structures
• Child tables were being parsed as regular table cells
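To make the root cause concrete, here is a minimal sketch of how a converter could notice that a table contains child tables before deciding how to render it. It uses only the standard library's `html.parser`. The names `TableDepthScanner` and `max_table_depth` are hypothetical and are not part of markdownify or of this PR's `process_tag` changes.

```python
from html.parser import HTMLParser


class TableDepthScanner(HTMLParser):
    """Track how deeply <table> elements are nested in an HTML fragment.

    Hypothetical helper for illustration only; not markdownify's API.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open <table> elements
        self.max_depth = 0  # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def max_table_depth(html: str) -> int:
    """Return 0 for no tables, 1 for flat tables, 2 or more for nested tables."""
    scanner = TableDepthScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.max_depth


flat = "<table><tr><td>a</td></tr></table>"
nested = "<table><tr><td><table><tr><td>b</td></tr></table></td></tr></table>"

print(max_table_depth(flat))    # 1
print(max_table_depth(nested))  # 2
```

A depth of 2 or more marks the case the bullets above describe: a child table sitting inside a cell of a parent table, which a plain Markdown table row cannot express.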
Solution
Modified the table conversion logic to:
Implementation Details
Key changes in `process_tag`:
Test Case
Input HTML:
Previous Output (Broken):
Test
New Output (Correct):
Test no name sex pos pos 1 1 1 1 - -- -- ITa 11 2 - -- --- ITb 1