Image links not extracted correctly, resulting in empty ![]() in Markdown output
#1177
Comments
I reviewed the URL you provided and found that some images on the webpage are embedded using the data-src attribute instead of the standard src. For example:

    <img class="rich_pages wxw-img" data-imgfileid="100064212" data-ratio="0.6703703703703704" data-src="https://mmbiz.qpic.cn/mmbiz_png/Z6bicxIx5naLYpcX77tOy3epFKUtjFwCHjnRWPw44V36TFFXSoEibSFichcpMZ6O88a4d9iaYyKrINDJAKpwfHicRCg/640?wx_fmt=png&from=appmsg" data-type="png" data-w="1080" style="height: auto !important;" width="1664"/>

The HTML parsing library is a customized version based on markdownify. To address this issue, considering the highly dynamic nature of HTML content, it may be necessary to implement a custom parsing function to support such cases.

Temporary Workaround

A monkey patch can be applied to replace the convert_img method of _CustomMarkdownify.

Code:

    from markitdown.converters._markdownify import _CustomMarkdownify


    def monkey_patch_convert_img(self, el, text, convert_as_inline=False, **kwargs):
        alt = el.attrs.get("alt", None) or ""
        # Fall back to data-src when src is missing or empty
        src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""
        title = el.attrs.get("title", None) or ""
        title_part = ' "%s"' % title.replace('"', r"\"") if title else ""
        if (
            convert_as_inline
            and el.parent.name not in self.options["keep_inline_images_in"]
        ):
            return alt

        # Remove dataURIs
        if src.startswith("data:") and not self.options["keep_data_uris"]:
            src = src.split(",")[0] + "..."

        return "![%s](%s%s)" % (alt, src, title_part)


    _CustomMarkdownify.convert_img = monkey_patch_convert_img

Core Logic

    src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""

When the src attribute is missing or empty, the value of data-src is used instead.
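To try the fallback in isolation, without installing markitdown, here is a minimal sketch that uses only the standard library. The names _ImgCollector and image_sources are made up for illustration and are not part of markitdown; only the src / data-src fallback mirrors the patch above.

```python
from html.parser import HTMLParser


class _ImgCollector(HTMLParser):
    """Collects the effective URL of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Same fallback as the patch: prefer src, then data-src, then "".
        self.sources.append(attr_map.get("src") or attr_map.get("data-src") or "")


def image_sources(html):
    """Return one URL per <img> tag, in document order."""
    collector = _ImgCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

An image that carries only data-src yields that URL, and an image with neither attribute yields an empty string, which is the case that ends up as ![]() in the Markdown.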
Thanks a lot, it helps!
Description
When using markitdown to convert a WeChat article to Markdown, the image URLs are not being extracted properly. The resulting Markdown output contains image tags like ![]() with empty URLs, causing the images to be missing from the final content.
Steps to Reproduce
Actual Output
Expected Output
Image URLs should be correctly extracted and included in the Markdown output, e.g.:
Environment
OS: macOS
Python version: 3.12
markitdown version: latest
Additional Notes
It seems that the ![]() indicates the Markdown syntax is being applied, but the image URLs are missing. This may be caused by changes in the structure of WeChat article pages, or possibly a case that isn’t currently supported.
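A quick way to confirm the symptom on converted output is to count image tags whose URL part is empty. This is only a sketch; count_empty_images is a hypothetical helper, not something markitdown provides.

```python
import re

# Matches Markdown images with an empty URL, e.g. ![]() or ![alt]( ).
EMPTY_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*\)")


def count_empty_images(markdown):
    """Return how many image tags in the text have no URL."""
    return len(EMPTY_IMAGE.findall(markdown))
```

A non-zero count on the converted article would suggest that some img elements had no usable src attribute.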
Would appreciate any help or fix for this — thanks for the great tool!