Image links not extracted correctly, resulting in empty `![]()` in Markdown output #1177

KevinChen1994 · 2025-04-09T11:06:27Z

Description

When using markitdown to convert a WeChat article to Markdown, the image URLs are not being extracted properly. The resulting Markdown output contains image tags like ![]() with empty URLs, causing the images to be missing from the final content.

Steps to Reproduce

from markitdown import MarkItDown
import requests

md = MarkItDown()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "Accept": "application/json",
}

response = requests.get(
    "https://mp.weixin.qq.com/s/85a2235XkZPOevXW9HTXtg", headers=headers
)

result = md.convert(response)
print(result.text_content)

Actual Output

Star 从2月开始，加速增长：
![]()
微信指数，从2月开始，出现流量突增：
![]()

Expected Output

Image URLs should be correctly extracted and included in the Markdown output, e.g.:

![](https://mmbiz.qpic.cn/...)

Environment

OS: macOS

Python version: 3.12
markitdown version: latest

Additional Notes
The empty `![]()` suggests that the Markdown image syntax is being applied but the image URLs are missing. This may be caused by changes in the structure of WeChat article pages, or it may be a case that isn't currently supported.

Would appreciate any help or fix for this — thanks for the great tool!

WangYuhang-CN · 2025-04-11T09:04:13Z

I reviewed the URL you provided and found that some images on the webpage are embedded using the data-src attribute instead of the standard src. For example:

<img class="rich_pages wxw-img" data-imgfileid="100064212" data-ratio="0.6703703703703704" data-src="https://mmbiz.qpic.cn/mmbiz_png/Z6bicxIx5naLYpcX77tOy3epFKUtjFwCHjnRWPw44V36TFFXSoEibSFichcpMZ6O88a4d9iaYyKrINDJAKpwfHicRCg/640?wx_fmt=png&amp;from=appmsg" data-type="png" data-w="1080" style="height: auto !important;" width="1664"/>

The HTML parsing library is a customized version based on markdownify. Its default convert_img function reads only the src attribute, so images that carry their URL in data-src are converted with an empty URL.
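
To check whether a given page is affected, something like the following may help. It is only a diagnostic sketch using the standard library's html.parser, not part of markitdown; the names LazyImageFinder and find_lazy_images and the example.com URLs are made up for illustration.

```python
from html.parser import HTMLParser


class LazyImageFinder(HTMLParser):
    """Collect URLs of <img> tags that have no usable src but do carry data-src."""

    def __init__(self):
        super().__init__()
        self.lazy_urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> are also routed here by default.
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Treat the image as lazily loaded when src is missing or empty
        # and data-src holds a value.
        if not attr_map.get("src") and attr_map.get("data-src"):
            self.lazy_urls.append(attr_map["data-src"])


def find_lazy_images(html):
    """Return the data-src URLs of all images that lack a usable src."""
    finder = LazyImageFinder()
    finder.feed(html)
    finder.close()
    return finder.lazy_urls


# Hypothetical sample: the first image is lazily loaded, the second is not.
sample = (
    '<p><img data-src="https://example.com/a.png" data-type="png"/></p>'
    '<p><img src="https://example.com/b.png" alt="b"/></p>'
)
print(find_lazy_images(sample))  # ['https://example.com/a.png']
```

If the returned list is non-empty, those are the images that would end up as `![]()` in the converted output.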

Given how much HTML varies from site to site, handling such cases properly may require a custom parsing function.

Temporary Workaround

A monkey patch can be applied to replace the convert_img function to support data-src. However, this is only a temporary solution.

Code:

from markitdown.converters._markdownify import _CustomMarkdownify

def monkey_patch_convert_img(self, el, text, convert_as_inline=False, **kwargs):
    # Same as the original convert_img, except that src falls back to data-src.
    alt = el.attrs.get("alt", None) or ""
    src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""
    title = el.attrs.get("title", None) or ""
    title_part = ' "%s"' % title.replace('"', r"\"") if title else ""

    # In inline contexts, return only the alt text unless the parent tag
    # is listed in keep_inline_images_in.
    if (
        convert_as_inline
        and el.parent.name not in self.options["keep_inline_images_in"]
    ):
        return alt

    # Truncate data URIs unless keep_data_uris is set
    if src.startswith("data:") and not self.options["keep_data_uris"]:
        src = src.split(",")[0] + "..."

    return "![%s](%s%s)" % (alt, src, title_part)


_CustomMarkdownify.convert_img = monkey_patch_convert_img
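
Since the assignment above changes the class for the rest of the process, one option is to scope the patch with a context manager so the original method is restored afterwards. This is only a rough sketch of that idea: temporarily_patched is a hypothetical helper, and DemoConverter is a simplified stand-in used so the example runs without markitdown installed.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_patched(cls, name, replacement):
    """Swap cls.<name> for replacement and restore the original on exit."""
    original = getattr(cls, name)
    setattr(cls, name, replacement)
    try:
        yield
    finally:
        setattr(cls, name, original)


class DemoConverter:
    """Hypothetical stand-in for the real converter class."""

    def convert_img(self, attrs):
        # Mimics the unpatched behavior: only src is consulted.
        return "![](%s)" % (attrs.get("src") or "")


def convert_img_with_fallback(self, attrs):
    # Patched behavior: fall back to data-src when src is missing or empty.
    return "![](%s)" % (attrs.get("src") or attrs.get("data-src") or "")


lazy = {"data-src": "https://example.com/a.png"}

with temporarily_patched(DemoConverter, "convert_img", convert_img_with_fallback):
    inside = DemoConverter().convert_img(lazy)

outside = DemoConverter().convert_img(lazy)

print(inside)   # ![](https://example.com/a.png)
print(outside)  # ![]()
```

With the real class, the same pattern would presumably wrap the md.convert(...) call, passing _CustomMarkdownify, "convert_img", and monkey_patch_convert_img.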

Core Logic

src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""

When the src attribute is missing or empty, fall back to the data-src attribute to retrieve the image URL.
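
To make the behavior of that `or` chain concrete, here is a small standalone sketch that works on a plain dict instead of a BeautifulSoup element. The helper name resolve_img_src and the example.com URLs are hypothetical and used only for illustration.

```python
def resolve_img_src(attrs):
    """Return the first non-empty URL among src and data-src, else ''."""
    # `or` skips None and the empty string alike, so an <img src="">
    # placeholder also falls through to data-src.
    return attrs.get("src") or attrs.get("data-src") or ""


# src wins when it is present and non-empty.
both = resolve_img_src(
    {"src": "https://example.com/a.png", "data-src": "https://example.com/b.png"}
)

# Lazily loaded image: only data-src carries the URL.
lazy_only = resolve_img_src({"data-src": "https://example.com/b.png"})

# Empty src placeholder plus data-src.
empty_src = resolve_img_src({"src": "", "data-src": "https://example.com/b.png"})

# Neither attribute: the result is an empty string, i.e. still ![]().
neither = resolve_img_src({"alt": "no url"})

print(both, lazy_only, empty_src, repr(neither))
```

Note that src still takes priority, so pages that put a placeholder image in src and the real URL in data-src would not be helped by this ordering.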

KevinChen1994 · 2025-04-21T06:34:57Z

Thanks a lot, it helps!

Noah-Zhuhaotian linked a pull request Apr 30, 2025 that will close this issue

Adding support for data-src Attribute #1226

Open
