If I use markitdown on a link to a README.md file on GitHub, it currently includes the entire header and footer navigation of GitHub's web interface, when I really only want the contents of the README.md. Should markitdown be able to convert the GitHub URL into a raw CDN URL and put it into the conversion pipeline, pass it through unaltered, or keep the current behavior?
I'm not sure this is markitdown's responsibility, and it could set a precedent for loads of URL manipulations. I would suggest a small Python helper function that preprocesses your input before passing it to the markitdown class. Here's a simple, untested function you could use or start from.
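    import requests

    def fetch_github_file_content(github_url):
        # Rewrite a github.com "blob" URL into its raw.githubusercontent.com form.
        raw_url = github_url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")
        response = requests.get(raw_url, timeout=10)
        if response.status_code == 200:
            return response.text
        else:
            # Falls through and returns None when the request fails.
            print("Failed to fetch data. Status code:", response.status_code)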
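The string replacement in the helper above rewrites any URL that contains "github.com" or "/blob/", including ones that are not file views. A slightly stricter variant might parse the URL first and only rewrite paths shaped like `/<owner>/<repo>/blob/<ref>/<path>`. The sketch below is one way to do that with the standard library only; the function name `to_raw_github_url` and the `owner/repo` example URL are hypothetical, and it is untested against the full range of GitHub URL shapes.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_github_url(url: str) -> str:
    """Return the raw.githubusercontent.com URL for a github.com blob URL.

    Anything that does not look like
    https://github.com/<owner>/<repo>/blob/<ref>/<path> is returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        return url

    segments = parts.path.strip("/").split("/")
    # Expect: owner, repo, "blob", ref, then at least one path segment.
    if len(segments) < 5 or segments[2] != "blob":
        return url

    # Drop the "blob" segment; the rest of the path carries over as-is.
    raw_path = "/" + "/".join(segments[:2] + segments[3:])
    # The fragment (e.g. "#L10") is dropped because it has no meaning for raw content.
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, parts.query, ""))
```

The returned URL could then be fetched with `requests.get` as in the helper above, or handed to markitdown directly if its converter accepts URLs in your version. Inputs that are not blob URLs pass through untouched, so the function should be safe to apply to every input.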