optional dependencies #103

gagb opened this issue Dec 17, 2024 · 12 comments

@gagb
Contributor

gagb commented Dec 17, 2024

          Unsure about this - perhaps should be an optional dep?

Originally posted by @casperdcl in #100 (comment)

@afourney
Member

@gagb @casperdcl, yeah, more generally, a lot of these should be optional dependencies.

Ideally we would have something like:

pip install markitdown[ocr,openai,yt_transcript]

and so on, to optionally include some of the more esoteric or heavy dependencies. We can then include or exclude the converters accordingly.

What do you think?
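A minimal sketch of what "include or exclude the converters accordingly" could look like, assuming each converter declares the modules it needs. The converter names and the `CONVERTER_REQUIREMENTS` mapping are hypothetical and not taken from markitdown's code:

```python
from importlib.util import find_spec

# Hypothetical mapping: converter name -> top-level modules it must import.
CONVERTER_REQUIREMENTS = {
    "plain_text": [],                        # no extra dependencies
    "llm_caption": ["openai"],               # would come from an "openai" extra
    "youtube": ["youtube_transcript_api"],   # would come from a "yt_transcript" extra
}


def available_converters(requirements=CONVERTER_REQUIREMENTS):
    """Return the names of converters whose required modules are all installed."""
    enabled = []
    for name, modules in requirements.items():
        # find_spec returns None when a top-level module is not installed.
        if all(find_spec(module) is not None for module in modules):
            enabled.append(name)
    return enabled


print(available_converters())
```

With a bare install only `plain_text` would be listed; installing an extra would make the matching converter appear without any code change.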

@afourney afourney self-assigned this Dec 17, 2024
@gagb
Contributor Author

gagb commented Dec 18, 2024

I like this, but there is a lot of appeal in the simplicity of just running pip install markitdown

@casperdcl

casperdcl commented Dec 18, 2024

Aliases are quite easy to implement...

pip install markitdown[all] could be made identical to pip install markitdown[ocr,llm,yt].

Pretty common Pythonicity.
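As a rough illustration of the alias idea, assuming the extras were declared in Python (for example a dict passed to setuptools' `extras_require`; markitdown itself uses hatch, where the same table would live in `pyproject.toml`). The package chosen for `ocr` is a placeholder, not something named in this thread:

```python
# Hypothetical extras; the OCR package in particular is only a placeholder.
extras = {
    "ocr": ["pytesseract"],
    "llm": ["openai"],
    "yt": ["youtube-transcript-api"],
}

# "all" is simply the de-duplicated union of every other extra.
extras["all"] = sorted({dep for deps in extras.values() for dep in deps})

print(extras["all"])
# → ['openai', 'pytesseract', 'youtube-transcript-api']
```

Because `all` is computed rather than typed out, adding a new extra cannot leave the alias out of date.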

btw you should probably rename this issue "optional dependencies" or similar. Also you can use a markdown quote block (>) rather than code block in the description 😉

@gagb gagb changed the title Unsure about this - perhaps should be an optional dep? optional dependencies Dec 18, 2024
@SigireddyBalasai
Contributor

I think we can just make all the dependencies optional and have the script install them when needed, the way ultralytics does it: there, if a package is needed it is installed at runtime, and updates also happen at runtime.

@casperdcl

casperdcl commented Dec 18, 2024

Whoa, at most you could do:

try:
    import openai
except ImportError as exc:
    raise ImportError("please `pip/conda install openai` or `pip install markitdown[llm]`") from exc

Meanwhile side-effects like this are highly discouraged:

import os
import sys

try:
    import openai
except ImportError:
    # Discouraged: installs a package as a side effect of an import
    os.system(f"{sys.executable} -m pip install openai")
    import openai
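The first pattern above can be wrapped in a small helper so that every optional import fails with the same actionable message. This is only a sketch: the `require` function and the extra names passed to it are hypothetical and not part of markitdown:

```python
import importlib


def require(module_name, extra):
    """Import an optional dependency, or fail with an actionable message.

    `extra` is the (hypothetical) markitdown extra that would provide it.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with `pip install markitdown[{extra}]`"
        ) from exc


# Usage: call this inside the converter that needs the package, so that
# merely importing the library never fails.
# openai = require("openai", "llm")
```

Nothing is installed on the user's behalf; the user only learns which command to run.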

@SigireddyBalasai
Contributor

SigireddyBalasai commented Dec 18, 2024

I think something like:

import importlib
import subprocess
import sys
from importlib import metadata


def run_pip(*args):
    """Run pip in a subprocess and return its exit code.

    pip.main() was removed in pip 10, so pip is invoked as `python -m pip`.
    """
    return subprocess.call([sys.executable, "-m", "pip", *args])


def check_and_install_module(module_name, check_for_updates=False):
    """
    Check if a Python module is installed. Optionally check and perform updates.

    Assumes the import name is the same as the distribution name on PyPI.

    Args:
        module_name (str): Name of the module to check and install
        check_for_updates (bool, optional): Whether to check and perform updates. Defaults to False.

    Returns:
        dict: A dictionary with installation/update status
    """
    try:
        # Try to import the module
        importlib.import_module(module_name)
        print(f"Module {module_name} is already installed.")
    except ImportError:
        print(f"Module {module_name} not found. Attempting to install...")
        try:
            # Use pip to install the module
            if run_pip("install", module_name) == 0:
                # Refresh the import caches, then verify the module is importable
                importlib.invalidate_caches()
                importlib.import_module(module_name)
                print(f"Successfully installed {module_name}")
                return {'installed': True, 'updated': False}
            print(f"Failed to install {module_name}")
        except Exception as e:
            print(f"Failed to install {module_name}. Error: {e}")
        return {'installed': False, 'updated': False}

    # Check for updates if requested
    if check_for_updates:
        try:
            # Get current installed version
            current_version = metadata.version(module_name)

            # Show available updates (output only; the result is not parsed)
            run_pip("list", "--outdated")

            # Perform update
            print(f"Updating {module_name}...")
            if run_pip("install", "--upgrade", module_name) == 0:
                # Get new version after update
                new_version = metadata.version(module_name)
                print(f"Updated {module_name} from {current_version} to {new_version}")
                return {
                    'installed': True,
                    'updated': True,
                    'old_version': current_version,
                    'new_version': new_version
                }
            print(f"Failed to update {module_name}")
        except Exception as update_error:
            print(f"Error checking/updating {module_name}: {update_error}")

    return {'installed': True, 'updated': False}


# Example usage
if __name__ == "__main__":
    # Check and install pandas
    print(check_and_install_module('pandas'))

    # Check, install, and update requests
    print(check_and_install_module('requests', check_for_updates=True))

@robfitzgerald

👍 to the idea of using optional dependencies. I wanted to try using markitdown as a global install (pip install -u markitdown) in my base Python environment, but when I did that, I got the whole kitchen sink:

Installing collected packages: pytz, pydub, puremagic, XlsxWriter, tzdata, speechrecognition, soupsieve, sniffio, pydantic-core, Pillow, pathvalidate, numpy, lxml, jiter, h11, et-xmlfile, defusedxml, cobble, annotated-types, youtube-transcript-api, python-pptx, pydantic, pandas, openpyxl, mammoth, httpcore, cryptography, beautifulsoup4, anyio, pdfminer-six, markdownify, httpx, openai, markitdown

There are a whole bunch of users out there who won't have the system privileges to bring in this many dependencies (or who will stay away because it doesn't make sense). It seems like it should be possible to install the minimum set for the minimum stated functionality: "MarkItDown is a utility for converting various files to Markdown."

@Zahlii

Zahlii commented Jan 22, 2025

This also increases exposure to a) supply chain attacks and b) CVEs in the whole repo.

Just today, I added markitdown to my repo running safety checks, and got hit with a CVE:

youtube-transcript-api (==0.6.2)  [1 vulnerability found]                      
  -> Vuln ID 74190:                                                             
     Affected versions of youtube_transcript_api are vulnerable to XML External
     Entity (XXE) Injection (CWE-611). T...                                     
Update youtube-transcript-api (==0.6.2) to youtube-transcript-api==0.6.3 to fix

Now, here I can potentially just add a constraint on the dependency, but there will not always be "quick fixes", which prevents me from reliably using this library in anything production-grade. Additionally, when working with any kind of Docker setup or container registry, every additional dependency and every additional MB potentially translates to a LOT of extra cost. On top of that, sometimes I may want to make sure that a video is not accidentally leaked to a third-party API when using markitdown.
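For this particular advisory, the constraint would be a requirement specifier such as `youtube-transcript-api>=0.6.3` in a requirements or constraints file. The same check can be sketched at runtime; the helper names below are made up for illustration, and the version parsing is deliberately simplistic (plain dotted releases only, not full PEP 440):

```python
from importlib import metadata


def parse_release(version):
    """Parse a plain dotted release such as '0.6.2' into a tuple of ints.

    Pre-release or local version suffixes are not handled.
    """
    return tuple(int(part) for part in version.split("."))


def is_patched(installed, minimum="0.6.3"):
    """True if `installed` is at least the first fixed release."""
    return parse_release(installed) >= parse_release(minimum)


def check_installed(dist_name="youtube-transcript-api", minimum="0.6.3"):
    """Return True/False for an installed distribution, or None if it is absent."""
    try:
        return is_patched(metadata.version(dist_name), minimum)
    except metadata.PackageNotFoundError:
        return None


print(is_patched("0.6.2"), is_patched("0.6.3"))
# → False True
```

Comparing integer tuples avoids the string-comparison trap where "0.10.0" would sort before "0.6.3".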

@AdrianVollmer

openai is only used in the tests. Markitdown is already using hatch, so it should create a test env and require openai there: https://hatch.pypa.io/1.12/config/environment/overview/#dependencies

Maybe there are even more dependencies, haven't checked in detail.

@afourney
Member

Yeah, I want to move to optional dependencies ASAP. Relatedly, the latest version in main (not PyPI) also supports 3rd party extensions, minimizing -- I hope -- the need for the kitchen sink.

So this is a known problem, and one I'm very keen to solve. The current status quo is a consequence of having lifted the code out of another (also experimental) project -- namely Magentic One. It hasn't yet been sufficiently generalized.

@afourney
Member

> This also increases exposure to a) supply chain attacks and b) CVEs in the whole repo.
>
> Just today, I added markitdown to my repo running safety checks, and got hit with a CVE:
>
> youtube-transcript-api (==0.6.2)  [1 vulnerability found]
>   -> Vuln ID 74190:
>      Affected versions of youtube_transcript_api are vulnerable to XML External
>      Entity (XXE) Injection (CWE-611). T...
> Update youtube-transcript-api (==0.6.2) to youtube-transcript-api==0.6.3 to fix
>
> Now, here I can potentially just add a constraint on the dependency, but there will not always be "quick fixes", which prevents me from reliably using this library in anything production-grade. Additionally, when working with any kind of Docker setup or container registry, every additional dependency and every additional MB potentially translates to a LOT of extra cost. On top of that, sometimes I may want to make sure that a video is not accidentally leaked to a third-party API when using markitdown.

Yes, this keeps me up at night. I want to make a series of breaking changes for 0.0.2, and I will include a move to optional dependencies in that.

@afourney
Member

afourney commented Mar 1, 2025

Folks, I have a potential design here. Let me know what you think before I expand it to all the converters we provide. (This mechanism is not meant to be used with the plugin extensions; for those, plugins are responsible for their own dependencies.)

#1079


7 participants