A Python script for extracting the file name from a given file URL

小屁孩大帅-杨一凡

Published 2025-05-09 18:01:34

License: CC 4.0 BY-SA

Tags: python, database, programming language

Article link: https://blog.csdn.net/weixin_41722928/article/details/147846836

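Below is a Python script that extracts the file name from a file URL. It handles several common URL formats and edge cases, such as a missing path, a file name passed in the query string, and percent-encoded characters: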

from urllib.parse import urlparse, parse_qs, unquote
import posixpath

def extract_filename(url, default_filename="file"):
    """
    Extract a file name from a URL.

    Args:
        url (str): URL of the file
        default_filename (str): name to return when no file name is found

    Returns:
        str: the extracted file name, or the default name
    """
    try:
        # Parse the URL; urlparse already separates the query string
        # and the fragment from the path, so no manual stripping is needed
        parsed_url = urlparse(url)

        # Take the last path segment and decode percent-encoding (%20 -> space)
        filename = unquote(posixpath.basename(parsed_url.path))

        # If the path has no file-like segment (empty, or no extension),
        # fall back to a file name in the query string (e.g. ?file=example.txt)
        if not filename or '.' not in filename:
            query_values = parse_qs(parsed_url.query).get('file')
            if query_values:
                # Keep only the base name so the value cannot carry a directory path
                filename = posixpath.basename(query_values[0]) or filename

        # Validate: the name must contain at least one letter or digit
        # (str.isalnum also accepts non-ASCII letters in decoded names)
        if not filename or not any(ch.isalnum() for ch in filename):
            return default_filename

        return filename

    except Exception as e:
        print(f"Error while extracting file name: {e}")
        return default_filename

# Usage example
if __name__ == "__main__":
    test_urls = [
        "https://example.com/files/report.pdf",
        "http://example.com/download?file=data.csv",
        "https://example.com/path/",
        "https://example.com",
        "ftp://server/file.txt",
        "https://example.com/file%20name.txt",
        "https://example.com/image.jpg?size=large#section1"
    ]

    for url in test_urls:
        filename = extract_filename(url)
        print(f"URL: {url}\nExtracted file name: {filename}\n")
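
To see which parts of a URL matter for file-name extraction, the following minimal sketch prints the pieces that the standard library returns for one sample URL. The helper name `describe_url` and the sample URL are assumptions made for illustration only; they are not part of the script above.

```python
from urllib.parse import urlparse, parse_qs, unquote
import posixpath

def describe_url(url):
    """Hypothetical helper: return the URL pieces relevant to file-name extraction."""
    parts = urlparse(url)
    return {
        "path": parts.path,                      # raw path, still percent-encoded
        "query": parse_qs(parts.query),          # query string as a dict of lists
        "fragment": parts.fragment,              # text after '#'
        "basename": unquote(posixpath.basename(parts.path)),  # decoded last segment
    }

# Sample URL chosen for illustration (an assumption, not from the article)
info = describe_url("https://example.com/docs/file%20name.txt?size=large&file=a.csv#top")
print(info["path"])      # /docs/file%20name.txt
print(info["query"])     # {'size': ['large'], 'file': ['a.csv']}
print(info["fragment"])  # top
print(info["basename"])  # file name.txt
```

Because `urlparse` keeps the query string and the fragment out of `path`, the last path segment can be taken directly, and `unquote` is the standard way to turn `%20` back into a space.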

This script has the following features: