Office Automation with Python (Automating PDF File Processing)

syblogs

Last modified: 2024-04-08 16:14:59

License: CC 4.0 BY-SA

Tags: automation, pdf, python

First published: 2024-03-28 16:17:55

Article link: https://blog.csdn.net/ysblogs/article/details/136911097

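Automating PDF file processing

Installing Google Chrome and the ChromeDriver browser driver

(I) Batch-downloading PDF files

1. Crawling multi-page content with the Selenium module

2. Downloading PDF files with the Selenium module

3. Downloading and saving web pages with the urllib module

4. Checking, downloading, and saving with the urllib and Selenium modules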

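Automating PDF file processing

Chrome: the browser.

Selenium: a suite of tools for browser automation, commonly used as a complete automated testing framework.

WebDriver: a key component of Selenium that controls and drives the browser.

ChromeDriver: an implementation of WebDriver dedicated to controlling Google Chrome.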

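Installing Google Chrome and the ChromeDriver browser driver

From Chrome 73 onward, ChromeDriver uses the same version numbering as Chrome, so install the ChromeDriver release whose version number matches your installed Chrome.

Available versions are listed on the "Chrome for Testing availability" page.

“Screenshot: installation path”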
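
As a quick sanity check on the version rule above, the following sketch compares the `major.minor.build` prefix of a Chrome version string with that of a ChromeDriver version string. The helper names (`build_prefix`, `versions_match`, `report_versions`) are invented for this illustration, and comparing the first three components is an assumption based on ChromeDriver's version selection guidance rather than something the article states. `report_versions` reads the values Selenium exposes in `browser.capabilities` and needs a working Chrome and ChromeDriver installation.

```python
def build_prefix(version: str) -> str:
    """Return the 'major.minor.build' part of a Chrome-style version string."""
    # chromedriverVersion usually looks like "123.0.6312.86 (<commit info>)",
    # so keep only the first whitespace-separated token.
    number = version.split()[0]
    return ".".join(number.split(".")[:3])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when browser and driver share the same major.minor.build prefix."""
    return build_prefix(chrome_version) == build_prefix(driver_version)


def report_versions() -> None:
    """Launch Chrome through Selenium and print both version strings."""
    # Imported here so the pure helpers above work without Selenium installed.
    from selenium import webdriver

    browser = webdriver.Chrome()
    try:
        caps = browser.capabilities
        chrome_version = caps["browserVersion"]
        driver_version = caps["chrome"]["chromedriverVersion"]
        print("Chrome:      ", chrome_version)
        print("ChromeDriver:", driver_version)
        print("Match:       ", versions_match(chrome_version, driver_version))
    finally:
        browser.quit()
```

Calling `report_versions()` after installation is one way to confirm that the two components line up before running the scripts that follow.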

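(I) Batch-downloading PDF files

1. Crawling multi-page content with the Selenium module

Example: downloading the PDF announcements of listed companies from CNINFO (巨潮资讯网, www.cninfo.com.cn).

“Screenshot: getting the total number of announcements”

“Screenshot: locating the [Next page] button”

“Screenshot: getting the announcement titles and URLs”

"Complete Python program"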

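In Selenium 4, the find_element_by_* helper methods were deprecated and later removed (in Selenium 4.3.0), so the old lookup style no longer works. Locate elements with find_element together with the By class instead.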
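
To make the change concrete, here is a small sketch that rewrites legacy calls in a piece of source text into the `find_element(By.X, ...)` form. The `modernize` helper and the `LEGACY_SUFFIX_TO_BY` table are hypothetical names written for this illustration; they are not part of Selenium, and the regex only handles straightforward call sites.

```python
import re

# Suffix of the removed find_element_by_* helpers -> matching By attribute
LEGACY_SUFFIX_TO_BY = {
    "id": "By.ID",
    "name": "By.NAME",
    "xpath": "By.XPATH",
    "link_text": "By.LINK_TEXT",
    "partial_link_text": "By.PARTIAL_LINK_TEXT",
    "tag_name": "By.TAG_NAME",
    "class_name": "By.CLASS_NAME",
    "css_selector": "By.CSS_SELECTOR",
}

# Matches "find_element_by_xpath(" as well as "find_elements_by_xpath("
_LEGACY_CALL = re.compile(r"\bfind_element(s?)_by_(\w+)\(")


def modernize(source: str) -> str:
    """Rewrite legacy find_element(s)_by_* calls to the By-based form."""

    def repl(match: re.Match) -> str:
        plural, suffix = match.group(1), match.group(2)
        by = LEGACY_SUFFIX_TO_BY.get(suffix)
        if by is None:
            # Unknown suffix: leave the call untouched
            return match.group(0)
        return f"find_element{plural}({by}, "

    return _LEGACY_CALL.sub(repl, source)


old = "browser.find_element_by_xpath('//button').click()"
print(modernize(old))
```

The rewritten call still needs `from selenium.webdriver.common.by import By` at the top of the script, as in the complete program.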

# Use Selenium to simulate mouse clicks on the "next page" button
import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# 1. Get the total number of announcements and the number of result pages
browser = webdriver.Chrome()
url = "http://www.cninfo.com.cn/new/fulltextSearch?notautosubmit=&keyWord=理财"
browser.get(url)
time.sleep(5)  # give the dynamically rendered results time to load
data = browser.page_source
p_count = '<span class="total-box" style="">约 (.*?) 条'
count = re.findall(p_count, data)[0]
pages = (int(count) + 9) // 10  # 10 results per page, rounded up for the last partial page

# 2. Use Selenium to click the "next page" button
datas = [data]
# One click only, as a demonstration; range(pages - 1) would walk through every page
for i in range(1):
    browser.find_element(
        By.XPATH,
        '//*[@id="fulltext-search"]/div[2]/div/div/div[3]/div[3]/div[2]/div/button[2]',
    ).click()
    time.sleep(3)  # wait for the next page to render
    data = browser.page_source
    datas.append(data)
    time.sleep(3)

# 3. Join the collected page sources into one string
alldata = "".join(datas)
browser.quit()

# 4. Regular expressions for the announcement titles and URLs
p_title = '<span title="" class="r-title">(.*?)</a>'
p_href = '<a target="_blank" href="(.*?)".*?<span title='

# 5. Apply both patterns to alldata, which holds the source of every collected page
title = re.findall(p_title, alldata)
href = re.findall(p_href, alldata)

# 6. Clean the scraped data
for i in range(len(title)):
    title[i] = re.sub("<.*?>", "", title[i])  # strip leftover HTML tags
    href[i] = "http://www.cninfo.com.cn" + href[i]
    href[i] = re.sub("amp;", "", href[i])  # turn "&amp;" back into "&"
    print(str(i + 1) + "." + title[i])
    print(href[i])
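
The parsing steps of the program above (reading the total, working out the page count, and extracting and cleaning titles and links) can also be tried without a browser. The sketch below wraps them in two functions and runs them on a hand-written HTML fragment. `SAMPLE_HTML` is made up to fit the article's regular expressions and is not real CNINFO output; the function names and the `BASE_URL` constant are likewise illustrative assumptions.

```python
import re

BASE_URL = "http://www.cninfo.com.cn"  # assumed site root used to complete relative links
P_COUNT = '<span class="total-box" style="">约 (.*?) 条'
P_TITLE = '<span title="" class="r-title">(.*?)</a>'
P_HREF = '<a target="_blank" href="(.*?)".*?<span title='


def page_count(html: str, per_page: int = 10) -> int:
    """Read the result total from the page and round up to whole pages."""
    total = int(re.findall(P_COUNT, html)[0])
    return (total + per_page - 1) // per_page


def extract_announcements(html: str) -> list:
    """Return (title, url) pairs with HTML tags and '&amp;' cleaned up."""
    titles = re.findall(P_TITLE, html)
    hrefs = re.findall(P_HREF, html)
    results = []
    # zip stops at the shorter list, so a missed match cannot raise IndexError
    for title, href in zip(titles, hrefs):
        clean_title = re.sub("<.*?>", "", title)
        clean_url = BASE_URL + href.replace("&amp;", "&")
        results.append((clean_title, clean_url))
    return results


# Hand-written fragment shaped like the markup the patterns expect (hypothetical)
SAMPLE_HTML = (
    '<span class="total-box" style="">约 95 条</span>'
    '<a target="_blank" href="/new/disclosure/detail?orgId=1&amp;announcementId=2">'
    '<span title="" class="r-title">Notice on <em>wealth management</em> products</span></a>'
)

print(page_count(SAMPLE_HTML))
for number, (title, url) in enumerate(extract_announcements(SAMPLE_HTML), start=1):
    print(f"{number}.{title}")
    print(url)
```

Keeping the parsing separate from the browser session makes it easier to adjust the patterns when the site's markup changes.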

"Screenshot: program output"

2. Downloading PDF files with the Selenium module

On the results page for the search term "理财" (wealth management):
http://www.cninfo.com.cn/new/fulltextSearch?notautosubmit=&keyWord=理财
click any announcement title to open the download page for that announcement's PDF. The URL changes to:
http://www.cninfo.com.cn/new/disclosure/detail?orgId=9900014267&announcementId=1219372722&announcementTime=2024-03-22
To download the PDF automatically, use Selenium to simulate a click on the page's "公告下载" (Download announcement) button.
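
The detail-page URL carries the announcement's identifiers in its query string, which is handy when you need a name for the saved file. A minimal sketch using only the standard library follows; the function names and the `<date>_<id>.pdf` naming scheme are assumptions made for this illustration.

```python
from urllib.parse import parse_qs, urlsplit


def announcement_params(detail_url: str) -> dict:
    """Return the query parameters of a detail-page URL as a flat dict."""
    query = parse_qs(urlsplit(detail_url).query)
    # parse_qs yields a list per key; each key appears once in these URLs
    return {key: values[0] for key, values in query.items()}


def pdf_filename(detail_url: str) -> str:
    """Build a file name from the announcement date and id (hypothetical scheme)."""
    params = announcement_params(detail_url)
    return f"{params['announcementTime']}_{params['announcementId']}.pdf"


detail_url = (
    "http://www.cninfo.com.cn/new/disclosure/detail"
    "?orgId=9900014267&announcementId=1219372722&announcementTime=2024-03-22"
)
print(announcement_params(detail_url))
print(pdf_filename(detail_url))
```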

Inspect the page source, then right-click the element of the "公告下载" button and copy its XPath:
//*[@id="noticeDetail"]/div/div[1]/div[3]/div[1]/button
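
With that XPath, the click itself might look like the sketch below. It waits for the button with `WebDriverWait` instead of a fixed sleep; that is a substitution made here, not necessarily what the article's own script does. The names `click_download` and `download_announcement` and the 5-second pause after the click are assumptions for illustration only.

```python
import time

# XPath of the "公告下载" button, as copied from the page above
DOWNLOAD_BUTTON_XPATH = '//*[@id="noticeDetail"]/div/div[1]/div[3]/div[1]/button'


def click_download(browser, timeout: float = 10) -> None:
    """Wait until the download button is clickable, then click it."""
    # Imported here so this module can be loaded without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    button = WebDriverWait(browser, timeout).until(
        EC.element_to_be_clickable((By.XPATH, DOWNLOAD_BUTTON_XPATH))
    )
    button.click()


def download_announcement(detail_url: str) -> None:
    """Open one announcement detail page and click its download button."""
    from selenium import webdriver

    browser = webdriver.Chrome()
    try:
        browser.get(detail_url)
        click_download(browser)
        time.sleep(5)  # crude pause so the browser can react before it is closed
    finally:
        browser.quit()
```

Calling `download_announcement` with a detail-page URL such as the one shown earlier performs the click; what Chrome then does with the file depends on its download and security settings.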
"
This file is dangerous, so Chrome has blocked it
"