[Python Crawler CASE] Fetching a News Site's Article List with Selenium + BeautifulSoup

Latest recommended article published on 2025-05-12 23:07:58


License: CC 4.0 BY-SA


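This article shows how to use the Selenium and BeautifulSoup libraries to scrape news headlines and links from the Tencent News home page. It explains how to deal with dynamically loaded content, extract the required data, and save it to an Excel file.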


I. Requirements

Get the news headlines and links from the home page of the Tencent News site (https://news.qq.com/).
Press F12 to open the developer tools and inspect the page source.

II. Implementation

Step 1: Get the page source

One option is to fetch the source with the requests library:

import requests
res = requests.get('http://news.qq.com/')

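However, the page is rendered by JavaScript in the browser, so the source returned this way does not match what you see in the developer tools. requests is therefore not enough here, and the webdriver from the Selenium library is needed instead: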

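from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://news.qq.com/')
# Option 1: run a JavaScript command that returns the rendered document
html = driver.execute_script("return document.documentElement.outerHTML")
# Option 2: locate the root element and read its outerHTML
# (Selenium 4 removed find_element_by_xpath; use find_element with By.XPATH)
# html = driver.find_element(By.XPATH, "//*").get_attribute("outerHTML")
driver.quit()  # end the browser session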

Alternatively, save the page source manually and load it as a string or a file object:

html=open(r'G:\temp files\qq.htm')

Step 2: Select elements with CSS selectors and parse the data

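from bs4 import BeautifulSoup
# Pass in the page source (a string or an open file)
soup = BeautifulSoup(html, 'html.parser')

# Extract each article's title and link into a list of dicts
newsary = []
for news in soup.select('.detail'):
    link = news.select_one('a[href]')  # first link inside the block
    if link is not None:
        newsary.append({'title': link.text.strip(), 'url': link['href']})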

# Build a DataFrame and save it to Excel (to_excel needs the openpyxl package)
import pandas
newsdf = pandas.DataFrame(newsary)
newsdf.to_excel(r'G:\temp files\qqnews.xlsx')
newsdf
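
To try the parsing step without opening a browser, the same logic can be run on a small inline HTML string. This is only a sketch: the sample markup is made up to mimic a list of `.detail` blocks, and `extract_news` is a hypothetical helper name. The real page's markup may differ, so check it in the developer tools first.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics a list of news entries (an assumption,
# not copied from the real site).
SAMPLE_HTML = """
<ul>
  <li><div class="detail"><h3><a href="https://example.com/a1">First headline</a></h3></div></li>
  <li><div class="detail"><h3><a href="https://example.com/a2"> Second headline </a></h3></div></li>
  <li><div class="detail"><p>No link here</p></div></li>
</ul>
"""


def extract_news(html):
    """Return [{'title': ..., 'url': ...}] for each .detail block with a link."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.select(".detail"):
        link = block.select_one("a[href]")  # first anchor that has an href
        if link is None:
            continue  # skip blocks without a link instead of raising an error
        items.append({"title": link.get_text(strip=True), "url": link["href"]})
    return items


print(extract_news(SAMPLE_HTML))
```

Using `select_one` with a `None` check avoids the `IndexError` that `select('a')[0]` raises when a block contains no link.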

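
If the scraped links turn out to be relative, protocol-relative, or repeated, they can be tidied before the DataFrame is built. The sketch below uses only the standard library. `clean_news` is a hypothetical helper, and the sample entries are invented for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.qq.com/"


def clean_news(items, base_url=BASE_URL):
    """Make URLs absolute and drop entries whose URL was already seen."""
    seen = set()
    cleaned = []
    for item in items:
        # urljoin resolves "//host/path" and "/path" against the base URL
        url = urljoin(base_url, item["url"])
        if url in seen:
            continue
        seen.add(url)
        cleaned.append({"title": item["title"].strip(), "url": url})
    return cleaned


# Invented sample entries, not real scraped data
sample = [
    {"title": " Headline A ", "url": "//new.qq.com/a.html"},
    {"title": "Headline A", "url": "https://new.qq.com/a.html"},
    {"title": "Headline B", "url": "/b.html"},
]
print(clean_news(sample))
```

The cleaned list can then be passed to `pandas.DataFrame` in place of `newsary`.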