Python Web Scraping in Practice: Collecting Web Data with Requests + BeautifulSoup (Complete Project Included)

Web scraping is one of the most popular uses of Python: data collection, opinion monitoring, and office automation all depend on fetching web pages. This article shows how to build a simple and efficient scraper with requests + BeautifulSoup, ending with a hands-on project.


I. The Core Libraries

1. requests: send HTTP requests


pip install requests

2. BeautifulSoup: parse HTML pages


pip install beautifulsoup4


II. Sending a Request and Getting the Page Source


import requests

url = "https://quotes.toscrape.com/"
headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers)
print(response.text[:500])  # print the first part of the page source
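Before parsing, it is worth confirming that the request actually succeeded; a minimal check (not part of the original snippet) might look like this:

# Raise an exception for 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
print(response.status_code)   # e.g. 200
print(response.encoding)      # text encoding detected by requests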


III. Parsing the HTML with BeautifulSoup


from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

quotes = soup.find_all("div", class_="quote")
for q in quotes:
    text = q.find("span", class_="text").get_text()
    author = q.find("small", class_="author").get_text()
    print(f"{text} —— {author}")

Sample output:


“The world as we have created it is a process of our thinking.” —— Albert Einstein
“It is our choices, Harry, that show what we truly are...” —— J.K. Rowling
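For reference, the same extraction can also be written with CSS selectors (select / select_one), which is the style the project in the next section uses:

# Equivalent extraction using CSS selectors instead of find/find_all
for q in soup.select("div.quote"):
    text = q.select_one("span.text").get_text()
    author = q.select_one("small.author").get_text()
    print(f"{text} —— {author}")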


IV. Hands-On Project: Scraping Quotes from the First 5 Pages


def scrape_quotes():
    base_url = "https://quotes.toscrape.com/page/{}/"
    all_quotes = []
    for page in range(1, 6):
        res = requests.get(base_url.format(page))
        soup = BeautifulSoup(res.text, "html.parser")
        for q in soup.select(".quote"):
            text = q.select_one(".text").get_text()
            author = q.select_one(".author").get_text()
            all_quotes.append((text, author))
    return all_quotes

# Save the results to a txt file
with open("quotes.txt", "w", encoding="utf-8") as f:
    for text, author in scrape_quotes():
        f.write(f"{text} —— {author}\n")
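The project above hard-codes 5 pages. If you want every page, one option (a sketch, not part of the original project) is to keep requesting pages until one comes back without any quotes:

import requests
from bs4 import BeautifulSoup

def scrape_all_quotes():
    """Request successive pages until one contains no quotes."""
    all_quotes = []
    page = 1
    while True:
        res = requests.get(f"https://quotes.toscrape.com/page/{page}/")
        soup = BeautifulSoup(res.text, "html.parser")
        quotes = soup.select(".quote")
        if not quotes:  # past the last page
            break
        for q in quotes:
            all_quotes.append((q.select_one(".text").get_text(),
                               q.select_one(".author").get_text()))
        page += 1
    return all_quotes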


V. Handling Errors and Anti-Scraping Measures

1. Add headers to avoid getting blocked


headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://google.com"
}

2. Use time.sleep to throttle the request rate


import time

for page in range(1, 6):
    ...
    time.sleep(1)  # pause for 1 second between requests
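The section title also promises error handling: a common pattern is to set a timeout and catch requests exceptions so that one failed page does not crash the whole crawl. A sketch (the fetch helper and retry count are illustrative, not from the original article):

import requests

def fetch(url, headers=None, retries=3):
    """Fetch a URL, retrying a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            res = requests.get(url, headers=headers, timeout=10)
            res.raise_for_status()
            return res
        except requests.RequestException as exc:
            print(f"Request failed ({exc}), attempt {attempt}/{retries}")
    return None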


VI. Recommended Next Steps

Feature and recommended library:

  • JS-rendered pages: Selenium / Playwright
  • Concurrent crawling: asyncio + aiohttp
  • IP proxy pool: requests + proxy
  • Scraping frameworks: Scrapy / PySpider
  • Anti-scraping handling: simulated login / slider-CAPTCHA solving
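As a small taste of the proxy row above: requests accepts a proxies mapping, so routing a request through a proxy looks roughly like this (the address below is a placeholder, not a working proxy):

import requests

proxies = {
    "http": "http://127.0.0.1:8080",   # placeholder proxy address
    "https": "http://127.0.0.1:8080",  # same placeholder for HTTPS traffic
}

res = requests.get("https://quotes.toscrape.com/",
                   proxies=proxies, timeout=10)
print(res.status_code)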


VII. Summary

  • ✅ requests: a simple, fast way to send web requests

  • ✅ BeautifulSoup: a lightweight HTML parsing tool

  • 🚫 Suited to small-scale scraping; not recommended for high-concurrency crawlers

If you are just getting started with Python web scraping, this combination is enough to handle 80% of static-site collection needs.
