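Web scraping is one of Python's most popular applications. Data collection, public-opinion analysis, and office automation all commonly rely on fetching web pages. This article shows how to build a simple, efficient scraper with requests + BeautifulSoup and walks through a hands-on example.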
I. Core libraries
1. requests: sending HTTP requests
```bash
pip install requests
```
2. BeautifulSoup: parsing HTML pages
```bash
pip install beautifulsoup4
```
II. Sending a request and getting the page source
```python
import requests

url = "https://quotes.toscrape.com/"
headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(url, headers=headers)
print(response.text[:500])  # print the first 500 characters of the source
```
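A slightly more defensive version of the request above might look like the following sketch. The helper name `fetch_html`, the `DEFAULT_HEADERS` constant, and the 10-second timeout are illustrative assumptions rather than part of the original example; the `timeout` argument and `raise_for_status()` are standard requests features.

```python
import requests

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on network or HTTP errors.

    `fetch_html` is a hypothetical helper name used only for illustration.
    """
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # Turn 4xx/5xx responses into requests.HTTPError instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://quotes.toscrape.com/")` should then return the same source that `response.text` holds above, while failing loudly if the server answers with an error status.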
III. Parsing HTML with BeautifulSoup
```python
from bs4 import BeautifulSoup

# `response` is the object returned by requests.get() in the previous section
soup = BeautifulSoup(response.text, "html.parser")
quotes = soup.find_all("div", class_="quote")

for q in quotes:
    text = q.find("span", class_="text").get_text()
    author = q.find("small", class_="author").get_text()
    print(f"{text} - {author}")
```
Sample output (abbreviated):
```text
“The world as we have created it is a process of our thinking.” - Albert Einstein
“It is our choices, Harry, that show what we truly are...” - J.K. Rowling
```
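One detail worth knowing: `find()` returns `None` when nothing matches, so chaining `.get_text()` directly onto it raises `AttributeError` on pages whose markup differs slightly. The sketch below parses a hand-written HTML snippet (made up for illustration, loosely mimicking the structure used above) and skips incomplete entries. `parse_quotes` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; the second block deliberately lacks an author.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"First quote."</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">"Quote without an author."</span>
</div>
"""


def parse_quotes(html):
    """Return a list of (text, author) tuples, skipping incomplete blocks."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="quote"):
        text_tag = block.find("span", class_="text")
        author_tag = block.find("small", class_="author")
        # find() returns None when nothing matches, so check before using it.
        if text_tag is None or author_tag is None:
            continue
        results.append(
            (text_tag.get_text(strip=True), author_tag.get_text(strip=True))
        )
    return results


print(parse_quotes(SAMPLE_HTML))
```

Only the first block is returned here, because the second one has no author element.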
IV. Hands-on project: scraping the quotes on the first 5 pages
```python
import requests
from bs4 import BeautifulSoup


def scrape_quotes():
    """Collect (text, author) pairs from the first 5 pages."""
    base_url = "https://quotes.toscrape.com/page/{}/"
    all_quotes = []
    for page in range(1, 6):
        res = requests.get(base_url.format(page), timeout=10)
        res.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(res.text, "html.parser")
        for q in soup.select(".quote"):
            text = q.select_one(".text").get_text()
            author = q.select_one(".author").get_text()
            all_quotes.append((text, author))
    return all_quotes


# Save as a txt file
with open("quotes.txt", "w", encoding="utf-8") as f:
    for text, author in scrape_quotes():
        f.write(f"{text} - {author}\n")
```
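A plain-text file with an ad hoc separator is hard to read back reliably if a quote happens to contain that separator. As an alternative, the scraped pairs could be written as CSV with the standard-library `csv` module, which handles commas and quotation marks inside fields. This is a minimal sketch; `save_quotes_csv`, `load_quotes_csv`, and the column names are assumptions chosen for illustration.

```python
import csv


def save_quotes_csv(quotes, path):
    """Write (text, author) pairs to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "author"])
        writer.writerows(quotes)


def load_quotes_csv(path):
    """Read the rows back as a list of (text, author) tuples."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [tuple(row) for row in reader]
```

With the function from this section, usage would be `save_quotes_csv(scrape_quotes(), "quotes.csv")`.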
V. Handling errors and anti-scraping measures
1. Add headers to reduce the chance of being blocked
```python
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://google.com"
}
```
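When many requests share the same headers, one option is to set them once on a `requests.Session` instead of passing `headers=` on every call. The sketch below assumes the same two header values as above; `make_session` is a hypothetical helper name.

```python
import requests


def make_session():
    """Build a Session whose default headers apply to every request it sends."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://google.com",
    })
    return session


session = make_session()
# Every call such as session.get(url, timeout=10) now sends these headers,
# and the session reuses the underlying connection between requests.
print(session.headers["User-Agent"])
```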
2. Use time.sleep to control the request rate
```python
import time

for page in range(1, 6):
    ...  # request and parse the page (omitted)
    time.sleep(1)  # wait 1 second between requests
```
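Delays combine naturally with error handling: a failed request can be retried after a pause rather than aborting the whole run. The following is a minimal sketch under stated assumptions, not a definitive implementation. `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands for any callable that returns page text or raises a `requests.RequestException` (for example, a small wrapper around `requests.get`).

```python
import time

import requests


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on requests errors with a pause in between.

    `fetch` is any callable that returns the page text or raises a
    requests.RequestException; all names here are illustrative.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except requests.RequestException:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            sleep(delay * attempt)  # wait longer after each failure
```

The `sleep` parameter exists only so the waiting behavior can be swapped out, for instance in tests; in normal use the default `time.sleep` applies.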
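VI. Where to go next
| Feature | Recommended tools |
|---|---|
| JS-rendered pages | Selenium / Playwright |
| Concurrent scraping | asyncio + aiohttp |
| IP proxy pool | requests + proxies |
| Scraping frameworks | Scrapy / PySpider |
| Anti-scraping countermeasures | Simulated login / slider CAPTCHA solving |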
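To give a feel for the concurrent direction, here is a sketch of bounded concurrency with `asyncio` alone. It is not a working scraper: `fake_fetch` is a stand-in that only simulates latency, and in a real crawler it would be replaced by an actual HTTP call (for example with aiohttp). `fetch_all` and `max_concurrency` are illustrative names.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP call; returns a label instead of page content."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"body of {url}"


async def fetch_all(urls, fetch=fake_fetch, max_concurrency=3):
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://quotes.toscrape.com/page/{n}/" for n in range(1, 6)]
pages = asyncio.run(fetch_all(urls))
print(len(pages))
```

The semaphore plays the same role as `time.sleep` in the sequential version: it keeps the scraper from sending too many requests to the site at once.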
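VII. Summary
- ✅ requests: a simple, quick way to send HTTP requests
- ✅ BeautifulSoup: a lightweight HTML parser
- 🚫 Best suited to small-scale scraping; not recommended for high-concurrency crawlers
If you are just getting started with web scraping in Python, this combination should cover most static-site scraping needs (roughly 80%, as a rule of thumb).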