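Web crawlers:
A crawler is a script that imitates a browser and fetches web page content automatically.
The main tools are the browser's built-in network capture (the F12 developer tools), the requests module, the BeautifulSoup module, and the re module.
I. Disguise
1. Why disguise is needed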
import requests
url='http://www.baidu.com'
header={'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 Edg/94.0.992.38'}
response=requests.get(url)
response1=requests.get(url,headers=header)
print(len(response.content.decode()))
print(len(response1.content.decode()))
As the output shows, without disguise the response is only 2287 characters long, while with a browser User-Agent it is 295758 characters long.
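Part of the reason is that requests identifies itself honestly unless told otherwise. A minimal sketch that prints the User-Agent requests would send by default, without touching the network (`default_user_agent` is a helper in `requests.utils`):

```python
import requests
from requests.utils import default_user_agent

# The User-Agent that requests sends when no headers are supplied
default_ua = default_user_agent()
print(default_ua)

# A new Session carries the same value in its default headers
session_ua = requests.Session().headers['User-Agent']
print(session_ua == default_ua)
```

A server that sees this value can tell at once that the client is a script, which is why the header is replaced with a browser string.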
2. Request headers
headers is a dict. In practice it usually needs two entries: Cookie and User-Agent.
import requests
url='https://github.com/Khazing'
header={'Cookie':'_octo=GH1.1.1409507418.1634466346; _device_id=657e29e120e5f4c50fd8f575dc1651eb; user_session=0cgOhLVHLt1AQzVFHGVYnBPb0yDinMqmr0PNWSNjZfSAY9ww; __Host-user_session_same_site=0cgOhLVHLt1AQzVFHGVYnBPb0yDinMqmr0PNWSNjZfSAY9ww; logged_in=yes; dotcom_user=Khazing; has_recent_activity=1; color_mode=%7B%22color_mode%22%3A%22auto%22%2C%22light_theme%22%3A%7B%22name%22%3A%22light%22%2C%22color_mode%22%3A%22light%22%7D%2C%22dark_theme%22%3A%7B%22name%22%3A%22dark%22%2C%22color_mode%22%3A%22dark%22%7D%7D; _gh_sess=RjgJAxoXBWueK4CHLzr7hmxOC%2BW0GSlqophwVtis534lsLS%2FN3PZ6eeBcmrcIstWJCxyKFCu51v3muGPySlsP%2BrCPuTi%2Bl%2BfbKVKWSA6UeyXWm3PnLnGo6hQz1GRf1MsZ5fGGb8%2BBRdQmM9NmBzq9dx0Y9PDwjO1j160tc9yrb2euaiP4B%2Bp%2BsuCo7X9MoId--bxhgtt5nehk5Qc%2FH--2qKYAo1TVPp0HgMBpQPuqw%3D%3D'
,'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 Edg/94.0.992.38'}
response1=requests.get(url,headers=header)
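User-Agent
Tells the server what kind of client (for example a PC or an Android browser) sent the request.
Cookies
Keep you logged in, so you can crawl information that is specific to a user.
Usage: log in once while carrying the cookie, then keep using a session object, which stores the cookies for later requests.
session = requests.session()
response = session.get(url)
# data holds the form fields of the request
response = session.post(url, data=data)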
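A minimal sketch of how a session carries headers and cookies, assuming a placeholder URL and a made-up cookie name and value. The request is only prepared, not sent, so the attached headers can be inspected offline:

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update({'User-Agent': 'Mozilla/5.0 (placeholder)'})
# Hypothetical cookie name and value, standing in for a real login cookie
session.cookies.set('user_session', 'PLACEHOLDER')

# Prepare (but do not send) a request to see what the session would attach
request = requests.Request('GET', 'https://example.com/profile')
prepared = session.prepare_request(request)
print(prepared.headers['User-Agent'])
print(prepared.headers.get('Cookie'))
```

In a real crawl, cookies returned by the server after login are stored in `session.cookies` automatically, so later `session.get` calls stay logged in.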
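3. Parameters (params)
params is a dict whose entries are the query string parameters of the request, as listed for the captured request in the F12 network panel.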
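To see what params does, the sketch below prepares a request without sending it and prints the final URL. The endpoint and parameter names are made up for illustration:

```python
import requests

# Hypothetical endpoint and parameters, for illustration only
url = 'https://example.com/search'
params = {'q': 'python', 'page': 2}

# Prepare the request without sending it, to inspect the final URL
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)
```

Passing the same dict as `requests.get(url, params=params)` builds the same URL and then sends the request.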
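4. Proxies
The proxies argument routes requests through proxy servers, so the target server does not see one IP sending request after request, decide it is a crawler, and ban it.
Types of proxy and how to use them:
proxies is a dict that maps a URL scheme ('http' or 'https') to the address of the proxy used for that scheme: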
import requests
url = 'http://youtube.com'
Proxy = {
    'http': 'http://103.138.164.106:80',
    'https': 'https://61.133.87.228:55443'
}
response = requests.get(url, proxies=Proxy)
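5. Time interval
Firing requests back to back without a pause is very likely to be flagged as a crawler, since no human browses that way, so put a delay between requests.
import time
# n is the delay in seconds
time.sleep(n)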
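A fixed delay is itself a regular pattern, so one common variation is to randomize it. This is a sketch rather than part of the original notes; the helper name `polite_sleep` and the delay bounds are assumptions:

```python
import random
import time

def polite_sleep(min_seconds, max_seconds):
    """Sleep for a random time in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

# Example: pause briefly between simulated requests
for page in range(1, 4):
    waited = polite_sleep(0.01, 0.05)
    print(f'page {page}: waited {waited:.3f} s')
```

For a real site the bounds would be seconds rather than hundredths of a second.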
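II. Simulated login
CAPTCHAs:
1. Ordinary (text) CAPTCHAs
Download the CAPTCHA image → send it to an online CAPTCHA-recognition API → get the code
2. Slider and click CAPTCHAs
Download the CAPTCHA image → send it to an online CAPTCHA-recognition API → drive a browser to perform the slide or click automatically (this needs position parameters)
The selenium module is a browser automation module. For crawling it offers two conveniences: easy access to dynamically loaded data and an easy way to simulate a login.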
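III. Sending a request to get the page source
There are two request methods, GET and POST.
GET parameters usually appear in the browser's address bar, or can be read from the query string of the captured request. Most requests are GET.
POST parameters do not appear in the address bar; look at the form data of the captured request instead. Logins use POST.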
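The difference can be seen by preparing one request of each kind and looking at where the parameters end up. The URLs and field names below are hypothetical, and nothing is sent over the network:

```python
import requests

# Hypothetical URLs and field names, for illustration only
get_req = requests.Request(
    'GET', 'https://example.com/list', params={'page': 1}
).prepare()
post_req = requests.Request(
    'POST', 'https://example.com/login',
    data={'user': 'alice', 'password': 'secret'}
).prepare()

print(get_req.url)    # GET: the parameters are part of the URL
print(get_req.body)   # GET: no request body
print(post_req.url)   # POST: the URL carries no parameters
print(post_req.body)  # POST: the form fields travel in the body
```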
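Data types of the response
# UserAgent is the helper module listed at the end of this post
# here response is a requests.Response object
response = requests.get(url=url, headers=UserAgent.get_headers())
# here response is the page source as str
response = requests.get(url=url, headers=UserAgent.get_headers()).text
# here response is the page source as bytes
response = requests.get(url=url, headers=UserAgent.get_headers()).content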
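Things to watch out for
1. Dynamically loaded data
The page source and the content of the Elements tab in F12 can differ, because a lot of data is loaded dynamically.
What the requests module returns is the page source, which does not include dynamically loaded data. The Elements tab shows the page after the browser has rendered it, which does.
The resources you want are often loaded dynamically, so before every crawl, check whether the target resource is.
Example:
The Elements tab in F12 shows an mp4 resource,
but searching the page source for mp4 finds nothing.
In that case, search the captured requests for mp4.
Note that the address in the JSON response is not necessarily the real one. Compare what you get by opening the address from the JSON
with what you get by opening the address shown in F12.
What we can fetch is the address in the JSON, but the real address is the one in F12, so the JSON address has to be transformed into the real one.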
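Searching a captured JSON response for mp4 addresses can also be done in code. The sketch below walks a parsed JSON value and collects every string ending in `.mp4`. The payload shape and the `srcUrl` key are invented for illustration; a real response will look different, and the step that turns the JSON address into the real one depends entirely on the site.

```python
import json

def find_urls(node, suffix='.mp4'):
    """Recursively collect string values ending with suffix from parsed JSON."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_urls(value, suffix))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_urls(item, suffix))
    elif isinstance(node, str) and node.endswith(suffix):
        found.append(node)
    return found

# Hypothetical JSON body, shaped like a response found in the network panel
payload = json.loads(
    '{"data": {"title": "demo", "video": {"srcUrl": "https://example.com/v/1.mp4"}}}'
)
print(find_urls(payload))
```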
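2. Garbled text (mojibake)
Fix:
response = requests.get(url=url, headers=UserAgent.get_headers())
# The page is GBK but was decoded as ISO-8859-1: undo the wrong decode, then decode as GBK
page_text = response.text.encode('iso-8859-1').decode('gbk')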
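Why the round trip works can be shown without any network access: ISO-8859-1 maps every byte to exactly one character, so encoding the garbled string back recovers the original bytes. The sample string is arbitrary:

```python
original = '中文网页'
raw_bytes = original.encode('gbk')        # the bytes a GBK page would send
garbled = raw_bytes.decode('iso-8859-1')  # what a wrong charset guess produces
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(garbled == original)   # False: the text is mojibake
print(repaired == original)  # True: the round trip restores it
```

With requests, an alternative that should have the same effect is to set `response.encoding = 'gbk'` before reading `response.text`.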
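3. Pagination
(1) for loop: build each page's address directly
(2) Scrape the page numbers and the next-page address from the page itself
(3) selenium module: turn the pages in the browser directly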
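Option (1) can be sketched as follows, assuming the page number is the last query parameter of the address. The base URL and the helper name `build_page_urls` are placeholders:

```python
def build_page_urls(base_url, first_page, last_page):
    """Return one URL per page by appending the page number to base_url."""
    return [base_url + str(n) for n in range(first_page, last_page + 1)]

# Hypothetical listing address that ends in "page="
urls = build_page_urls('https://example.com/search?q=cats&page=', 1, 3)
for u in urls:
    print(u)
```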
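IV. Locating resources
1. Regular expressions
Get the page source → write a regex → extract the list of resource URLs
Extracting the text inside tags
# re.sub(pattern, replacement, string, count=n, flags=...): count is the maximum number of replacements; flags control how the pattern is matched
content = re.sub(r'<(\S*?)[^>]*>.*?|<.*? />', '', content_text)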
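Both steps, extracting URLs and stripping tags, can be tried on a small HTML snippet. The snippet is made up, and the patterns are deliberately simpler than the one above; regexes like these are fragile on real pages with unusual attribute quoting.

```python
import re

html = ('<div><img src="https://example.com/a.jpg">'
        '<p>Hello <b>world</b></p>'
        '<img src="https://example.com/b.png"></div>')

# Capture the value of every src attribute on an img tag
img_urls = re.findall(r'<img[^>]+src="([^"]+)"', html)
print(img_urls)

# Remove every tag, leaving only the text
text = re.sub(r'<[^>]+>', '', html)
print(text)
```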
2. BeautifulSoup
Create a BeautifulSoup object → locate the elements → get the list of resource URLs
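The three steps look roughly like this on a made-up snippet. The class name and attribute are assumptions, and the sketch uses the standard-library `html.parser` backend so it runs without lxml installed:

```python
from bs4 import BeautifulSoup

html = '''
<ul class="gallery">
  <li><img data-src="https://example.com/small/1.jpg"></li>
  <li><img data-src="https://example.com/small/2.jpg"></li>
</ul>
'''

# Step 1: create the BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Step 2: locate the elements with a CSS selector
img_tags = soup.select('ul.gallery > li > img')
# Step 3: read the attribute that holds the resource URL
img_urls = [tag['data-src'] for tag in img_tags]
print(img_urls)
```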
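V. Requesting the URLs to download the resources
1. Single process
The usual way: each task has to wait for the previous one to finish.
2. Multiple workers
Note that multiprocessing.dummy offers the multiprocessing Pool API but runs its workers as threads rather than processes, which suits I/O-bound work such as downloads.
import time
import requests
from multiprocessing.dummy import Pool
import UserAgent  # helper module listed at the end of this post
def get_content(content_url):
    # Use the last path segment of the URL as the file name
    path = content_url.split('/')[-1]
    content = requests.get(url=content_url, headers=UserAgent.get_headers()).content
    with open(path, 'wb') as file:  # the with block closes the file
        file.write(content)
    print(path + " downloaded")
    time.sleep(2)
content_url_list = []  # store the content URLs in this list
# Create a Pool object
pool = Pool(n)  # n is the number of workers
pool.map(get_content, content_url_list)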
For more on this module, see this repost: Python 多进程 multiprocessing.Pool类详解_另一个自己-CSDN博客
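How pool.map behaves can be seen without downloading anything. In this sketch `fake_download` is a stand-in that only sleeps, and the URLs are placeholders. It shows that results come back in input order and that the workers are threads:

```python
import threading
import time
from multiprocessing.dummy import Pool

def fake_download(url):
    """Stand-in for a download: wait briefly, then report the worker thread."""
    time.sleep(0.05)
    return (url, threading.current_thread().name)

urls = [f'https://example.com/img/{i}.jpg' for i in range(8)]

with Pool(4) as pool:
    results = pool.map(fake_download, urls)

returned_urls = [u for u, _ in results]
worker_names = {name for _, name in results}
print(returned_urls == urls)  # map preserves the order of the inputs
print(len(worker_names))      # number of distinct worker threads used
```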
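VI. Storage
When a file is opened in binary mode ('wb'), the data written must be bytes.
.encode() encodes a str into bytes
.decode() decodes bytes into a str
Also pick the file extension to match the content: jpg/png for images, txt for articles, mp4 for videos.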
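A small round trip illustrates the bytes/str rule. The text, the file name, and the use of a temporary folder are arbitrary choices for the sketch:

```python
import os
import tempfile

text = '爬虫 notes'
data = text.encode('utf-8')  # str -> bytes, required for 'wb' mode

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'article.txt')  # txt extension for an article

with open(path, 'wb') as f:
    f.write(data)

with open(path, 'rb') as f:
    restored = f.read().decode('utf-8')  # bytes -> str

print(restored == text)
```

`response.content` from requests is already bytes, so images and videos can be written in 'wb' mode without an encode step.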
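Background: a few days ago I wanted to look for wallpapers on wallhaven, a site I had not visited in a long time, and found that the image previews would not load. From F12 it looked like the js had been hijacked, and it took a script to get the previews to show. Then I started browsing, found saving images one at a time tedious, and wrote a crawler. It turned out to be really convenient, and fun too.
Code: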
import requests
import UserAgent
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool
import datetime
starttime = datetime.datetime.now()
def get_content(content_url):
    # Drop the 31-character prefix 'https://w.wallhaven.cc/full/xx/' to get the file name
    img_name = content_url[31:]
    path = file_name + '/' + img_name
    content = requests.get(url=content_url, headers=UserAgent.get_headers()).content
    with open(path, 'wb') as file:  # the with block closes the file
        file.write(content)
    print(img_name + " downloaded!")
os.chdir('E:/壁纸')  # working directory
file_name = 'E:/壁纸/宫崎骏'  # output folder
if not os.path.exists(file_name):
    os.mkdir(file_name)
for page_num in range(1, 20):
    img_url_list = []
    # url = link of the listing being crawled (up to "page=") + str(page_num)
    url = 'https://wallhaven.cc/search?q=id%3A1748&categories=110&purity=100&sorting=favorites&order=desc&page=' + str(page_num)
    response = requests.get(url=url, headers=UserAgent.get_headers())
    page_text = response.text
    soup = BeautifulSoup(page_text, 'lxml')
    img_tags = soup.select('.thumb-listing-page > ul >li >figure>img ')
    for img in img_tags:
        src = img.attrs['data-src']
        key = src.split('/')
        # Build the full-size address from the thumbnail address, trying png first
        img_src = 'https://w.wallhaven.cc/full/' + key[4] + '/' + 'wallhaven-' + key[5][:-3] + 'png'
        img_response = requests.get(url=img_src, headers=UserAgent.get_headers())
        if img_response.status_code == 404:
            # No png at that address, so fall back to jpg
            img_src2 = 'https://w.wallhaven.cc/full/' + key[4] + '/' + 'wallhaven-' + key[5][:-3] + 'jpg'
            img_url_list.append(img_src2)
        else:
            img_url_list.append(img_src)
    pool = Pool(4)  # 4 worker threads
    pool.map(get_content, img_url_list)
    pool.close()  # stop accepting new tasks
    pool.join()  # block until the workers have exited
endtime = datetime.datetime.now()
print(endtime - starttime)  # elapsed time
# about 5 s per image
# UserAgent.py file
import random
user_agent = [
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UCWEB7.0.2.37/28/999",
"NOKIA5700/ UCWEB7.0.2.37/28/999",
"Openwave/ UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
# iPhone 6
"Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
# newer mobile UA
"Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
]
# Return a headers dict with a randomly chosen User-Agent
def get_headers():
    return {'User-Agent': random.choice(user_agent)}