Web Crawlers: Crawler Theory + the requests Module

Original

Published 2024-10-01 17:22:02

CC 4.0 BY-SA license

Article tags:

#crawler

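I. Crawler Theory

Crawler: an automated program that requests websites and extracts data.

A web crawler (also called a web spider or web robot) is a program that simulates a client: it sends network requests, receives the responses, and automatically collects information from the Internet according to a set of rules.

In principle, anything a browser (client) can do, a crawler can do as well. In other words, if you can see it, you can crawl it.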
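The "extract data according to a set of rules" step can be sketched with the standard library alone. This is a minimal illustration, not a full crawler: the `LinkExtractor` class, the `extract_links` helper, and the sample HTML are all hypothetical, and a hard-coded string stands in for a real response body so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/news" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """The 'rule' step of a crawler: turn raw HTML into structured data."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# A hard-coded page stands in for a real response body (assumption).
SAMPLE_HTML = (
    '<html><body><a href="/news">News</a> '
    '<a href="https://example.com/map">Map</a></body></html>'
)
links = extract_links(SAMPLE_HTML, "https://example.com/")
print(links)
```

In a real crawler the HTML would come from a response to a network request, and the extracted links could feed the next round of requests.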

What kinds of data can a crawler collect?

- Web page text

- Images

- Video and audio

- Anything else (whatever can be requested can be retrieved)
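The categories above differ mainly in whether the response body should be handled as text or as raw bytes. The sketch below shows one way a crawler might decide, based on the `Content-Type` response header. The `body_kind` helper and the `TEXT_LIKE` set are assumptions made for illustration; with Requests, a text body would typically be read via `response.text` and a binary body via `response.content`.

```python
# Media types that are textual even though they do not start with "text/".
# This set is an illustrative assumption, not an exhaustive list.
TEXT_LIKE = {"application/json", "application/xml", "application/javascript"}


def body_kind(content_type):
    """Map a Content-Type header value to 'text' or 'binary'."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/") or media_type in TEXT_LIKE:
        return "text"
    return "binary"


for header in ("text/html; charset=utf-8", "image/png", "video/mp4", "application/json"):
    print(header, "->", body_kind(header))
```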

II. The requests Module

Purpose: send network requests and obtain the response data.

Official documentation: https://requests.readthedocs.io/zh_CN/latest/index.html

Requests is an HTTP library written in Python on top of urllib3 and released under the Apache 2.0 open-source license.

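It is more convenient than urllib, saves a great deal of work, and fully covers typical HTTP testing needs.

In one sentence: Requests is an HTTP request library written in Python that makes it easy to simulate a browser sending HTTP requests from code.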

Install command
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
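Once the package is installed, the convenience claim can be illustrated by building the same GET request twice, once with the standard library and once with Requests. This is a sketch under stated assumptions: the query parameter and the `User-Agent` string are made up for the example, and both requests are only constructed, never sent, so it runs without network access.

```python
import urllib.parse
import urllib.request

import requests

URL = "http://httpbin.org/get"
PARAMS = {"wd": "crawler"}                    # hypothetical query parameter
HEADERS = {"User-Agent": "demo-crawler/0.1"}  # hypothetical UA string

# Standard library: encode the query string and assemble the URL by hand.
stdlib_request = urllib.request.Request(
    URL + "?" + urllib.parse.urlencode(PARAMS), headers=HEADERS
)

# Requests: pass the dict and let the library do the encoding.
prepared = requests.Request("GET", URL, params=PARAMS, headers=HEADERS).prepare()

print(stdlib_request.full_url)
print(prepared.url)
```

In everyday use the preparation step is hidden: `requests.get(URL, params=PARAMS, headers=HEADERS)` builds the same request and sends it in one call.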

1. Sending a request with Requests

# https://www.baidu.com/
import requests
response = requests.get('https://www.baidu.com/')
print(response) # Response object (wraps the body, status code, and URL); prints as <Response [200]> on success
print(response.text) # Response body decoded as text
print(type(response.text)) # Data type of the response body: <class 'str'>
print(response.status_code) # Response status code
print(response.url) # URL of the response
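The attributes printed above can be gathered into one small helper. The sketch below also shows that `response.text` is decoded from the raw bytes using `response.encoding`, so setting the encoding explicitly can help when a page comes back garbled. The `summarize` function and its `encoding` parameter are hypothetical names introduced for this example.

```python
import requests


def summarize(response, encoding=None):
    """Collect the commonly inspected Response attributes into a dict."""
    if encoding is not None:
        # .text is decoded from the raw bytes (.content) using .encoding.
        response.encoding = encoding
    return {
        "status_code": response.status_code,
        "url": response.url,
        "ok": response.ok,  # True when the status code is below 400
        "text_type": type(response.text).__name__,
        "text_length": len(response.text),
    }


# Usage (needs network access, so it is left commented out):
# info = summarize(requests.get("https://www.baidu.com/", timeout=10), encoding="utf-8")
# print(info)
```

Passing `timeout=` in the usage line is a design choice rather than something the original snippet does: without it, a request to an unresponsive server can wait indefinitely.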

Request methods

import requests
requests.get('http://httpbin.org/get') # GET request
requests.post('http://httpbin.org/post') # POST request
requests.put('http://httpbin.org/put') # PUT request
requests.delete('http://httpbin.org/delete') # DELETE request
requests.head('http://httpbin.org/get') # HEAD request
requests.options('http://httpbin.org/get') # OPTIONS request
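Each of these helpers is a thin wrapper around the generic `requests.request(method, url, ...)` call. To see what each one would send without contacting httpbin, the sketch below prepares the requests instead of sending them. The `build` helper, the `ENDPOINTS` mapping, and the form field are assumptions made for illustration.

```python
import requests

BASE_URL = "http://httpbin.org"

# Method name -> httpbin path, mirroring the calls listed above.
ENDPOINTS = {
    "GET": "/get",
    "POST": "/post",
    "PUT": "/put",
    "DELETE": "/delete",
    "HEAD": "/get",
    "OPTIONS": "/get",
}


def build(method, path, **kwargs):
    """Prepare a request without sending it (hypothetical helper)."""
    return requests.Request(method, BASE_URL + path, **kwargs).prepare()


prepared = [build(method, path) for method, path in ENDPOINTS.items()]
for p in prepared:
    print(p.method, p.url)

# A POST usually carries a body; a dict passed as data= is form-encoded.
form_post = build("POST", "/post", data={"name": "crawler"})
print(form_post.body)
print(form_post.headers["Content-Type"])
```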

1. GET-based requests

# First approach
#https://www.baidu.com/s?wd=%E9%97%B9%E9%97%B9&base_query=%E9%97%B9%E9%97%B9&pn=10&oq=%E9%97%B9%E9%97%B9&ie=utf-8&usm=1&rsv_pq=a4f4e52200027b13&rsv_t=82600eHOUMYEzX16IwoPl%2BnK%2FnzM6jy5R9dF