PyMySQL基础

最新推荐文章于 2025-08-18 23:17:42 发布

停停的茶

最新推荐文章于 2025-08-18 23:17:42 发布

阅读量468

点赞数 6

CC 4.0 BY-SA版权

文章标签： python mysql html

本文链接：https://blog.csdn.net/ylqw258369/article/details/149753009

一、PyMySQL基础

1. 简介：是Python3中连接MySQL的库，实现Python数据库API v2.0，目标是替代Python2中的MySQLdb。支持关系型数据库（如MySQL、SQLite等）的操作，核心用于将爬虫数据存储到MySQL。

2. 安装：命令：pip install pymysql -i https://pypi.tuna.tsinghua.edu.cn/simple

3. 核心使用步骤

a. 创建数据库连接对象（connect）：通过pymysql.connect()设置参数（主机、端口、用户、密码、数据库名、字符集等）。

b. 获取游标对象（cursor）：通过connect.cursor()创建，用于执行SQL语句。

c. 执行SQL语句：通过cursor.execute()执行DDL、DML等语句（如增删改查）。

d. 提交事务并关闭连接：通过connect.commit()提交事务（失败则rollback()回滚），最后关闭游标（cursor.close()）和连接（connect.close()）。

4. 关键对象方法

connect对象：cursor()（创建游标）、commit()（提交事务）、rollback()（回滚）、close()（关闭连接）。

cursor对象：execute()（执行SQL）、fetchone()/fetchmany()/fetchall()（获取结果）、rowcount()（返回数据行数）、close()（关闭游标）。

二、MySQL表操作

1.创建表:通过CREATE TABLE语句定义表结构，需指定字段名、类型、约束（如主键、非空）。

示例：创建students表，包含id（学号，主键）、name（姓名）、age（年龄）。

import pymysql

# 连接数据库
db = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='111111',改为自己的密码
    port=3306,
    db="vb" 改为自己的数据库
)
cursor = db.cursor()

# 创建students表（若不存在）
sql = 'create table if not exists students(id varchar(255) not null, name varchar(255), age int not null, primary key(id))'
cursor.execute(sql)  # 执行创建表SQL
db.close()  # 关闭数据库连接

2. 插入数据

- 基础方式：用INSERT语句+占位符%s传递参数，配合cursor.execute(sql, val)。

- 动态字典插入：通过字典存储字段名和值，自动生成SQL语句，适应字段新增需求，无需修改SQL。

- 事务保障：插入属于事务操作，需通过try-except捕获异常，成功commit()，失败rollback()，保障ACID特性（原子性、一致性、隔离性、持久性）。

向表内插入数据

import pymysql

# 待插入数据
id = '20120001'
user = 'Bob'
age = 20

# 连接数据库
db = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='111111',
    port=3306,
    db="vb"
)
cursor = db.cursor()

# 定义插入SQL（使用%s作为占位符）
sql = 'insert into students(id, name, age) values(%s, %s, %s)'
val = ["2012001", "Bob", "20"]  # 对应占位符的参数

try:
    cursor.execute(sql, val)  # 执行插入
    db.commit()  # 提交事务（确认插入）
except:
    db.rollback()  # 若失败则回滚
db.close()  # 关闭连接

import pymysql

# 待插入数据（字典形式，字段可动态增减）
data = {
    'id': '20120002',
    'name': 'Andy',
    'age': 20
}
table = 'students'  # 目标表名

# 拼接SQL（动态适配字典字段）
keys = ','.join(data.keys())  # 字段名：id,name,age
values = ', '.join(['%s'] * len(data))  # 占位符：%s, %s, %s

# 连接数据库
db = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='123456',
    port=3306,
    db="cv"
)
cursor = db.cursor()

# 定义插入SQL（使用format动态填充表名、字段、占位符）
sql = 'insert into {table}({keys}) values({values})'.format(table=table, keys=keys, values=values)

try:
    # 执行插入（参数为字典值的元组）
    if cursor.execute(sql, tuple(data.values())):
        print('Successful')
        db.commit()  # 提交事务
except:
    print('Failed')
    db.rollback()  # 回滚
db.close()  # 关闭连接

3. 更新数据：用UPDATE语句指定更新字段和条件，通过cursor.execute(sql, (参数))执行，同样需事务处理。

import pymysql

# 连接数据库
db = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='123456',
    port=3306,
    db="cv"
)
cursor = db.cursor()

# 定义更新SQL（更新Bob的年龄为23）
sql = 'update students set age=%s where name=%s'

try:
    cursor.execute(sql, (23, 'Bob'))  # 执行更新（参数为(新值, 条件值)）
    db.commit()  # 提交事务
except:
    db.rollback()  # 回滚
db.close()  # 关闭连接

4. 删除数据：用DELETE语句指定删除条件，通过cursor.execute(sql)执行，需事务处理。

import pymysql

table = 'students'  # 目标表名
condition = 'age >22'  # 删除条件（年龄大于22的记录）

# 连接数据库
db = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='123456',
    port=3306,
    db="cv"
)
cursor = db.cursor()

# 定义删除SQL（动态填充表名和条件）
sql = 'delete from {table} where {condition}'.format(table=table, condition=condition)

try:
    cursor.execute(sql)  # 执行删除
    db.commit()  # 提交事务
except:
    db.rollback()  # 回滚
db.close()  # 关闭连接

5. 查询数据：用SELECT语句查询，通过cursor.fetchone()（单条）、fetchall()（所有）获取结果，cursor.rowcount获取数据行数。

import pymysql

# 连接数据库
db = pymysql.connect(
    host='127.0.0.1',
    user='root',
    password='123456',
    port=3306,
    db="cv"
)
cursor = db.cursor()

# 定义查询SQL（查询年龄>=20的记录）
sql = 'select * from students where age>=20'

try:
    cursor.execute(sql)  # 执行查询
    print('Count:', cursor.rowcount)  # 打印查询结果数量
    row = cursor.fetchone()  # 获取第一条记录
    while row:  # 循环获取所有记录
        print('Row:', row)
        row = cursor.fetchone()
except:
    print('Error')  # 捕获异常

三、实际案例

1. 豆瓣图书爬取

任务：爬取豆瓣新书速递信息（书名、评分、作者、出版社、出版时间），存储到MySQL。

步骤：

1. 爬取页面（通用爬虫框架）；

2. 创建bookinfo表（字段：id（自增主键）、BookName、Score、Autor、Press、Pubdate）；

3. 用XPath解析字段（书名、评分、作者等）；

4. 存储：通过cursor.executemany批量插入多条数据，用事务保障完整性。

import requests
from lxml import etree
import pymysql

def get_html(url,time=30): #get请求通用函数，去掉了user-agent简化代码
    try:
        headers={
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/80.0.3987.132 Safari/537.36"}
        r = requests.get(url,headers=headers, timeout=time) # 发送请求
        r.encoding = r.apparent_encoding # 设置返回内容的字符集编码
        r.raise_for_status() # 返回的状态码不等于200抛出异常
        return r.text # 返回网页的文本内容
    except Exception as error:
        print(error)

def parser(html):          # 获取网页内容
    doc=etree.HTML(html)
    out_list=[]
    for row in doc.xpath("//*[@id='content']/div[2]/div[1]/ul/li"):
        title=row.xpath("div[2]/h2/a/text()")[0].strip()
        score_list=row.xpath("div[2]/p[2]/span[2]/text()")
        score = score_list[0].strip() if score_list else "暂无评分"
        info=row.xpath("div[2]/p[1]/text()")[0].strip().split("/")
        row_list=[
                title, #卖点标题
                score,
                info[0] ,
                info[-3] ,
                info[-4]
        ]
        out_list.append(row_list)
    return out_list;
def save_mysql(sql,val,**dbinfo):
    try:
        connect=pymysql.connect(**dbinfo)
        cursor=connect.cursor()
        cursor.executemany(sql, val)
        connect.commit()
    except Exception as err:
        connect.rollback()
    finally:
        cursor.close()
        connect.close()

if __name__=="__main__":
    url="https://book.douban.com/latest?icn=index-latestbook-all"
    html=get_html(url)  #发送请求
    out_list=parser(html) #数据解析、重新发送请求
    parms = {
        "host": "127.0.0.1",
        "user": "root",
        "password": "123456",
        "db": "cv",
        "charset": "utf8",
        "cursorclass": pymysql.cursors.DictCursor}
    sql = "INSERT into bookinfo(bookname,score,autor,press,pubdate) " \
            "VALUES(%s,%s,%s,%s,%s)"  # 带占位符的SQL
    save_mysql(sql,out_list,**parms)  # 调用函数，注意**不能省略

2. 安居客二手房信息爬取

任务：爬取重庆二手房信息（卖点、楼盘、地址、户型、面积、单价等），存储到MySQL。

步骤：

1. 爬取页面（requests发送请求）；

2. 创建houseinfo表（字段：id（自增主键）、title、house、address等）；

3. 用XPath解析字段，处理分页（递归调用爬取下一页）；

4. 存储：封装通用存储函数save_mysql，用executemany批量插入，支持动态数据库参数。

import requests
from lxml import etree
import csv,pymysql
def get_html(url,time=30): #get请求通用函数，去掉了user-agent简化代码
    try:
        headers={
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) ""AppleWebKit/537.36 (KHTML, like Gecko) ""Chrome/80.0.3987.132 Safari/537.36"}
        r = requests.get(url,headers=headers, timeout=time) # 发送请求
        r.encoding = r.apparent_encoding # 设置返回内容的字符集编码
        r.raise_for_status() # 返回的状态码不等于200抛出异常
        return r.text # 返回网页的文本内容
    except Exception as error:
        print(error)
out_list=[]
def parser(html): #解析函数
    doc=etree.HTML(html)
    for row in doc.xpath("//div[@class='property']"):
        title = row.xpath("a//h3/text()")[0].strip() #卖点标题
        #房屋户型|建筑面积|所在楼层|建造年代
        info = row.xpath("string(./a//*[@class='property-content-info'])")
#去掉房屋信息中多余的空格和换行符
        info=info.replace("  ","").replace("\n\n","\n").split("\n")
#以下按照顺序为楼盘，地址，价格
        house= row.xpath("a//*[@class='property-content-info-comm-name']/text()")[0]
        address=row.xpath("string(./a//*[@class='property-content-info-comm-address'])")
        price = row.xpath("a//*[@class='property-price-average']/text()")[0]
        price=price.replace(",","")
        row_list=[title, #卖点标题
                  house.strip(),#楼盘
                  address.strip(),#地址
                  info[0] if len(info)>=1 else "", #房屋户型
                  info[1] if len(info)>=2 else "", #建筑面积
                  info[3] if len(info)>=4 else "", #所在楼层
                  info[4] if len(info)>=5 else "", #建造年代
                  price #房价
                  ]
    out_list.append(row_list)
    ele_next = doc.xpath("//*[@class='next  next-active']/@href")
    if ele_next:
        # 递归调用，重新发送请求，并解析
        new_url = ele_next[0]
        if new_url.startswith("?page="):
            new_url = "http://www.bspider.top/anjuke/" + new_url
            parser(get_html(new_url))

def save_mysql(sql, val, **dbinfo):
    try:
        connect = pymysql.connect(**dbinfo)
        cursor = connect.cursor()  # 获取游标对象
        cursor.executemany(sql, val)  # 执行多条SQL
        connect.commit()  # 事务提交
    except Exception as err:
        connect.rollback()  # 事务回滚
        print(err)
    finally:
        cursor.close()
        connect.close()
if __name__=="__main__":
    url="http://www.bspider.top/anjuke/"
    html=get_html(url)  #发送请求
    parser(html) #数据解析、重新发送请求
    parms = {
        "host": "127.0.0.1",
        "user": "root",
        "password": "123456",
        "db": "cv",
        "charset": "utf8",
        "cursorclass": pymysql.cursors.DictCursor}
    sql = "INSERT into houseinfo(Title,House,Address,Struct,Area,Floor,MakeTime,price) " \
            "VALUES(%s,%s,%s,%s,%s,%s,%s,%s)"  # 带占位符的SQL
    save_mysql(sql, out_list, **parms)  # 调用函数，注意**不能省略