Python 3 Web Scraping (1): A First Attempt

wiz_333

License: CC 4.0 BY-SA

Category: Python 3.5 Crawler. Tags: web scraping, urllib, scraping IP addresses

Article link: https://blog.csdn.net/wizblack/article/details/79746834

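This article walks through an example of scraping web pages with Python 3: it sends HTTP requests with the urllib module to fetch the HTML source, extracts IP address information with a regular expression, and saves the data to a local file.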

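I recently learned how web scraping works in Python 3: send an HTTP request with the urllib module to fetch a page's HTML source, then use regular expressions to pull out the information you want to collect. (Note that Python 2 uses different modules for this.)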
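As a minimal sketch of this mechanism that needs no network access, the snippet below builds a request object with a custom User-Agent and runs a regular expression over a made-up HTML fragment. The URL `http://example.com/`, the shortened User-Agent string, and the `<li>` markup are assumptions chosen for illustration only.

```python
import re
from urllib import request

# Hypothetical URL, used only to build the request object; nothing is sent here.
req = request.Request(
    url="http://example.com/",
    headers={"User-Agent": "Mozilla/5.0"},
)
# urllib stores header names via str.capitalize(), hence the key "User-agent".
user_agent = req.get_header("User-agent")

# A made-up HTML fragment standing in for a downloaded page.
sample_html = "<ul><li>10.0.0.1</li><li>10.0.0.2</li></ul>"
# The non-greedy group (.*?) captures the text between each pair of tags.
addresses = re.findall(r"<li>(.*?)</li>", sample_html)

print(user_agent, addresses)
```

Passing `req` to `request.urlopen` is what would actually send the request; it is left out here so the sketch runs offline.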

Below is a small first attempt, with the requirements and the source code.

Requirement: scrape the IP address information from a website, page by page.

Storage format: write the information obtained from the pages to a .txt file, one record per line.
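One way to picture this storage format is the sketch below, which writes a few records to a text file, one per line. The helper name `save_rows`, the sample records, the comma separator, and the temporary directory are all assumptions for illustration; the article's own code formats each line differently.

```python
import os
import tempfile

def save_rows(rows, path):
    """Write each tuple of strings as one comma-separated line."""
    # The with-statement closes the file even if a write fails.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(", ".join(row) + "\n")

# Made-up records standing in for scraped table rows.
rows = [("10.0.0.1", "8080", "Beijing"), ("10.0.0.2", "3128", "Shanghai")]
path = os.path.join(tempfile.mkdtemp(), "ips.txt")
save_rows(rows, path)

# Read the file back to check what was stored.
with open(path, encoding="utf-8") as f:
    saved_lines = f.read().splitlines()
print(saved_lines)
```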

Code logic

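# Import the request submodule of urllib, used to send HTTP requests
from urllib import request
# Import the re module, used for regular expression matching
import re
# Open the output file ('w+' truncates the file and opens it for reading and writing; it does not append)
file = open('ips.txt', 'w+', encoding='utf-8')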

def get_html_66ip():
    # Set User-Agent in the headers to pose as a browser; without it, urllib sends a default such as "Python-urllib/3.5"
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    # 66ip proxy list, nationwide proxy IPs (IPs verified in 2018, first 89 pages)
    for page_num in range(1,90):
        target = 'http://www.66ip.cn/{}'.format(page_num)
        r_obj = request.Request(url=target,headers=headers)
        response = request.urlopen(r_obj)
        html = response.read().decode('gbk')
        # print(html)

        pat_html = re.compile(r'<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>',re.S)

        # Note that findall returns a list whose elements are tuples, i.e. [('xxx','xxxx'),('xxxx','xxxxx')]
        res = pat_html.findall(html)

        # Skip the first match, which is the table's header row
        ips = []
        for k,v in enumerate(res):
            if k != 0:
                ips.append(v)
        # Convert each tuple to a str, strip the leading '(' and trailing ')', and write it to the file
        for v in ips:
            file.write(str(v).lstrip('(').rstrip(')') + '\n')

# Entry point: this block runs only when the file is executed as a script
if __name__ == '__main__':
    # Run the scraper
    get_html_66ip()
    # Close the file object
    file.close()
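The script above fetches, parses, and writes inside one function. As a hedged sketch of one possible restructuring, the code below separates fetching from parsing so the parsing step can be tried on a fixed string. The names `fetch_page`, `parse_rows`, and `ROW_PATTERN`, the 10-second timeout, the `errors="replace"` decoding, and the sample HTML are assumptions, not part of the original script; only the regular expression is taken from it.

```python
import re
from urllib import error, request

# Same five-column row pattern as the script above.
ROW_PATTERN = re.compile(
    r"<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>",
    re.S,
)

def fetch_page(url, headers, timeout=10):
    """Return the decoded page, or None if the request fails with a URLError."""
    req = request.Request(url=url, headers=headers)
    try:
        with request.urlopen(req, timeout=timeout) as response:
            # Replace undecodable bytes instead of raising UnicodeDecodeError.
            return response.read().decode("gbk", errors="replace")
    except error.URLError as exc:  # HTTPError is a subclass of URLError
        print("failed to fetch {}: {}".format(url, exc))
        return None

def parse_rows(html):
    """Return all matched table rows except the first one (the header row)."""
    return ROW_PATTERN.findall(html)[1:]

# Made-up table standing in for a downloaded page, so no request is sent.
sample_html = (
    "<table>"
    "<tr><td>ip</td><td>port</td><td>location</td><td>type</td><td>checked</td></tr>"
    "<tr><td>10.0.0.1</td><td>8080</td><td>Beijing</td><td>anonymous</td><td>2018-03-29</td></tr>"
    "</table>"
)
rows = parse_rows(sample_html)
print(rows)
```

With this split, one failed page can be skipped when `fetch_page` returns None, rather than an uncaught exception stopping the whole loop.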