1. Simple version
For a site with no anti-scraping measures:
import urllib.request
import re
url="http://ohhappyday.com/" # 1.我们要爬取图片的地址
page = urllib.request.urlopen(url) # 2. 打开网址 print(page)
html = page.read().decode("utf-8") # 3. 获取html源码
imglist = re.findall('img src="(http.*?)"',html) # 4. 在html中匹配出符合条件的字符串
x=0
for imgurl in imglist: # 遍历图片地址列表
urllib.request.urlretrieve(imgurl,'./picture/pic%s.jpg' %x) # 第四行 获取图片并保存
x=x+1
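Note that urlretrieve will fail if the ./picture folder does not exist yet; a minimal sketch to create it before the loop runs (assuming the relative path used above):

import os
os.makedirs('./picture', exist_ok=True)   # create the output folder if missing; no error if it already exists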
2nd version
For a site with anti-scraping measures (a User-Agent header has to be sent):
import urllib.request
import re
Head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
url="https://www.microsoft.com/en-us/research/people/shihan/" # 我们要爬取图片的地址
page1=urllib.request.Request(url,headers = Head)
page=urllib.request.urlopen(page1)
html = page.read().decode("utf-8")
imglist = re.findall('content="(.*?\.png)"',html) # 第三行 在html中匹配出符合条件的字符串
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
x=0
for imgurl in imglist: # 遍历图片地址列表
urllib.request.urlretrieve(imgurl,'./picture/picture%s.jpg' %x) # 第四行 获取图片并保存
    x = x + 1
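install_opener changes global state for every later urlopen/urlretrieve call. A minimal alternative sketch that sends the same headers per request and writes the bytes itself (fetch_image is a hypothetical helper, not part of the original script):

import urllib.request

def fetch_image(imgurl, filename, headers):
    # Hypothetical helper: attach the headers to the request itself instead of installing a global opener.
    req = urllib.request.Request(imgurl, headers=headers)
    with urllib.request.urlopen(req) as resp, open(filename, 'wb') as f:
        f.write(resp.read())

# usage inside the loop above:
# fetch_image(imgurl, './picture/picture%s.jpg' % x, Head)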
3rd version
The scraped URLs are incomplete: they are scheme-relative and do not start with http, so the scheme has to be added before downloading.
import urllib.request
import re
Head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
url="https://www.microsoft.com/en-us/research/group/software-analytics/" # 我们要爬取图片的地址
page1=urllib.request.Request(url,headers = Head)
page=urllib.request.urlopen(page1)
html = page.read().decode("utf-8")
imglist3 = re.findall('src=\'(//[^\s]*?.jpg)\'',html) # 第三行 在html中匹配出符合条件的字符串
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
x=0
for imgurl in imglist3: # 遍历图片地址列表
urllib.request.urlretrieve("https:"+imgurl,'./picture/3picture%s.jpg' %x) # 第四行 获取图片并保存
    x = x + 1
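Instead of hard-coding the "https:" prefix, urllib.parse.urljoin can resolve scheme-relative URLs against the page URL; a minimal sketch (the image path below is hypothetical):

from urllib.parse import urljoin

page_url = "https://www.microsoft.com/en-us/research/group/software-analytics/"
img_src = "//www.microsoft.com/en-us/research/uploads/photo.jpg"   # hypothetical scheme-relative src value
print(urljoin(page_url, img_src))   # -> https://www.microsoft.com/en-us/research/uploads/photo.jpg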
4th version
The regular expression above was not well designed; capture the alt text together with the URL so the file can be saved under the person's name.
imglist41 = re.findall(r"img alt='([^\d]*?)' src='(//[^\s]*?\.jpg)' srcset=", html)   # list of (name, URL) pairs
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
x=0
for imgurl in imglist41:                            # iterate over the list of (name, URL) pairs
    urllib.request.urlretrieve("https:" + imgurl[1], './picture/%s.jpg' % imgurl[0])   # save the image under the name from the alt text
    x = x + 1
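Hand-written regexes like the one above break as soon as the attribute order or quoting changes. A minimal sketch of the same alt/src extraction with the standard-library html.parser (ImgCollector is a hypothetical class; html is the decoded page source read earlier):

from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    # Collect (alt, src) pairs from <img> tags.
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            attrs = dict(attrs)
            if 'src' in attrs:
                self.images.append((attrs.get('alt', ''), attrs['src']))

parser = ImgCollector()
parser.feed(html)       # html is the page source from the versions above
print(parser.images)    # list of (alt text, image URL) tuples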
5th version
Scrape a specific person's picture.
import urllib.request
import re
Head = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
url="https://www.microsoft.com/en-us/research/group/software-analytics/" # 我们要爬取图片的地址
page1=urllib.request.Request(url,headers = Head)
page=urllib.request.urlopen(page1)
html = page.read().decode("utf-8")
imglist51 = re.findall(r"img alt='([^\d]* Shi Han)' src='(//[^\s]*?\.png)' srcset=", html)   # .* matches any character except a newline; ? makes it non-greedy
imglist61 = imglist51 + re.findall(r"img alt='([^\d]*?)' src='(//[^\s]*?\.jpg)' srcset=", html)
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
x=0
for imgurl in imglist61:                            # iterate over the list of (name, URL) pairs
    print(x)
    urllib.request.urlretrieve("https:" + imgurl[1], './picture/%s.jpg' % imgurl[0])   # save the image under the name from the alt text
    x = x + 1
    if x == 3:
        break
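If only one person's photo is wanted, it may be simpler to keep a single general regex and filter the matches in Python rather than encoding the name into the pattern; a minimal sketch assuming the imglist61 (name, URL) pairs from above:

target = 'Shi Han'   # assumed name to keep
wanted = [(name, src) for name, src in imglist61 if target in name]
for name, src in wanted:
    urllib.request.urlretrieve("https:" + src, './picture/%s.jpg' % name)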