Overview: the three examples below give a quick, hands-on introduction to the Scrapy framework and its basic usage.
Difficulty: Example 1 << Example 2 << Example 3
[Examples 2 and 3 involve pagination]
[Example 3 involves saving images (the image download pipeline, ImagesPipeline)]
Example 1: Use Scrapy to crawl the teachers' names, titles, and profiles from http://www.itcast.cn/channel/teacher.shtml
1. Basic operations in cmd
C:\Users\Administrator\Desktop>scrapy startproject teacherinfo
C:\Users\Administrator\Desktop>cd teacherinfo
C:\Users\Administrator\Desktop\teacherinfo>scrapy genspider demo itcast.cn
(Note: scrapy genspider expects a bare domain such as itcast.cn, not a full URL.)
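For orientation, startproject and genspider create the standard Scrapy project layout, roughly:

teacherinfo/
    scrapy.cfg            # project configuration entry point
    teacherinfo/
        __init__.py
        items.py          # item definitions (step 2)
        middlewares.py
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 5)
        spiders/
            demo.py       # the spider generated by genspider (step 3)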
2. Edit items.py to define the fields to be saved
from scrapy import Item, Field

class TeacherinfoItem(Item):
    name = Field()   # teacher's name
    tag = Field()    # title/tag
    info = Field()   # profile text
3. Write the spider code in demo.py
import scrapy
from teacherinfo.items import TeacherinfoItem   # import the custom item

class DemoSpider(scrapy.Spider):
    name = 'demo'                     # spider name
    allowed_domains = ['itcast.cn']   # domains the spider is allowed to crawl (bare domain, not a URL)
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']  # initial URL(s)
    # start_urls are fetched only once, when the spider starts, no matter how many URLs the list contains

    def parse(self, response):
        # response is the downloaded page; response.body holds the returned content
        divs = response.xpath('//div[@class="li_txt"]')
        for div in divs:
            item = TeacherinfoItem()   # instantiate an item to hold the data
            item['name'] = div.xpath('./h3/text()').extract()[0]
            item['tag'] = div.xpath('./h4/text()').extract()[0]
            item['info'] = div.xpath('./p/text()').extract()[0]
            yield item
            # yield hands each scraped item to the pipelines for processing;
            # a plain return would end parse() without the item going through the pipeline.
            # [In real crawls, always yield the scraped items so the pipeline can handle them]
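As a side note, start_urls is only shorthand: Scrapy's default start_requests() turns each listed URL into one Request when the spider starts, which is why the list is fetched exactly once. A minimal sketch of writing the equivalent explicitly (the method and signature are standard Scrapy; nothing here is from the original example):

    # inside DemoSpider: explicit equivalent of the start_urls list above
    def start_requests(self):
        # called once at startup; each yielded Request is scheduled with self.parse as its callback
        yield scrapy.Request('http://www.itcast.cn/channel/teacher.shtml', callback=self.parse)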
4. Edit the pipeline file pipelines.py to process the items
import json

class TeacherinfoPipeline(object):      # this pipeline saves the scraped data
    def __init__(self):                 # optional: initialize the pipeline
        self.filename = open('teacherinfo.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):   # this method is required!
        jsontext = json.dumps(dict(item), ensure_ascii=False)
        self.filename.write(jsontext + '\n')
        return item

    def close_spider(self, spider):     # optional: called when the spider closes
        self.filename.close()
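As an aside, process_item is also the place to reject bad records. A minimal sketch (not part of the original example; the pipeline name and the rule are invented for illustration) using Scrapy's DropItem exception, which would also need its own entry in ITEM_PIPELINES:

from scrapy.exceptions import DropItem

class ValidateTeacherPipeline(object):
    # hypothetical extra pipeline: discard any record whose name field is empty
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('missing name in %r' % item)
        return item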
5. Enable the pipeline in settings.py
ITEM_PIPELINES = {
    'teacherinfo.pipelines.TeacherinfoPipeline': 300,   # priority 0-1000; lower numbers run first
}
6. Run the spider
C:\Users\Administrator\Desktop\teacherinfo>scrapy crawl demo
The crawl completes and a teacherinfo.json file is generated.
PS: the pipeline in Example 1 is actually optional; running scrapy crawl demo -o teacherinfo.json achieves the same result via Scrapy's built-in feed export.
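If you would rather configure the export once instead of passing -o on every run, a rough equivalent can go in settings.py (a sketch; the FEEDS setting exists in Scrapy 2.1+, so check your version):

# settings.py -- rough equivalent of `scrapy crawl demo -o teacherinfo.json`
FEEDS = {
    'teacherinfo.json': {
        'format': 'jsonlines',   # one JSON object per line, matching the pipeline above
        'encoding': 'utf-8',
    },
}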
Example 2: Use Scrapy to crawl the job postings from https://careers.tencent.com/search.html
Analysis: the data is actually served from these URLs:
https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=1&pageSize=10&language=zh-cn&area=cn
https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=2&pageSize=10&language=zh-cn&area=cn
https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=3&pageSize=10&language=zh-cn&area=cn
......
https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex=483&pageSize=10&language=zh-cn&area=cn
Pattern: 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10&language=zh-cn&area=cn'
Key point: this example involves pagination!
1. Operations in cmd
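Before writing the spider, it can help to confirm the JSON shape of one page outside Scrapy. A minimal sketch using requests (it assumes the API still returns the Data -> Posts structure that the spider below relies on):

import requests

# fetch page 1 of the job API and print the post titles
url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?pageIndex=1&pageSize=10&language=zh-cn&area=cn')
data = requests.get(url).json()
for post in data['Data']['Posts']:
    print(post['RecruitPostName'])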
C:\Users\Administrator\Desktop>scrapy startproject tencent
C:\Users\Administrator\Desktop>cd tencent
C:\Users\Administrator\Desktop\tencent>scrapy genspider demo careers.tencent.com
2. Edit items.py to define the fields to be saved
from scrapy import Item, Field

class TencentItem(Item):
    positionname = Field()             # job title
    positionurl = Field()              # job posting URL
    positiontype = Field()             # job category
    positionresponsibility = Field()   # responsibilities
    positionlocation = Field()         # location
    positiontime = Field()             # last update time
3. Write the spider code in demo.py
import scrapy
import json
from tencent.items import TencentItem

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['careers.tencent.com']
    # the base URL is combined with pageIndex to build start_urls
    url = 'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    pageIndex = 1
    start_urls = [url.format(pageIndex)]

    def parse(self, response):
        data = json.loads(response.text)   # response.text is a str containing JSON
        posts = data['Data']['Posts']
        for post in posts:
            item = TencentItem()
            item['positionname'] = post['RecruitPostName']
            item['positionurl'] = 'https://careers.tencent.com/jobdesc.html?postId=' + post['PostId']
            item['positiontype'] = post['CategoryName']
            item['positionresponsibility'] = post['Responsibility'].replace('\n', '')
            item['positionlocation'] = post['CountryName'] + '-' + post['LocationName']
            item['positiontime'] = post['LastUpdateTime']
            yield item   # hand each job item to the pipeline for processing
        if self.pageIndex <= 483:
            self.pageIndex += 1
            # increment pageIndex, build the next page's URL, and let self.parse handle its response
            yield scrapy.Request(url=self.url.format(self.pageIndex), callback=self.parse)
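The hard-coded 483 matches the page count at the time of writing and will drift as postings change. One alternative (a sketch, not from the original) is to stop as soon as a page comes back empty:

    # alternative parse(): stop when the API stops returning posts instead of counting to 483
    def parse(self, response):
        data = json.loads(response.text)
        posts = (data.get('Data') or {}).get('Posts') or []
        if not posts:          # an empty or missing page means there is nothing left to crawl
            return
        for post in posts:
            ...                # build and yield TencentItem exactly as above
        self.pageIndex += 1
        yield scrapy.Request(url=self.url.format(self.pageIndex), callback=self.parse)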
4. Edit the pipeline file pipelines.py to process the items
import json

class TencentPipeline(object):
    def __init__(self):
        self.filename = open('tencent.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False)
        self.filename.write(jsontext + '\n')
        return item

    def close_spider(self, spider):
        self.filename.close()
5. Enable the pipeline in settings.py
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
6. Run the spider
C:\Users\Administrator\Desktop\tencent>scrapy crawl demo
The crawl completes and a tencent.json file is generated.
PS: as in Example 1, the pipeline here is optional; running scrapy crawl demo -o tencent.json achieves the same result.
Example 3: Use Scrapy to crawl the "ikon" images from Sogou Images (pic.sogou.com)
Analysis: the data is actually served from these URLs:
https://pic.sogou.com/pics?query=ikon&mode=1&start=48&reqType=ajax&reqFrom=result&tn=0
https://pic.sogou.com/pics?query=ikon&mode=1&start=96&reqType=ajax&reqFrom=result&tn=0
https://pic.sogou.com/pics?query=ikon&mode=1&start=144&reqType=ajax&reqFrom=result&tn=0
......
Pattern: 'https://pic.sogou.com/pics?query=ikon&mode=1&start={}&reqType=ajax&reqFrom=result&tn=0'
Key point 1: this example involves pagination!
Key point 2: this example involves saving images (ImagesPipeline)!
1. Operations in cmd
C:\Users\Administrator\Desktop>scrapy startproject sogouimage
C:\Users\Administrator\Desktop>cd sogouimage
C:\Users\Administrator\Desktop\sogouimage>scrapy genspider demo pic.sogou.com
2. Edit items.py to define the fields to be saved
from scrapy import Field, Item

class SogouimageItem(Item):
    title = Field()       # image name
    picurl = Field()      # image URL
    imagepath = Field()   # local path the image is saved to (set by the pipeline)
3. Write the spider code in demo.py
import scrapy
import json
from sogouimage.items import SogouimageItem

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['pic.sogou.com']
    # the base URL is combined with start to build start_urls
    url = 'https://pic.sogou.com/pics?query=ikon&mode=1&start={}&reqType=ajax&reqFrom=result&tn=0'
    start = 48
    start_urls = [url.format(start)]

    def parse(self, response):
        data = json.loads(response.text)['items']   # response.text is a str containing JSON
        for info in data:
            item = SogouimageItem()
            item['title'] = info['title']
            item['picurl'] = info['pic_url']
            yield item
        self.start += 48
        # increment start by 48, build the next page's URL, and let self.parse handle its response
        # (note: as written there is no stop condition, so the spider keeps requesting new pages)
        yield scrapy.Request(url=self.url.format(self.start), callback=self.parse)
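Because the spider above never stops paging, one way to bound it is sketched below (MAX_START is an assumed limit, not part of the original example):

    # inside DemoSpider: cap pagination so the spider eventually finishes
    MAX_START = 480   # assumed upper bound, chosen for illustration

    def parse(self, response):
        items = json.loads(response.text).get('items') or []
        for info in items:
            ...            # build and yield SogouimageItem exactly as above
        self.start += 48
        if items and self.start <= self.MAX_START:
            yield scrapy.Request(url=self.url.format(self.start), callback=self.parse)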
4. Edit pipelines.py to download the images (a custom pipeline built on ImagesPipeline)
import os
import scrapy
from scrapy.utils.project import get_project_settings
from scrapy.pipelines.images import ImagesPipeline

class MyImagePipeline(ImagesPipeline):   # custom image download pipeline, created by subclassing ImagesPipeline
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')   # read IMAGES_STORE from settings.py

    def get_media_requests(self, item, info):
        picurl = item['picurl']
        yield scrapy.Request(picurl)   # request the image itself

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info_or_failure) tuples for the requests above
        imagepath = [x['path'] for ok, x in results if ok]
        # rename the downloaded file (stored under a hash-based name by default) to the image title
        os.rename(self.IMAGES_STORE + '/' + imagepath[0],
                  self.IMAGES_STORE + '/' + item['title'] + '.jpg')
        item['imagepath'] = self.IMAGES_STORE + '/' + item['title'] + '.jpg'   # local path of the downloaded image
        return item   # required so the item is passed on after the downloads complete
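A tidier alternative to renaming the file afterwards is to name it when it is saved, by overriding ImagesPipeline.file_path. A minimal sketch (Scrapy 2.4+ passes the item keyword argument, so verify your version; titles would also need to be unique and filesystem-safe):

    # inside MyImagePipeline: save each image directly under its title, no os.rename needed
    def file_path(self, request, response=None, info=None, *, item=None):
        return item['title'] + '.jpg'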
5. Enable the custom MyImagePipeline in settings.py and define IMAGES_STORE (note that ImagesPipeline requires the Pillow library to be installed)
ITEM_PIPELINES = {
    # 'sogouimage.pipelines.SogouimagePipeline': 300,
    'sogouimage.pipelines.MyImagePipeline': 300,
}
IMAGES_STORE = r"./Images"
6. Run the spider
C:\Users\Administrator\Desktop\sogouimage>scrapy crawl demo
The crawl completes and the images are downloaded into sogouimage/Images.