1. Create a Scrapy project
1.1 In the PyCharm terminal window, enter the following command to create a Scrapy project named essays
scrapy startproject essays
1.2 In the terminal, enter the following command to switch into the newly created project directory
cd essays
1.3 Create a spider file named essay, with the initial URL www.xxx.com, in the spiders subdirectory by entering the following command in the terminal
scrapy genspider essay www.xxx.com
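For reference, the generated spiders/essay.py skeleton looks roughly like the sketch below (the exact template varies slightly between Scrapy versions); its allowed_domains and start_urls will be replaced with the real search URL in step 3.2:

import scrapy


class EssaySpider(scrapy.Spider):
    name = "essay"
    allowed_domains = ["www.xxx.com"]
    start_urls = ["https://www.xxx.com"]

    def parse(self, response):
        pass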
2. Modify the configuration file
Double-click to open the settings.py file in the essays project directory.
2.1 Uncomment USER_AGENT and set it to a real browser User-Agent string
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Core/1.94.197.400 QQBrowser/11.6.5265.400'
2.2 Change ROBOTSTXT_OBEY from True to False; otherwise no data can be crawled
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
2.3 Add a new line of code so that the full log output is suppressed and only error messages are printed when something goes wrong
LOG_LEVEL = 'ERROR'
2.4 Uncomment ITEM_PIPELINES
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"essays.pipelines.EssaysPipeline": 300,
}
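The number 300 is the pipeline's priority: if several pipelines are registered, those with lower values run first. A minimal sketch, assuming a hypothetical second pipeline class named JsonPipeline in the same module:

ITEM_PIPELINES = {
    "essays.pipelines.EssaysPipeline": 300,   # lower value, runs first
    # "essays.pipelines.JsonPipeline": 400,   # hypothetical, would run after EssaysPipeline
}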
3. Code
3.1 items.py is as follows; it defines the fields for the item
import scrapy


class EssaysItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    content = scrapy.Field()
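A quick sketch of how the item behaves (not part of the project files): an EssaysItem works like a dict restricted to the declared fields, and assigning an undeclared key raises a KeyError.

from essays.items import EssaysItem

item = EssaysItem()
item['name'] = 'some title'        # declared field, OK
item['content'] = 'some text'      # declared field, OK
# item['author'] = 'x'             # would raise KeyError: field not declared
print(dict(item))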
3.2 essay.py is as follows; it parses the data collected by the site-wide deep crawl, wraps it into items, and submits them to the pipeline
import scrapy
from essays.items import EssaysItem


class EssaySpider(scrapy.Spider):
    name = "essay"
    # allowed_domains = ["www.5000yuan.com"]
    start_urls = ["https://5000yan.com/plus/search.php?q=%E4%BC%A0%E7%BB%9F&type="]
    page_num = 2
    url = 'https://5000yan.com/plus/search.php?keyword=%%E4%%BC%%A0%%E7%%BB%%9F&searchtype=titlekeyword&channeltype=0&orderby=&kwtype=0&pagesize=10&typeid=0&TotalResult=588&PageNo=%d'

    def parse_detail(self, response):
        # Retrieve the item passed in through the request's meta dict
        item = response.meta['item']
        # Detail-page container: /html/body/div[2]/div[1]/main
        article_list = response.xpath('/html/body/div[2]/div[1]/main')
        for article in article_list:
            content = article.xpath('./article/h2/a/text() | ./div//text() | '
                                    './article/div/a/text() | ./div/div/a/text()').extract()
            content = ''.join(content)
            item['content'] = content
            yield item

    def parse(self, response):
        article_list = response.xpath('/html/body/div[2]/div[1]/main/div[1]/div/section/article')
        for article in article_list:
            name = article.xpath('./h2/a/text()').extract_first()
            detail_url = article.xpath('./h2/a/@href').extract_first()
            # Instantiate an item object
            item = EssaysItem()
            item['name'] = name
            # Manually send a request for the detail page to get its page source
            # Request passing: meta={} hands this dict to the request's callback
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
        # Pagination: request the next results page
        if self.page_num <= 59:
            new_url = self.url % self.page_num
            self.page_num += 1
            yield scrapy.Request(new_url, callback=self.parse)
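Besides meta, Scrapy 1.7+ also offers cb_kwargs for passing data to a callback; a minimal sketch of the same hand-off (an alternative to the meta approach used above, not what the code above does):

# in parse():
yield scrapy.Request(detail_url, callback=self.parse_detail, cb_kwargs={'item': item})

# the callback then receives the item as a keyword argument:
def parse_detail(self, response, item):
    item['content'] = ...
    yield item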
3.3 pipelines.py is as follows; it persists the crawled data locally to the file essays.txt
class EssaysPipeline:
    fp = None

    # Override a parent-class hook: runs once before crawling starts, creating and opening the file
    def open_spider(self, spider):
        self.fp = open('./essays.txt', 'w', encoding='utf-8')
        print('Crawling files to the local disk...')

    def process_item(self, item, spider):
        self.fp.write(item['name'] + ':' + item['content'] + '\n')
        print(item['name'], 'crawled and saved locally!')
        return item

    # Runs once after crawling finishes, closing the file
    def close_spider(self, spider):
        self.fp.close()
        print('All files crawled successfully!')
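With the three files in place, run the spider from the project root in the terminal; optionally, Scrapy's built-in feed export can also write the items to a JSON file, independent of the pipeline above:

scrapy crawl essay
# optional: also export the items to a JSON file
scrapy crawl essay -o essays.json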