Generic Crawler: Scraping a Novel from a Website

Licensed under CC 4.0 BY-SA


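This is a generic crawler built on the Scrapy framework. It is fairly simple and scrapes just two fields: the chapter title and the chapter content.

The novel I scraped is 长宁帝军. I deleted the SQL file after taking a screenshot of it: the content is pirated, so I will not distribute it. Thanks to the author and the platform.

Here is the code.

The spider, cn.py: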

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from changning.items import ChangningItem


class CnSpider(CrawlSpider):
    name = 'cn'
    allowed_domains = ['www.biquge.info']
    # Change start_urls so that the crawl starts from the first chapter
    start_urls = ['https://www.biquge.info/39_39025/']

    # Modified rule that allows link following. In the URL, the first group of
    # digits after the domain identifies the novel; the second group identifies
    # the chapter page (chapter title and text)
    rules = (
        Rule(LinkExtractor(allow=r'https://www\.biquge\.info/39_39025/\d+\.html'),
             callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        self.logger.debug('Parsing %s', response.url)
        f_catalog = response.xpath("//div[@class='bookname']//h1").get()
        f_contents = response.xpath("//div[@id='content']").get()
        if f_catalog is None or f_contents is None:
            # Not a chapter page (or the layout changed): nothing to extract
            return
        # Strip the tags with regexes; skip this step if the HTML is meant
        # to be republished as a web page
        contents = re.sub(r'\n|<br>|\xa0|<div id="content">|</div>', '', f_contents)
        catalog = re.sub(r'<h1>|</h1>', '', f_catalog)
        item = ChangningItem(catalog=catalog, contents=contents)
        yield item
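
To see what the two regex substitutions in parse_detail actually do, here is a minimal standalone sketch that applies the same patterns to made-up HTML. The helper names and the sample strings are my own assumptions for illustration; they are not taken from the real site.

```python
import re

# Same patterns as in parse_detail
CONTENT_NOISE = re.compile(r'\n|<br>|\xa0|<div id="content">|</div>')
TITLE_TAGS = re.compile(r'<h1>|</h1>')


def clean_contents(raw_html: str) -> str:
    """Remove newlines, <br> tags, non-breaking spaces and the wrapping div."""
    return CONTENT_NOISE.sub('', raw_html)


def clean_title(raw_html: str) -> str:
    """Remove the <h1> wrapper around the chapter title."""
    return TITLE_TAGS.sub('', raw_html)


# Hypothetical samples shaped like the markup the XPath expressions select
sample_title = '<h1>Chapter 1 Sample</h1>'
sample_body = ('<div id="content">\xa0\xa0First paragraph.<br><br>\n'
               '\xa0\xa0Second paragraph.</div>')

print(clean_title(sample_title))
print(clean_contents(sample_body))
```

One thing this makes visible: because `<br>` is deleted rather than replaced, paragraph boundaries disappear and the text comes out as one run. If the paragraphs matter, replacing `<br>` with a newline instead would probably be a better choice.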

Nothing difficult here; the main thing is the link-following rule.
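
Since the rule is the important part, here is a rough sketch of which links it is meant to let through, using only the standard `re` module. It assumes chapter pages live at https://www.biquge.info/39_39025/<digits>.html (consistent with allowed_domains in the spider); the URLs in the list are hypothetical. As far as I know, LinkExtractor applies its `allow` patterns as a regex search against the absolute URL, which is what `is_chapter_url` imitates.

```python
import re

# The allow pattern, with dots escaped so "." only matches a literal dot
CHAPTER_URL = re.compile(r'https://www\.biquge\.info/39_39025/\d+\.html')


def is_chapter_url(url: str) -> bool:
    """Return True if the URL looks like a chapter page of this novel."""
    return CHAPTER_URL.search(url) is not None


# Hypothetical URLs for illustration
urls = [
    'https://www.biquge.info/39_39025/',             # index page of the novel
    'https://www.biquge.info/39_39025/123456.html',  # a chapter page
    'https://www.biquge.info/12_12345/123456.html',  # a different novel
]
for url in urls:
    print(url, is_chapter_url(url))

# An unescaped dot matches any character, so a loose pattern accepts more
loose = re.compile(r'/39_39025/\d+.html')
print(loose.search('/39_39025/123xhtml') is not None)
```

Only the chapter page matches, so only chapter pages reach the callback, and `follow=True` lets the crawler keep extracting links from each chapter it visits.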

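The other .py files do not contain much either: just writing the data to the database and some basic settings. This was practice code to begin with, so there is not much more to say. Below is a screenshot of the database: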
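
For readers who want a picture of the "write the data to the database" part, here is a minimal sketch of an item pipeline. It is not the original pipelines.py: it swaps in SQLite from the standard library so that it runs without a database server, and the class name, table name, and default file name are all assumptions. Only the two field names, catalog and contents, come from the spider.

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical item pipeline: stores each chapter in a SQLite table."""

    def __init__(self, db_path='changning.db'):
        # db_path is an assumed default; ':memory:' works for experiments
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, create the table
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS chapters ('
            'id INTEGER PRIMARY KEY AUTOINCREMENT, '
            'catalog TEXT NOT NULL, '
            'contents TEXT NOT NULL)'
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields; parameterized insert
        self.conn.execute(
            'INSERT INTO chapters (catalog, contents) VALUES (?, ?)',
            (item['catalog'], item['contents']),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()
```

To use something like this, the pipeline has to be enabled through the `ITEM_PIPELINES` setting in settings.py, for example with the key `changning.pipelines.SqlitePipeline` and a priority such as 300 (the module path is assumed from the project name in the spider's import).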

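More than 900 chapters. Please support the official release; pirated copies should only be used for technical practice and discussion. My internship search has not been going anywhere lately, so I know how hard it is to find work. Life is not easy.

By the way, I have been learning Django recently, and I will post some key points here as notes.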