Scrapy --- Proxies, Signals, and Custom Commands

This post describes how to configure proxy servers in a Scrapy project, showing two different ways to apply proxies dynamically. It also covers Scrapy signals and how to define custom Scrapy commands, which makes it easier to manage and run crawl jobs.

Adding a proxy

Method 1: set the proxy via environment variables (picked up by Scrapy's built-in HttpProxyMiddleware):

    import os

    import scrapy
    from scrapy.http import Request

    # The built-in HttpProxyMiddleware reads the *_PROXY environment variables
    # once, when the crawler starts, so set them at module import time (or
    # export them in the shell) rather than inside a callback.
    os.environ['HTTP_PROXY'] = "http://192.168.11.11"
    os.environ['HTTPS_PROXY'] = "http://192.168.11.11"


    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['https://dig.chouti.com/']

        def start_requests(self):
            for url in self.start_urls:
                yield Request(url=url, callback=self.parse)

        def parse(self, response):
            print(response)
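
If the proxy should be chosen per request rather than process-wide, it can also be passed through request.meta, which the same HttpProxyMiddleware honors. A minimal sketch (the spider name and the proxy address are placeholders, not part of the original project):

    import scrapy
    from scrapy.http import Request


    class MetaProxySpider(scrapy.Spider):
        name = 'chouti_meta'          # hypothetical spider name
        start_urls = ['https://dig.chouti.com/']

        def start_requests(self):
            for url in self.start_urls:
                # 'proxy' in meta applies to this request only and takes
                # precedence over the environment variables.
                yield Request(url=url, callback=self.parse,
                              meta={'proxy': 'http://192.168.11.11:80'})

        def parse(self, response):
            print(response)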

Method 2: a custom proxy downloader middleware

    import base64
    import random


    def to_bytes(text, encoding=None, errors='strict'):
        """Return the binary representation of `text`. If `text`
        is already a bytes object, return it as-is."""
        if isinstance(text, bytes):
            return text
        if not isinstance(text, str):
            raise TypeError('to_bytes must receive a str or bytes object, '
                            'got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)


    class MyProxyDownloaderMiddleware(object):
        def process_request(self, request, spider):
            proxy_list = [
                {'ip_port': '111.11.228.75:80', 'user_pass': 'xxx:123'},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            # Pick a proxy at random for every outgoing request.
            proxy = random.choice(proxy_list)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            if proxy['user_pass']:
                # The proxy requires authentication: send the credentials
                # as a Basic Proxy-Authorization header.
                encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass



Settings (settings.py):

    DOWNLOADER_MIDDLEWARES = {
        'xiaohan.middlewares.MyProxyDownloaderMiddleware': 543,
    }
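
A quick way to confirm that requests really go through the proxy is to fetch an IP-echo service and check the reported origin address. A minimal sketch, assuming the middleware above is enabled (the spider name and the use of https://httpbin.org/ip are illustrative, not part of the original project):

    import scrapy


    class ProxyCheckSpider(scrapy.Spider):
        name = 'proxy_check'          # hypothetical spider name
        start_urls = ['https://httpbin.org/ip']

        def parse(self, response):
            # httpbin echoes the origin IP of the request; with the proxy
            # middleware enabled this should be the proxy's address.
            print(response.text)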

HTTPS: requests with a client certificate

    from OpenSSL import crypto
    from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
    from twisted.internet.ssl import CertificateOptions


    class MySSLFactory(ScrapyClientContextFactory):
        def getCertificateOptions(self):
            # Load the client private key and certificate from PEM files.
            with open('/Users/wupeiqi/client.key.unsecure', mode='r') as f:
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, f.read())
            with open('/Users/wupeiqi/client.pem', mode='r') as f:
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, f.read())
            return CertificateOptions(
                privateKey=v1,    # PKey object
                certificate=v2,   # X509 object
                verify=False,
                method=getattr(self, 'method', getattr(self, '_ssl_method', None))
            )

Settings (settings.py):

    DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
    DOWNLOADER_CLIENTCONTEXTFACTORY = "xiaohan.middlewares.MySSLFactory"

Signals

extends.py (create it inside the project package, e.g. xiaohan/extends.py):

    from scrapy import signals


    class MyExtension(object):
        def __init__(self):
            pass

        @classmethod
        def from_crawler(cls, crawler):
            obj = cls()
            # When the spider is opened, call every handler connected to the
            # spider_opened signal (here: obj.spider_opened).
            crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
            # When the spider is closed, call every handler connected to the
            # spider_closed signal (here: obj.spider_closed).
            crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
            return obj

        def spider_opened(self, spider):
            print('open')

        def spider_closed(self, spider):
            print('close')

Settings (settings.py):

    EXTENSIONS = {
        'xiaohan.extends.MyExtension': 500,
    }
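
spider_opened and spider_closed are only two of the available signals. As a further illustration, here is a minimal sketch of an extension that counts scraped items through the item_scraped signal (the class name and counter are illustrative, not part of the original project):

    from scrapy import signals


    class ItemCountExtension(object):
        def __init__(self):
            self.count = 0

        @classmethod
        def from_crawler(cls, crawler):
            obj = cls()
            crawler.signals.connect(obj.item_scraped, signal=signals.item_scraped)
            crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
            return obj

        def item_scraped(self, item, response, spider):
            # Called once for every item that passes all pipeline stages.
            self.count += 1

        def spider_closed(self, spider):
            print('scraped %d items' % self.count)

It would be registered in EXTENSIONS the same way as MyExtension above.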

Custom commands

Create a directory (any name will do) at the same level as the spiders directory, e.g. commands.
Inside it, create a file named crawlall.py (the file name becomes the command name).

    from scrapy.commands import ScrapyCommand


    class Command(ScrapyCommand):

        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def run(self, args, opts):
            # Schedule every spider known to the project, then start the reactor.
            spider_list = self.crawler_process.spider_loader.list()
            for name in spider_list:
                self.crawler_process.crawl(name, **opts.__dict__)
            self.crawler_process.start()

In settings.py, add COMMANDS_MODULE = '<project name>.<directory name>' (see the example below).
Then run the new command from the project directory: scrapy crawlall
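
For the example project used throughout this post (project package xiaohan, directory commands), the setting looks like this:

    # settings.py
    COMMANDS_MODULE = 'xiaohan.commands'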
