When implementing a crawler with Scrapy, must each spider file correspond to exactly one pipeline file?

Pipelines are essentially code you write yourself, so whether one pipeline can serve multiple spiders depends entirely on how you write it.

To be clear: a single pipeline can serve multiple spiders. Scrapy itself ships pipelines for downloading files, and FilesPipeline and ImagesPipeline are good examples: http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/images.html#id2
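For instance, here is a minimal sketch (not code from the original answer) of a pipeline that any spider in the project can share. The spider name "jdPhoneSpider" matches the example further down; the class name and the 'content' field check are illustrative:

from scrapy.exceptions import DropItem


class RequiredFieldPipeline(object):
    """A validation pipeline usable by every spider that enables it."""

    def process_item(self, item, spider):
        # process_item() receives the spider instance, so a shared pipeline
        # can branch on spider.name when behaviour must differ per spider.
        if spider.name == "jdPhoneSpider" and not item.get("content"):
            raise DropItem("empty 'content' in item from %s" % spider.name)
        return item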

Conversely, one spider can also use multiple pipelines.

You enable them by adding entries to the ITEM_PIPELINES dict in settings, as sketched below. The official docs cover the details, so I won't repeat them: http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/item-pipeline.html#id4
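As a quick sketch, enabling several pipelines for one project looks like this in settings.py (the dotted paths are hypothetical; the integers, 0 to 1000, set the order in which pipelines run, lower first):

ITEM_PIPELINES = {
    'jdcom.pipelines.RequiredFieldPipeline': 300,
    'jdcom.pipelines.JdPhoneMysqlPipeline': 800,
}

Every enabled pipeline's process_item() is then called for each item in that order, regardless of which spider produced the item.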

You generate the Request exactly as you would for an ordinary web page; once the Request is submitted, Scrapy downloads the page and produces a Response, and you then just parse response.body as JSON to extract the data. Example code below (crawling JD.com; parse_phone_price and parse_comments extract their data from JSON; some code is omitted):

# -*- coding: utf-8 -*-

from datetime import datetime
import json
import logging

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from jdcom.items import JdPhoneCommentItem, JdPhoneItem

logger = logging.getLogger(__name__)


class JdPhoneSpider(CrawlSpider):
    name = "jdPhoneSpider"
    start_urls = ["http://list.jd.com/list.html?cat=9987,653,655"]

    rules = (
        Rule(
            LinkExtractor(allow=r"list\.html\?cat\=9987,653,655\&page\=\d+\&trans\=1\&JL\=6_0_0"),
            callback="parse_phone_url",
            follow=True,
        ),
    )

    def parse_phone_url(self, response):
        # Each list page links to detail pages like //item.jd.com/<skuId>.html;
        # slice off the leading "//item.jd.com/" (14 chars) and trailing ".html".
        hrefs = response.xpath("//div[@id='plist']/ul/li/div/div[@class='p-name']/a/@href").extract()
        for href in hrefs:
            phoneID = href[14:-5]
            commentsUrl = "http://sclub.jd.com/productpage/p-%s-s-0-t-3-p-0.html" % phoneID
            yield Request(commentsUrl, callback=self.parse_comments)

    def parse_phone_price(self, response):
        phoneID = response.meta['phoneID']
        meta = response.meta
        # The price endpoint returns a GBK-encoded JSON array.
        priceStr = response.body.decode("gbk", "ignore")
        priceJson = json.loads(priceStr)
        meta['price'] = float(priceJson[0]["p"])
        phoneUrl = "http://item.jd.com/%s.html" % phoneID
        yield Request(phoneUrl, callback=self.parse_phone_info, meta=meta)

    def parse_phone_info(self, response):
        # Omitted in the original answer.
        pass

    def parse_comments(self, response):
        # The comments endpoint also returns GBK-encoded JSON.
        commentsStr = response.body.decode("gbk", "ignore")
        commentsJson = json.loads(commentsStr)

        # Create the item inside the loop and yield once per comment;
        # reusing a single item would emit only the last comment's data.
        for comment in commentsJson['comments']:
            commentsItem = JdPhoneCommentItem()
            commentsItem['commentId'] = comment['id']
            commentsItem['guid'] = comment['guid']
            commentsItem['content'] = comment['content']
            commentsItem['referenceId'] = comment['referenceId']
            # referenceTime looks like "2016-09-19 13:52:49".
            commentsItem['referenceTime'] = datetime.strptime(comment['referenceTime'], "%Y-%m-%d %H:%M:%S")
            commentsItem['referenceName'] = comment['referenceName']
            commentsItem['userProvince'] = comment['userProvince']
            commentsItem['userRegisterTime'] = comment.get('userRegisterTime')
            commentsItem['nickname'] = comment['nickname']
            commentsItem['userLevelName'] = comment['userLevelName']
            commentsItem['userClientShow'] = comment['userClientShow']
            commentsItem['productColor'] = comment['productColor']
            commentsItem['productSize'] = comment.get("productSize")
            commentsItem['afterDays'] = int(comment['days'])
            # Join every attached image URL instead of keeping only the last one.
            images = comment.get("images")
            commentsItem['imagesUrl'] = ";".join(image["imgUrl"] for image in images) if images else ""
            yield commentsItem

        summary = commentsJson["productCommentSummary"]
        commentCount = summary["commentCount"]
        phoneID = summary["productId"]

        priceUrl = "http://p.3.cn/prices/mgets?skuIds=J_%s" % phoneID
        meta = {
            "phoneID": phoneID,
            "commentCount": commentCount,
            "goodCommentsCount": summary["goodCount"],
            "goodCommentsRate": summary["goodRate"],
            "generalCommentsCount": summary["generalCount"],
            "generalCommentsRate": summary["generalRate"],
            "poorCommentsCount": summary["poorCount"],
            "poorCommentsRate": summary["poorRate"],
        }
        yield Request(priceUrl, callback=self.parse_phone_price, meta=meta)

        # Ten comments per page; integer division keeps range() happy on Python 3.
        # Scrapy's default dupefilter drops comment pages already requested.
        pageNum = commentCount // 10 + 1
        for i in range(pageNum):
            commentsUrl = "http://sclub.jd.com/productpage/p-%s-s-0-t-3-p-%d.html" % (phoneID, i)
            yield Request(commentsUrl, callback=self.parse_comments)
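For completeness: the original answer does not show jdcom/items.py, but reconstructing JdPhoneCommentItem from the field assignments above would look roughly like this:

import scrapy


class JdPhoneCommentItem(scrapy.Item):
    commentId = scrapy.Field()
    guid = scrapy.Field()
    content = scrapy.Field()
    referenceId = scrapy.Field()
    referenceTime = scrapy.Field()
    referenceName = scrapy.Field()
    userProvince = scrapy.Field()
    userRegisterTime = scrapy.Field()
    nickname = scrapy.Field()
    userLevelName = scrapy.Field()
    userClientShow = scrapy.Field()
    productColor = scrapy.Field()
    productSize = scrapy.Field()
    afterDays = scrapy.Field()
    imagesUrl = scrapy.Field()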


Source: 内存溢出 (outofmemory.cn), original post: https://outofmemory.cn/tougao/11762370.html
