How to Avoid Duplicate Scraping with a Custom Scrapy Middleware in Python


This article demonstrates, by example, how to avoid scraping the same item pages twice by writing a custom Scrapy spider middleware in Python. It is shared here for your reference; the details are as follows:

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were
    already visited before.

    Requests to be filtered must have the meta['filter_visited']
    flag enabled, and may optionally define an id to use for
    identifying them, which defaults to the request fingerprint;
    you would want to use the item id, if you already have it
    beforehand, to make this more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        # The set of already-visited ids is kept on the spider itself.
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                # Filter out requests whose id has already been seen.
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                # Record the id of the request that produced this item.
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # Emit a placeholder item instead of re-crawling the page.
                ret.append(MyItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        # Prefer an explicit id from meta; fall back to the fingerprint.
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
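To put the middleware to work, register it in the project's settings.py and mark the requests you want filtered via their meta dict. The sketch below is a minimal, hypothetical setup: the module path myproject.middlewares, the spider name, URLs, and CSS selector are all assumptions; only the filter_visited and visited_id meta keys, and the spider-level context dict the middleware stores its state in, come from the code above.

# settings.py -- register the spider middleware (the module path is an
# assumption; point it at wherever IgnoreVisitedItems lives in your project)
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreVisitedItems': 543,
}

# A hypothetical spider that marks item-page requests for filtering.
from scrapy import Spider
from scrapy.http import Request

class ItemSpider(Spider):
    name = 'items'
    start_urls = ['http://example.com/list']
    # The middleware reads and writes spider.context; define it here so
    # the set of visited ids actually persists between callbacks.
    context = {}

    def parse(self, response):
        for href in response.css('a.item::attr(href)').extract():
            url = response.urljoin(href)
            # 'filter_visited' switches filtering on for this request;
            # 'visited_id' optionally replaces the fingerprint default.
            yield Request(url, callback=self.parse_item,
                          meta={'filter_visited': True, 'visited_id': url})

    def parse_item(self, response):
        pass  # extract and yield MyItem here

Note that already-visited pages are not silently dropped: the middleware emits a MyItem with visit_status='old' in place of the duplicate request, so a downstream pipeline can still update the stored record. Also bear in mind that the snippet targets an older Scrapy release; scrapy.log and BaseItem have since been deprecated in favor of the standard logging module and scrapy.Item, but the pattern itself carries over unchanged.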

Hopefully this article is of some help to readers working on Python program design.
