Python – running multiple spiders in a for loop


Overview: I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me an error: ReactorNotRestartable.

feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }
}
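As a quick sanity check of this structure (plain Python, no Scrapy needed), here is a sketch of the lookup the spider performs with its `name`:

```python
feeds = {
    'nasa': {'name': 'nasa',
             'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
             'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']},
    'xkcd': {'name': 'xkcd',
             'url': 'http://xkcd.com/rss.xml',
             'start_urls': ['http://xkcd.com/rss.xml']},
}

# Each entry's start_urls is just its feed URL wrapped in a list,
# which is what the spider's __init__ pulls out via this_feed.get('start_urls').
for name in feeds:
    this_feed = feeds[name]
    assert this_feed.get('start_urls') == [this_feed['url']]
```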

With the dict above, I try to run the two spiders in a loop, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = None

    def __init__(self, **kwargs):
        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass

def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': config['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
    })
    for feed_name in feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()

The exception on the second loop iteration looks like this; the spider opens, but then:

...
2015-11-22 00:00:00 [scrapy] INFO: Spider opened
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-feeds-crawler/feed_crawler/crawl.py", line 51, in start_crawler
    process.start()  # the script will block here until the crawling is finished
  File "/Users/bling/py-feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Do I have to invalidate the first MySpider somehow, or am I doing something wrong and need to change how this works? Thanks in advance.

Solution: It looks like you have to instantiate one process per spider; try:

def start_crawler():
    for feed_name in feeds.keys():
        process = CrawlerProcess({
            'USER_AGENT': config['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None}  # boto issues
        })
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start()
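For context on why the error appears at all: Twisted's reactor can be started only once per process, so the second `process.start()` in the original loop is what raises ReactorNotRestartable. The sketch below is a pure-Python toy stand-in (not Scrapy or Twisted itself) that models the one-shot restriction and the usual ordering fix of scheduling every crawl before a single start:

```python
# Toy stand-in for Twisted's one-shot reactor: start() may run only once.
class ReactorNotRestartable(Exception):
    pass

class ToyReactor:
    def __init__(self):
        self._has_run = False
        self._queue = []          # crawls scheduled before start()

    def crawl(self, name):
        self._queue.append(name)  # schedule work; does not run it yet

    def start(self):
        if self._has_run:
            raise ReactorNotRestartable()
        self._has_run = True
        return list(self._queue)  # "run" everything scheduled so far

# Calling start() inside the loop fails on the second iteration,
# mirroring the traceback in the question:
reactor = ToyReactor()
ran, failed = [], False
for name in ('nasa', 'xkcd'):
    try:
        reactor.crawl(name)
        ran += reactor.start()
    except ReactorNotRestartable:
        failed = True
assert ran == ['nasa'] and failed

# Scheduling every crawl first and starting once avoids the error:
reactor = ToyReactor()
for name in ('nasa', 'xkcd'):
    reactor.crawl(name)
assert reactor.start() == ['nasa', 'xkcd']
```

With real Scrapy the same ordering applies: call `process.crawl(...)` once per spider inside the loop and call `process.start()` a single time after the loop.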
Source: https://outofmemory.cn/langs/1196022.html