I'm using a scraper that works through Tor, a simplified version of which is in this example project: https://github.com/khpeek/scraper-compose. The project has the following (simplified) structure:
.
├── docker-compose.yml
├── privoxy
│   ├── config
│   └── Dockerfile
├── scraper
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── tutorial
│   │   ├── scrapy.cfg
│   │   └── tutorial
│   │       ├── extensions.py
│   │       ├── __init__.py
│   │       ├── items.py
│   │       ├── middlewares.py
│   │       ├── pipelines.py
│   │       ├── settings.py
│   │       ├── spiders
│   │       │   ├── __init__.py
│   │       │   └── quotes_spider.py
│   │       └── tor_controller.py
│   └── wait-for
│       └── wait-for
└── tor
    ├── Dockerfile
    └── torrc
The spider, defined in quotes_spider.py, is a very simple one based on the Scrapy tutorial:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/{n}/'.format(n=n) for n in range(1, 3)]

    custom_settings = {
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 5,
    }

    download_delay = 2    # Wait 2 seconds (actually a random time between 1 and 3 seconds) between downloading pages

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
In settings.py, I've activated a Scrapy extension with the lines
EXTENSIONS = {
    'tutorial.extensions.TorRenewIdentity': 1,
}
where extensions.py is
import logging
import random

from scrapy import signals
from scrapy.exceptions import NotConfigured

import tutorial.tor_controller as tor_controller

logger = logging.getLogger(__name__)

class TorRenewIdentity(object):

    def __init__(self, crawler, item_count):
        self.crawler = crawler
        self.item_count = self.randomize(item_count)    # Randomize the item count to confound traffic analysis
        self._item_count = item_count                   # Also remember the given item count for future randomizations
        self.items_scraped = 0

        # Connect the extension object to signals
        self.crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @staticmethod
    def randomize(item_count, min_factor=0.5, max_factor=1.5):
        '''Randomize the number of items scraped before changing identity.
        (A similar technique is applied to Scrapy's DOWNLOAD_DELAY setting).'''
        randomized_item_count = random.randint(int(min_factor*item_count), int(max_factor*item_count))
        logger.info("The crawler will scrape the following (randomized) number of items before changing identity (again): {}".format(randomized_item_count))
        return randomized_item_count

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('TOR_RENEW_IDENTITY_ENABLED'):
            raise NotConfigured

        item_count = crawler.settings.getint('TOR_ITEMS_TO_SCRAPE_PER_IDENTITY', 50)

        return cls(crawler=crawler, item_count=item_count)      # Instantiate the extension object

    def item_scraped(self, item, spider):
        '''When item_count items are scraped, pause the engine and change IP address.'''
        self.items_scraped += 1

        if self.items_scraped == self.item_count:
            logger.info("Scraped {item_count} items. Pausing engine while changing identity...".format(item_count=self.item_count))

            self.crawler.engine.pause()

            tor_controller.change_identity()                    # Change IP address (cf. https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor)
            self.items_scraped = 0                              # Reset the counter
            self.item_count = self.randomize(self._item_count)  # Generate a new random number of items to scrape before changing identity again

            self.crawler.engine.unpause()
and tor_controller.py is
import logging
import sys
import socket
import time

import requests
import stem
import stem.control

# Tor settings
TOR_ADDRESS = socket.gethostbyname("tor")   # The Docker Compose service in which this code is running should be linked to the "tor" service.
TOR_CONTROL_PORT = 9051     # This is configured in /etc/tor/torrc by the line "ControlPort 9051" (or by launching Tor with "tor -controlport 9051")
TOR_PASSWORD = "foo"        # The Tor password is written in the docker-compose.yml file. (It is passed as a build argument to the 'tor' service).

# Privoxy settings
PRIVOXY_ADDRESS = "privoxy"  # This assumes this code is running in a Docker Compose service linked to the "privoxy" service
PRIVOXY_PORT = 8118          # This is determined by the "listen-address" in Privoxy's "config" file
HTTP_PROXY = 'http://{address}:{port}'.format(address=PRIVOXY_ADDRESS, port=PRIVOXY_PORT)

logger = logging.getLogger(__name__)

class TorController(object):
    def __init__(self):
        self.controller = stem.control.Controller.from_port(address=TOR_ADDRESS, port=TOR_CONTROL_PORT)
        self.controller.authenticate(password=TOR_PASSWORD)
        self.session = requests.Session()
        self.session.proxies = {'http': HTTP_PROXY}

    def request_ip_change(self):
        self.controller.signal(stem.Signal.NEWNYM)

    def get_ip(self):
        '''Check what the current IP address is (as seen by IPEcho).'''
        return self.session.get('http://ipecho.net/plain').text

    def change_ip(self):
        '''Signal a change of IP address and wait for confirmation from IPEcho.net'''
        current_ip = self.get_ip()
        logger.debug("Initializing change of identity from the current IP address, {current_ip}".format(current_ip=current_ip))
        self.request_ip_change()
        while True:
            new_ip = self.get_ip()
            if new_ip == current_ip:
                logger.debug("The IP address is still the same. Waiting for 1 second before checking again...")
                time.sleep(1)
            else:
                break
        logger.debug("The IP address has been changed from {old_ip} to {new_ip}".format(old_ip=current_ip, new_ip=new_ip))
        return new_ip

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.controller.close()

def change_identity():
    with TorController() as tor_controller:
        tor_controller.change_ip()
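As an aside on change_ip() above: Tor rate-limits NEWNYM signals (it honors roughly one every ten seconds), so a signal sent too soon may be silently ignored, leaving the polling loop waiting until the next natural circuit change. Recent stem versions expose the rate limit through Controller.is_newnym_available() and Controller.get_newnym_wait(). Below is a minimal sketch of how request_ip_change could respect it; the helper name is mine rather than part of the project, and you should check that your stem version actually provides those two methods.

import time

import stem
import stem.control

def request_ip_change_safely(controller):
    '''Send NEWNYM only once Tor is ready to honor it.

    Tor rate-limits NEWNYM (roughly one per 10 seconds); signals sent
    sooner are ignored. Assumes a stem version that provides
    is_newnym_available() and get_newnym_wait().'''
    if not controller.is_newnym_available():
        time.sleep(controller.get_newnym_wait())  # Seconds until NEWNYM will be honored
    controller.signal(stem.Signal.NEWNYM)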
If I build the images with docker-compose build and then start the crawl with docker-compose up, by and large the extension works: according to the logs, it successfully changes the IP address and keeps scraping.
What bothers me, however, is that while the engine is paused I see error messages such as
scraper_1 | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): empty socket content
followed by
scraper_1 | 2017-05-12 16:35:06 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
What is causing these errors? Given that they have INFO level, can I perhaps just ignore them? (I've taken a look at stem's source code at https://gitweb.torproject.org/stem.git/, but so far haven't been able to work out what is going on.)

Best answer:

I don't know whether you ever came to a conclusion on your question.
I was actually getting the same log messages as you. My Scrapy project was behaving well, and the IP rotation through Tor and Privoxy was working successfully too. I just kept getting the log message INFO: [stem] Error while receiving a control message (SocketClosed): empty socket content, and it kept nagging at me.
I spent some time digging into what causes it, and checking whether it was something I could ignore (after all, it is an info message, not an error message).
The bottom line is that I don't know what causes it, but I felt safe enough to ignore it.
As the log says, the socket content (actually the stem control_file, which holds the relevant information about the socket connection) is empty. When control_file is empty, it triggers closing the socket connection (per the Python socket documentation). I'm not sure what causes control_file to be empty such that the socket connection closes. However, even if the connection really does close, it apparently gets opened again successfully, since my Scrapy crawl jobs and IP rotation kept working well. Although I couldn't find the real cause, I can only guess at a few possibilities: (1) the Tor network is unstable, or (2) when your code runs controller.signal(Signal.NEWNYM), the socket is temporarily closed and reopened, or some other reason I can't currently think of.
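If, like me, you end up deciding the message is harmless but noisy, one practical option is to raise the log threshold of stem's logger, so that INFO-level records are filtered out while warnings and errors still come through. A minimal sketch using the standard logging module, assuming stem logs under the 'stem' logger name (its default); place it wherever you configure logging, e.g. at the top of tor_controller.py:

import logging

# Drop stem's INFO-level records (including the SocketClosed notices)
# while keeping warnings and errors visible.
logging.getLogger('stem').setLevel(logging.WARNING)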