Scrapy start_urls

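The start_urls class attribute contains the start URLs and nothing more. If you have extracted URLs of other pages that you want to scrape, yield the corresponding requests from the parse callback, each with [another] callback: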

    import logging

    import scrapy


    class Spider(scrapy.Spider):
        name = 'my_spider'
        start_urls = [
            'http://www.domain.com/'
        ]
        allowed_domains = ['domain.com']

        def parse(self, response):
            '''Parse main page and extract categories links.'''
            urls = response.xpath("//*[@id='tSubmenuContent']/a[position()>1]/@href").getall()
            for url in urls:
                # Resolve relative hrefs against the URL of the current page.
                url = response.urljoin(url)
                self.log('Found category url: %s' % url)
                # Each category page is handled by its own callback.
                yield scrapy.Request(url, callback=self.parseCategory)

        def parseCategory(self, response):
            '''Parse category page and extract links of the items.'''
            # NOTE: the attribute test in td[@] is incomplete here;
            # supply the real predicate before running.
            links = response.xpath("//*[@id='_list']//td[@]/a/@href").getall()
            for link in links:
                itemlink = response.urljoin(link)
                self.log('Found item link: %s' % itemlink, logging.DEBUG)
                # Item pages are handled by a third callback.
                yield scrapy.Request(itemlink, callback=self.parseItem)

        def parseItem(self, response):
            ...
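
The spider resolves every extracted href against the URL of the page it was found on before requesting it. The following is a minimal standard-library sketch of that resolution step only. The page URL and the hrefs are made-up values for illustration and do not come from a real site.

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, chosen only for illustration.
page_url = "http://www.domain.com/catalog/index.html"
hrefs = [
    "/category/1",                      # root-relative
    "items.html",                       # relative to the current directory
    "../about",                         # relative to the parent directory
    "http://www.domain.com/x?page=2",   # already absolute
]

# urljoin resolves each href against the page it was found on.
absolute = [urljoin(page_url, href) for href in hrefs]

for url in absolute:
    print(url)
```

An href that is already absolute passes through unchanged, so it is safe to apply the join to every extracted link.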

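To make the division of labour concrete, here is a toy model of the request/callback chain. It is not Scrapy's engine and does not use Scrapy at all: the SITE dictionary, the Request class, ToySpider and crawl() are all hypothetical stand-ins, and the duplicate filter is deliberately simplified. It only illustrates that start_urls seeds the queue, while every further page is reached through a request yielded by a callback.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> hrefs found on that page.
SITE = {
    "http://www.domain.com/": ["/cat/a", "/cat/b"],
    "http://www.domain.com/cat/a": ["/item/1", "/item/2"],
    "http://www.domain.com/cat/b": ["/item/2", "/item/3"],
}


class Request:
    """Toy request: a URL plus the callback that will handle its page."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToySpider:
    # Seeds only; nothing else is ever read from this list.
    start_urls = ["http://www.domain.com/"]

    def __init__(self):
        self.items = []

    def parse(self, url, links):
        # Main page: follow category links with another callback.
        for href in links:
            yield Request(urljoin(url, href), callback=self.parse_category)

    def parse_category(self, url, links):
        # Category page: follow item links with a third callback.
        for href in links:
            yield Request(urljoin(url, href), callback=self.parse_item)

    def parse_item(self, url, links):
        # Item page: record it and yield no further requests.
        self.items.append(url)
        return []


def crawl(spider):
    """Simplified loop: seed from start_urls, then follow yielded requests."""
    queue = deque(Request(u, spider.parse) for u in spider.start_urls)
    seen = set()
    while queue:
        req = queue.popleft()
        if req.url in seen:  # naive duplicate filter
            continue
        seen.add(req.url)
        links = SITE.get(req.url, [])
        queue.extend(req.callback(req.url, links))
    return spider.items


items = crawl(ToySpider())
print(items)
```

In this sketch the item pages never appear in start_urls, yet all three are visited because the callbacks yield requests for them.
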
Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/4921392.html
