Scrapy start_urls



The start_urls class attribute contains the start URLs and nothing more. If you want to extract URLs from other pages, yield a Request with another callback from the parse callback:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Spider(BaseSpider):
    name = 'my_spider'
    start_urls = [
        'http://www.domain.com/'
    ]
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse main page and extract categories links.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse category page and extract links of the items.'''
        hxs = HtmlXPathSelector(response)
        # Note: the attribute test inside td[@] was truncated in the source post.
        links = hxs.select("//*[@id='_list']//td[@]/a/@href").extract()
        for link in links:
            itemlink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemlink, log.DEBUG)
            yield Request(itemlink, callback=self.parseItem)

    def parseItem(self, response):
        ...
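The core idea above is that each callback yields either new Request objects (which the engine schedules and later feeds back into the named callback) or scraped items. That control flow can be sketched without Scrapy itself; the Request class, MiniSpider, and crawl driver below are hypothetical stand-ins, not the real Scrapy API, and the pages dict fakes the HTTP downloads a real engine would perform:

```python
class Request(object):
    '''Stand-in for scrapy.http.Request: a URL plus the callback
    that should process its response.'''
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class MiniSpider(object):
    start_urls = ['http://www.domain.com/']

    # Fake link graph standing in for downloaded pages (assumption:
    # a real engine would fetch these over HTTP and parse the HTML).
    pages = {
        'http://www.domain.com/': ['http://www.domain.com/cat1'],
        'http://www.domain.com/cat1': ['http://www.domain.com/item1'],
        'http://www.domain.com/item1': [],
    }

    def parse(self, url):
        # Like the parse() above: follow category links with another callback.
        for link in self.pages[url]:
            yield Request(link, callback=self.parse_category)

    def parse_category(self, url):
        # Follow item links with yet another callback.
        for link in self.pages[url]:
            yield Request(link, callback=self.parse_item)

    def parse_item(self, url):
        # Leaf callback: yield a scraped item instead of a Request.
        yield {'url': url}


def crawl(spider):
    '''Drive the spider the way Scrapy's engine does: seed the queue
    from start_urls with parse as the callback, then route every
    yielded Request to its own callback until the queue drains.'''
    items = []
    queue = [Request(u, spider.parse) for u in spider.start_urls]
    while queue:
        req = queue.pop(0)
        for result in req.callback(req.url):
            if isinstance(result, Request):
                queue.append(result)   # schedule for later
            else:
                items.append(result)   # collect scraped item
    return items


print(crawl(MiniSpider()))  # [{'url': 'http://www.domain.com/item1'}]
```

The point of the sketch: start_urls only seeds the first batch of requests; everything after that comes from the Requests your callbacks yield.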



Original source: http://outofmemory.cn/zaji/4921392.html
