Scraping TripAdvisor with pyspider


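First install pymongo and pyspider (installation is not covered here), then

run the following on the command line:

pyspider all

This command runs pyspider and starts all of its components.

You should see that the program has started normally and is listening on port 5000.

Next, open http://localhost:5000 in a browser to see the pyspider main page. Click Create in the lower-right corner and name the project pyspiderlianxi (any name will do), then click Create again.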
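Before running pyspider all, it can help to confirm that both packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration and is not part of pyspider or pymongo.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["pyspider", "pymongo"])
    if missing:
        print("Install these first:", ", ".join(missing))
    else:
        print("pyspider and pymongo are both importable.")
```

If the list comes back empty, both packages are importable and pyspider all can be tried next.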

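The page is split into two columns: the left is the preview area for crawled pages, and the right is the code editor. The individual regions are:

Left, green area: the JSON variable for this request. In pyspider every request has a corresponding JSON variable, which includes the callback function, method name, request URL, request data, and so on.

Run, at the top right of the green area: click this button to execute the request; the result appears in the white area on the left.

Left, enable CSS selector helper: after a page has been fetched, click this button to conveniently get the CSS selector for an element on the page.

Left, web: a live preview of the fetched page.

Left, HTML: the HTML source of the fetched page.

Left, follows: if the current callback creates new crawl requests, those follow-up requests are listed here.

Left, messages: messages printed during the crawl.

Right, code area: write your code here and click the Save button at the top right to save it.

Right, WebDAV Mode: turns on debugging mode and maximizes the left side, which makes it easier to watch and debug.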
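To make the per-request JSON variable from the green area more concrete, here is a small sketch that parses one and pulls out the parts mentioned above. The JSON text is an illustrative example for the initial on_start task, not captured output, and describe_task is a hypothetical helper written for this note; the exact fields pyspider shows may differ between versions.

```python
import json

# Illustrative task JSON; the field values are assumptions, not captured output.
TASK_JSON = """
{
  "process": {"callback": "on_start"},
  "project": "pyspiderlianxi",
  "taskid": "data:,on_start",
  "url": "data:,on_start"
}
"""


def describe_task(raw):
    """Extract the project, callback name and URL from a task JSON string."""
    task = json.loads(raw)
    return {
        "project": task.get("project"),
        # The callback name sits under "process" in this example layout.
        "callback": task.get("process", {}).get("callback"),
        "url": task.get("url"),
    }


summary = describe_task(TASK_JSON)
print(summary)
```

Reading the callback and URL this way is only meant to show how the variable is organized; in normal use pyspider fills it in and consumes it itself.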

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-23 08:49:39
# Project: pyspiderlianxi

from pyspider.libs.base_handler import *
import pymongo


class Handler(BaseHandler):
    crawl_config = {
    }

    # Connection to a local MongoDB; results are stored in the 'trip' database.
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['trip']

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.tripadvisor.cn/Attractions-g294211-Activities-c47-China.html',
                   callback=self.index_page, validate_cert=False)

    # Crawl the index page
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('div.listing_info > div.listing_title > a').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

    # Crawl the detail page
    @config(priority=2)
    def detail_page(self, response):
        url = response.url
        title = response.doc('title').text()
        name = response.doc('#heading').text()
        paiming = response.doc('span.header_rating > div > a > span').text()
        phonenum = response.doc('div.blEntry.phone > span:nth-child(2)').text()
        # The '.xh-highlight' class appears to be injected by a browser
        # extension (XPath Helper), not by the site, so it is left out here.
        dizhi = response.doc('div.detail_section.address').text()
        youwanshijian = response.doc('#taplc_attraction_detail_listing_0 > div.section.hours > div').text()

        return {
            "url": url,
            "title": title,
            "name": name,
            "paiming": paiming,
            "phonenum": phonenum,
            "dizhi": dizhi,
            "youwanshijian": youwanshijian
        }

    # Store the result in the database
    def on_result(self, result):
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        # insert_one is used because the legacy insert() was removed in PyMongo 4.
        if self.db['chinastrip'].insert_one(result):
            print('save to mongo', result)
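A selector that matches nothing yields an empty string from .text(), so some of the fields returned by detail_page may be blank or padded with stray whitespace. The sketch below shows one possible way to tidy a result dict before it is written to MongoDB. The helper clean_result and the sample values are assumptions made for illustration; they are not part of the original script or of pyspider.

```python
def clean_result(result):
    """Return a copy of `result` with whitespace normalized and empty fields dropped."""
    cleaned = {}
    for key, value in result.items():
        if isinstance(value, str):
            # Collapse runs of spaces and newlines into single spaces.
            value = " ".join(value.split())
        # Skip fields whose selector matched nothing.
        if value not in ("", None):
            cleaned[key] = value
    return cleaned


# Hypothetical scraped record, for demonstration only.
sample = {
    "url": "https://example.com/attraction",
    "name": "  Great \n Wall ",
    "phonenum": "",
    "dizhi": None,
}
print(clean_result(sample))
```

One place to call such a helper would be at the top of save_to_mongo, so that only the cleaned fields reach the database.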


Original article: http://outofmemory.cn/web/1086320.html
