python网络爬虫（11）近期电影票房或热度信息爬取_python

概述目标意义为了理解动态网站中一些数据如何获取，做一个简单的分析。说明思路，原始代码来源于：https://book.douban.com/subject/27061630/。构造-下载器构造分下载器，下载原始网页，用于原始网页的获取，动态网页中，js部分的响应获取。通过浏览器模仿，合理制作请求头，获取网页信息即可。代码如下： import requestsimport charde 目标意义

为了理解动态网站中一些数据如何获取，做一个简单的分析。

说明

思路，原始代码来源于：https://book.douban.com/subject/27061630/。

构造-下载器

构造分下载器，下载原始网页，用于原始网页的获取，动态网页中，Js部分的响应获取。

通过浏览器模仿，合理制作请求头，获取网页信息即可。

代码如下：

import requestsimport chardetclass HTMLDownloader(object):    def download(self,url):        if url is None:            return None        user_agent=‘Mozilla/5.0 (windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0‘        headers={‘User-Agent‘:user_agent}        r=requests.get(url,headers=headers)        if r.status_code is 200:            r.enCoding=chardet.detect(r.content)[‘enCoding‘]            return r.text        return None

构造-解析器

解析器解析数据使用。

获取的票房信息，电影名称等，使用解析器完成。

被解析的动态数据来源于Js部分的代码。

Js地址的获取则通过F12控制台-->网络-->Js，然后观察，得到。

地址如正上映的电影：

http://service.library.mtime.com/MovIE.API?AJAX_CallBack=true&AJAX_CallBackType=Mtime.library.Services&AJAX_CallBackMethod=GetMovIEOvervIEwrating&AJAX_CrossDomain=1&AJAX_RequestUrl=http://movIE.mtime.com/257982/&t=201907121611461266&AJAX_CallBackArgument0=257982

返回信息中，解析出Json格式的部分，通过Json的一些方法，获取其中的票房等信息。

其中，Json解析工具地址如：https://www.json.cn/

未上映的电影是同理的。

这些数据的解析有差异，所以定制了函数分支，处理解析过程中可能遇到的不同情景。

代码如下：

import reimport Jsonclass HTMLParser(object):    def parser_url(self,page_url,response):        pattern=re.compile(r‘(http://movIE.mtime.com/(\d+)/)‘)        urls=pattern.findall(response)        if urls != None:            return List(set(urls))#Duplicate removal        else:            return None            def parser_Json(self,url,response):        #parsing Json. input page_url as Js url and response for parsing        pattern=re.compile(r‘=(.*?);‘)        result=pattern.findall(response)[0]        if result != None:            value=Json.loads(result)            isRelease=value.get(‘value‘).get(‘isRelease‘)            if isRelease:                isRelease=1                return self.parser_Json_release(value,url)            else:                isRelease=0                return self.parser_Json_notRelease(value,url)        return None    def parser_Json_release(self,value,url):        isRelease=1        movIETitle=value.get(‘value‘).get(‘movIETitle‘)        ratingFinal=value.get(‘value‘).get(‘movIErating‘).get(‘ratingFinal‘)        try:            TotalBoxOffice=value.get(‘value‘).get(‘BoxOffice‘).get(‘TotalBoxOffice‘)            TotalBoxOfficeUnit=value.get(‘value‘).get(‘BoxOffice‘).get(‘TotalBoxOfficeUnit‘)        except:            TotalBoxOffice="None"            TotalBoxOfficeUnit="None"        return isRelease,movIETitle,ratingFinal,TotalBoxOffice,TotalBoxOfficeUnit,url            def parser_Json_notRelease(self,url):        isRelease=0        movIETitle=value.get(‘value‘).get(‘movIETitle‘)        try:            ratingFinal=Ranking=value.get(‘value‘).get(‘hotValue‘).get(‘Ranking‘)        except:            ratingFinal=-1        TotalBoxOffice=‘None‘        TotalBoxOfficeUnit=‘None‘        return isRelease,url

构造-存储器

存储方案为sqlite，所以在解析器中isRelease部分，使用了0和1进行的存储。

存储需要连接sqlite3，创建数据库，获取执行数据库语句的方法，插入数据等。

按照原作者思路，存储时，先暂时存储到内存中，条数大于10以后，将内存中的数据插入到sqlite数据库中。

代码如下：

import sqlite3class DataOutput(object):    def __init__(self):        self.cx=sqlite3.connect("MTime.db")        self.create_table(‘MTime‘)        self.datas=[]        def create_table(self,table_name):        values=‘‘‘        ID integer primary key autoincrement,isRelease boolean not null,movIETitle varchar(50) not null,ratingFinal_HotValue real not null default 0.0,TotalBoxOffice varchar(20),TotalBoxOfficeUnit varchar(10),sourceUrl varchar(300)        ‘‘‘        self.cx.execute(‘create table if not exists %s(%s)‘ %(table_name,values))            def store_data(self,data):        if data is None:            return        self.datas.append(data)        if len(self.datas)>10:            self.output_db(‘MTime‘)                def output_db(self,table_name):        for data in self.datas:            cmd="insert into %s (isRelease,ratingFinal_HotValue,sourceUrl) values %s" %(table_name,data)            self.cx.execute(cmd)            self.datas.remove(data)        self.cx.commit()            def output_end(self):        if len(self.datas)>0:            self.output_db(‘MTime‘)        self.cx.close()

主函数部分

创建以上对象作为初始化

然后获取根路径。从根路径下找到百余条电影网址信息。

对每个电影网址信息一一解析，然后存储。

import HTMLDownloaderimport HTMLParserimport DataOutputimport timeclass SpIDer(object):    def __init__(self):        self.downloader=HTMLDownloader.HTMLDownloader()        self.parser=HTMLParser.HTMLParser()        self.output=DataOutput.DataOutput()        def crawl(self,root_url):        content=self.downloader.download(root_url)        urls=self.parser.parser_url(root_url,content)        for url in urls:            print(‘.‘)            t=time.strftime("%Y%m%d%H%M%s1266",time.localtime())            rank_url=‘http://service.library.mtime.com/MovIE.API‘            ‘?AJAX_CallBack=true‘            ‘&AJAX_CallBackType=Mtime.library.Services‘            ‘&AJAX_CallBackMethod=GetMovIEOvervIEwrating‘            ‘&AJAX_CrossDomain=1‘            ‘&AJAX_RequestUrl=%s‘            ‘&t=%s‘            ‘&AJAX_CallBackArgument0=%s‘ %(url[0],t,url[1])            rank_content=self.downloader.download(rank_url)            try:                data=self.parser.parser_Json(rank_url,rank_content)            except:                print(rank_url)            self.output.store_data(data)        self.output.output_end()        print(‘ed‘)if __name__==‘__main__‘:    spIDer=SpIDer()    spIDer.crawl(‘http://theater.mtime.com/China_Beijing/‘)

当前效果

如下：

总结

以上是内存溢出为你收集整理的python网络爬虫（11）近期电影票房或热度信息爬取全部内容，希望文章能够帮你解决python网络爬虫（11）近期电影票房或热度信息爬取所遇到的程序开发问题。

如果觉得内存溢出网站内容还不错，欢迎将内存溢出网站推荐给程序员好友。

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/langs/1197989.html

python网络爬虫（11）近期电影票房或热度信息爬取

发表评论

评论列表（0条）