Python: Tiantian Fund (天天基金网) Crawler Analysis


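I. Background of the Topic

Why choose this topic? What is the expected goal of the data analysis?

Now that the Internet has entered the era of big data, people have more and more ways to obtain information. Financial information is closely tied to everyday life, which makes it especially important. I chose this topic to follow the direction of the fund market faster and more accurately, and above all to make it easier to keep up with fund news.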

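II. Design of the Topic-Focused Web Crawler

1. Name of the crawler: Tiantian Fund (天天基金网) crawler analysis

2. Content crawled and characteristics of the data: the crawler visits the Tiantian Fund website, extracts the relevant information (fund name, previous-day unit net value and previous-day accumulated net value), and saves it for visual analysis.

3. Overview of the design (implementation approach and technical difficulties):

First, request the pages with requests (through the HTMLSession class of requests_html).

Next, parse the page content with lxml's etree and filter the data with XPath queries.

Finally, save the data through file operations.

Technical difficulties: crawling the site and filtering the data.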
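The parse and save steps above can be sketched on a small in-memory page. This is a minimal sketch rather than the program used in this report: the table in `SAMPLE_HTML` is invented for illustration, the request step is left out so the snippet runs offline, and the standard library's `xml.etree.ElementTree` and `csv` stand in for lxml and the Excel libraries. `ElementTree` only accepts well-formed markup, whereas lxml's `etree.HTML` tolerates real-world HTML. The helper names `extract_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fund table (not the real page markup).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>000001</td><td>Fund A</td><td>1.2340</td><td>3.4560</td></tr>
    <tr><td>000002</td><td>Fund B</td><td>0.9870</td><td>1.0020</td></tr>
  </tbody>
</table>
"""


def extract_rows(markup):
    """Parse the markup and select name, unit net value, accumulated net value."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.findall('./tbody/tr'):
        cells = [td.text.strip() for td in tr.findall('td')]
        rows.append((cells[1], cells[2], cells[3]))
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text; a real run would write to a file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['name', 'unit_nav', 'accumulated_nav'])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

Keeping extraction and saving in separate functions means each step can be checked on its own before it is pointed at the live site.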

III. Structural Analysis of the Target Pages

1. Structure and characteristics of the target page

Data source: http://fund.eastmoney.com/fund.html

2. HTML page analysis

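IV. Crawler Program Design

The crawler program should include the parts below, each with source code and fairly detailed comments, followed by a screenshot of its output.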

1. Data crawling and collection

# Pool of User-Agent strings; one is picked at random for each request
USER_AGENT_List = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]
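A list like this is typically used by picking one entry at random for each request, so that successive requests do not all carry the same browser signature. The following is a minimal sketch of that idea; `SAMPLE_USER_AGENTS`, `unique_user_agents` and `build_headers` are hypothetical names, and the short sample pool stands in for `USER_AGENT_List`. Removing repeated entries first gives every distinct agent the same chance of being chosen.

```python
import random

# Short stand-in pool; the real pool is USER_AGENT_List defined above.
SAMPLE_USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    # Repeated on purpose to show the de-duplication step
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]


def unique_user_agents(user_agents):
    """Drop repeated entries while keeping the first-seen order."""
    return list(dict.fromkeys(user_agents))


def build_headers(user_agents, rng=random):
    """Return request headers carrying one randomly chosen User-Agent."""
    return {'User-Agent': rng.choice(user_agents)}


pool = unique_user_agents(SAMPLE_USER_AGENTS)
headers = build_headers(pool)
print(len(pool), headers['User-Agent'])
```

The resulting dictionary can be passed as the `headers=` argument of the request call used later in this report.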

 

 

2. Data cleaning and processing

# Methods of the spider class. The imports, the shared `session` object and the
# class statement appear in the complete program in part 5.

def __init__(self):
    # URL of the first page to request
    self.start_url = 'http://fund.eastmoney.com/fund.html'
    # URL of the second data page
    self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

def parse_start_url(self):
    """Request the first page and pass the parsed tree to the next step."""
    # Pick a random User-Agent for this request
    headers = {'User-Agent': random.choice(USER_AGENT_List)}
    # Send the request and read the response body as bytes
    response = session.get(self.start_url, headers=headers).content
    # Parse the bytes into an element tree that supports XPath queries
    response = etree.HTML(response)
    # Continue with the second page
    self.parse_next_url_response(response)

def parse_next_url_response(self, response_1):
    """
    Request the second data page, then extract the data from both pages.
    :param response_1: parsed tree of the first page
    """
    headers = {'User-Agent': random.choice(USER_AGENT_List)}
    # Send the request to the second page and read the body as bytes
    response = session.get(self.next_url, headers=headers).content
    # Parse the bytes into an element tree that supports XPath queries
    response = etree.HTML(response)
    # Extract the data from both trees
    self.parse_response_data(response, response_1)

def parse_response_data(self, response_1, response):
    """
    Extract the three columns from both pages and merge them.
    :param response_1: parsed tree of the second page (self.next_url)
    :param response: parsed tree of the first page (self.start_url)
    """
    # Fund names
    name_List_1 = response.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
    name_List_2 = response_1.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
    name_List = name_List_1 + name_List_2
    # Previous-day unit net value
    num_1_List_data_1 = response.xpath('//tbody/tr/td[6]/text()')
    num_1_List_data_2 = response_1.xpath('//tr/td[6]/span/text()')
    num_1_List = num_1_List_data_1 + num_1_List_data_2
    # Previous-day accumulated net value
    num_2_List_data_1 = response.xpath('//tbody/tr/td[7]/text()')
    num_2_List_data_2 = response_1.xpath('//tr/td[7]/text()')
    num_2_List = num_2_List_data_1 + num_2_List_data_2
    # Save the rows and continue with the analysis
    self.for_parse_three_List(name_List, num_1_List, num_2_List)

def for_parse_three_List(self, name_List, num_1_List, num_2_List):
    """
    Save every row to Excel, then start the analysis.
    :param name_List: fund names
    :param num_1_List: previous-day unit net values
    :param num_2_List: previous-day accumulated net values
    """
    for a, b, c in zip(name_List, num_1_List, num_2_List):
        # The dictionary key is the name of the worksheet the row is written to
        dict_data = {'股票数据': [a, b, c]}
        self.parse_save_excel(dict_data)
        print(f'Fund: {a} ---- saved')
    # All rows are saved; generate the charts
    self.parse_random_data(name_List, num_1_List, num_2_List)

def parse_random_data(self, name_List, num_1_List, num_2_List):
    """Pick up to 15 random rows for the charts."""
    # Only indices that exist in all three lists are valid
    count = min(len(name_List), len(num_1_List), len(num_2_List))
    # Distinct random indices, never more than the number of rows available
    index_List = random.sample(range(count), min(15, count))
    # Draw the four charts from the sampled rows
    self.parse_img_four_func(index_List, name_List, num_1_List, num_2_List)
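The XPath queries above return text nodes, so the net values arrive as strings and may carry stray whitespace or non-numeric placeholders. The following is a minimal sketch of a cleaning step that could run before plotting. It is not part of the original program: the helper names `to_float`, `clean_rows` and `sample_rows` are hypothetical, the sample values are made up, and the `'---'` placeholder for a missing value is an assumption about the page.

```python
import random


def to_float(text):
    """Convert scraped text such as ' 1.2340 ' to float; return None if not numeric."""
    try:
        return float(text.strip())
    except (AttributeError, ValueError):
        return None


def clean_rows(names, unit_navs, accumulated_navs):
    """Zip the three columns and keep only rows whose two values are numeric."""
    cleaned = []
    for name, unit, accumulated in zip(names, unit_navs, accumulated_navs):
        unit_value = to_float(unit)
        accumulated_value = to_float(accumulated)
        if unit_value is not None and accumulated_value is not None:
            cleaned.append((name.strip(), unit_value, accumulated_value))
    return cleaned


def sample_rows(rows, k=15, seed=None):
    """Pick up to k distinct rows; never asks for more rows than exist."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Made-up values for illustration only.
names = ['Fund A ', 'Fund B', 'Fund C']
unit_navs = ['1.2340', '---', ' 0.9870']
accumulated_navs = ['3.4560', '2.0000', '1.0020']

rows = clean_rows(names, unit_navs, accumulated_navs)
print(rows)
print(sample_rows(rows, k=15, seed=0))
```

Filtering before sampling means every sampled row can be plotted, and bounding the sample size by the number of rows avoids index errors when fewer rows were scraped than expected.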

4. Data analysis and visualization (for example: bar chart, histogram, scatter plot, box plot, distribution plot)

# Method of the spider class; plt, np and FontProperties are imported at module
# level in the complete program in part 5.

def parse_img_four_func(self, index_List, name_List, num_1_List, num_2_List):
    """
    Draw four charts from the sampled rows.
    :param index_List: indices of the sampled rows
    :param name_List: fund names
    :param num_1_List: previous-day unit net values
    :param num_2_List: previous-day accumulated net values
    """
    Title_List = []  # fund names
    qy_num_1 = []    # unit net values
    qy_num_2 = []    # accumulated net values
    for index_num in index_List:
        Title_List.append(name_List[index_num])
        # XPath returns text, so convert to float before plotting
        # (assumes every sampled value is numeric)
        qy_num_1.append(float(num_1_List[index_num]))
        qy_num_2.append(float(num_2_List[index_num]))

    # Chart 1: line chart of both net values
    # These two settings let matplotlib render Chinese text and the minus sign
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    # Arguments: x values, y values, marker and line style, colour, opacity, line width, legend label
    plt.plot(Title_List, qy_num_2, 'o-', color='#4169E1', alpha=0.8, linewidth=1, label='累计净值')
    plt.plot(Title_List, qy_num_1, 'o-', color='#69e141', alpha=0.8, linewidth=1, label='单位净值')
    # Without legend() the label= arguments above are not displayed
    plt.legend(loc="upper right")
    plt.xticks(rotation=270)
    plt.xlabel('基金名称')
    plt.ylabel('净值')
    plt.savefig('根据净值生成折线图.png')
    plt.show()

    # Chart 2: pie chart of the unit net values
    addr_dict_key = Title_List
    addr_dict_value = qy_num_1
    plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
    plt.rcParams['axes.unicode_minus'] = False
    plt.pie(addr_dict_value, labels=addr_dict_key, autopct='%1.1f%%')
    plt.title('单位净值对比')
    plt.savefig('单位净值对比-饼图')
    plt.show()

    # Chart 3: scatter plot of the accumulated net values
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    production = Title_List
    tem = qy_num_2
    colors = np.random.rand(len(tem))  # one random colour value per point
    plt.scatter(tem, production, s=200, c=colors)  # marker size 200
    plt.xlabel('累计净值')  # x-axis title
    plt.xticks(rotation=270)
    plt.ylabel('名称')  # y-axis title
    plt.savefig('净值散点图.png')
    plt.show()

    # Chart 4: grouped bar chart of both net values
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    # Font file used for the legend; this path only exists on Windows
    zhFont1 = FontProperties(fname=r'C:\Windows\Fonts\simsun.ttc')
    name_List = Title_List
    num_List = qy_num_1  # unit net values
    width = 0.5  # bar width
    index = np.arange(len(name_List))
    plt.bar(index, num_List, width, color='steelblue', tick_label=name_List, label='单位净值')
    plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='累计净值')
    plt.legend(prop=zhFont1, labelspacing=1)
    for a, b in zip(index, num_List):  # print the value above each bar
        plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
    plt.xticks(rotation=270)
    plt.title('净值柱状图')
    plt.ylabel('净值')
    plt.savefig('净值-柱状图', bbox_inches='tight')
    plt.show()

5. Complete program code, combining the parts above

import os
import random

import numpy as np
import xlrd
import xlwt
from lxml import etree
from matplotlib import pyplot as plt
from matplotlib.font_manager import FontProperties  # font support
from requests_html import HTMLSession
from xlutils.copy import copy

# USER_AGENT_List: place the User-Agent list from part 1 (Data crawling and collection) here.

session = HTMLSession()


class DFSpider(object):

    def __init__(self):
        # URL of the first page to request
        self.start_url = 'http://fund.eastmoney.com/fund.html'
        # URL of the second data page
        self.next_url = 'http://fund.eastmoney.com/HBJJ_pjsyl.html'

    def parse_start_url(self):
        """Request the first page and pass the parsed tree to the next step."""
        # Pick a random User-Agent for this request
        headers = {'User-Agent': random.choice(USER_AGENT_List)}
        # Send the request and read the response body as bytes
        response = session.get(self.start_url, headers=headers).content
        # Parse the bytes into an element tree that supports XPath queries
        response = etree.HTML(response)
        # Continue with the second page
        self.parse_next_url_response(response)

    def parse_next_url_response(self, response_1):
        """
        Request the second data page, then extract the data from both pages.
        :param response_1: parsed tree of the first page
        """
        headers = {'User-Agent': random.choice(USER_AGENT_List)}
        # Send the request to the second page and read the body as bytes
        response = session.get(self.next_url, headers=headers).content
        # Parse the bytes into an element tree that supports XPath queries
        response = etree.HTML(response)
        # Extract the data from both trees
        self.parse_response_data(response, response_1)

    def parse_response_data(self, response_1, response):
        """
        Extract the three columns from both pages and merge them.
        :param response_1: parsed tree of the second page (self.next_url)
        :param response: parsed tree of the first page (self.start_url)
        """
        # Fund names
        name_List_1 = response.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_List_2 = response_1.xpath('//tbody/tr/td[5]/nobr/a[1]/text()')
        name_List = name_List_1 + name_List_2
        # Previous-day unit net value
        num_1_List_data_1 = response.xpath('//tbody/tr/td[6]/text()')
        num_1_List_data_2 = response_1.xpath('//tr/td[6]/span/text()')
        num_1_List = num_1_List_data_1 + num_1_List_data_2
        # Previous-day accumulated net value
        num_2_List_data_1 = response.xpath('//tbody/tr/td[7]/text()')
        num_2_List_data_2 = response_1.xpath('//tr/td[7]/text()')
        num_2_List = num_2_List_data_1 + num_2_List_data_2
        # Save the rows and continue with the analysis
        self.for_parse_three_List(name_List, num_1_List, num_2_List)

    def for_parse_three_List(self, name_List, num_1_List, num_2_List):
        """
        Save every row to Excel, then start the analysis.
        :param name_List: fund names
        :param num_1_List: previous-day unit net values
        :param num_2_List: previous-day accumulated net values
        """
        for a, b, c in zip(name_List, num_1_List, num_2_List):
            # The dictionary key is the name of the worksheet the row is written to
            dict_data = {'股票数据': [a, b, c]}
            self.parse_save_excel(dict_data)
            print(f'Fund: {a} ---- saved')
        # All rows are saved; generate the charts
        self.parse_random_data(name_List, num_1_List, num_2_List)

    def parse_random_data(self, name_List, num_1_List, num_2_List):
        """Pick up to 15 random rows for the charts."""
        # Only indices that exist in all three lists are valid
        count = min(len(name_List), len(num_1_List), len(num_2_List))
        # Distinct random indices, never more than the number of rows available
        index_List = random.sample(range(count), min(15, count))
        # Draw the four charts from the sampled rows
        self.parse_img_four_func(index_List, name_List, num_1_List, num_2_List)

    def parse_img_four_func(self, index_List, name_List, num_1_List, num_2_List):
        """
        Draw four charts from the sampled rows.
        :param index_List: indices of the sampled rows
        :param name_List: fund names
        :param num_1_List: previous-day unit net values
        :param num_2_List: previous-day accumulated net values
        """
        Title_List = []  # fund names
        qy_num_1 = []    # unit net values
        qy_num_2 = []    # accumulated net values
        for index_num in index_List:
            Title_List.append(name_List[index_num])
            # XPath returns text, so convert to float before plotting
            # (assumes every sampled value is numeric)
            qy_num_1.append(float(num_1_List[index_num]))
            qy_num_2.append(float(num_2_List[index_num]))

        # Chart 1: line chart of both net values
        # These two settings let matplotlib render Chinese text and the minus sign
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        # Arguments: x values, y values, marker and line style, colour, opacity, line width, legend label
        plt.plot(Title_List, qy_num_2, 'o-', color='#4169E1', alpha=0.8, linewidth=1, label='累计净值')
        plt.plot(Title_List, qy_num_1, 'o-', color='#69e141', alpha=0.8, linewidth=1, label='单位净值')
        # Without legend() the label= arguments above are not displayed
        plt.legend(loc="upper right")
        plt.xticks(rotation=270)
        plt.xlabel('基金名称')
        plt.ylabel('净值')
        plt.savefig('根据净值生成折线图.png')
        plt.show()

        # Chart 2: pie chart of the unit net values
        addr_dict_key = Title_List
        addr_dict_value = qy_num_1
        plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.pie(addr_dict_value, labels=addr_dict_key, autopct='%1.1f%%')
        plt.title('单位净值对比')
        plt.savefig('单位净值对比-饼图')
        plt.show()

        # Chart 3: scatter plot of the accumulated net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        production = Title_List
        tem = qy_num_2
        colors = np.random.rand(len(tem))  # one random colour value per point
        plt.scatter(tem, production, s=200, c=colors)  # marker size 200
        plt.xlabel('累计净值')  # x-axis title
        plt.xticks(rotation=270)
        plt.ylabel('名称')  # y-axis title
        plt.savefig('净值散点图.png')
        plt.show()

        # Chart 4: grouped bar chart of both net values
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        # Font file used for the legend; this path only exists on Windows
        zhFont1 = FontProperties(fname=r'C:\Windows\Fonts\simsun.ttc')
        name_List = Title_List
        num_List = qy_num_1  # unit net values
        width = 0.5  # bar width
        index = np.arange(len(name_List))
        plt.bar(index, num_List, width, color='steelblue', tick_label=name_List, label='单位净值')
        plt.bar(index + width, qy_num_2, width, color='red', hatch='\\', label='累计净值')
        plt.legend(prop=zhFont1, labelspacing=1)
        for a, b in zip(index, num_List):  # print the value above each bar
            plt.text(a, b, '%.2f' % b, ha='center', va='bottom', fontsize=7)
        plt.xticks(rotation=270)
        plt.title('净值柱状图')
        plt.ylabel('净值')
        plt.savefig('净值-柱状图', bbox_inches='tight')
        plt.show()

    def parse_save_excel(self, data_dict):
        """Append one row to 数据/股票数据.xls, creating the file on first use."""
        # Create the output folder if it does not exist yet
        os_path_1 = os.path.join(os.getcwd(), '数据')
        if not os.path.exists(os_path_1):
            os.mkdir(os_path_1)
        os_path = os.path.join(os_path_1, '股票数据.xls')
        if not os.path.exists(os_path):
            # Create a new workbook (a new Excel file)
            workbook = xlwt.Workbook(encoding='utf-8')
            # Create the worksheet and write the header row
            worksheet1 = workbook.add_sheet("股票数据", cell_overwrite_ok=True)
            excel_data_1 = ('股票名称', '昨日单位净值', '昨日累计净值')
            for i in range(0, len(excel_data_1)):
                worksheet1.col(i).width = 2560 * 3
                # write(row, column, value)
                worksheet1.write(0, i, excel_data_1[i])
            workbook.save(os_path)
        # Reopen the file and append the row to the matching worksheet
        if os.path.exists(os_path):
            workbook = xlrd.open_workbook(os_path)
            # Names of all worksheets in the workbook
            sheets = workbook.sheet_names()
            for i in range(len(sheets)):
                for name in data_dict.keys():
                    worksheet = workbook.sheet_by_name(sheets[i])
                    # Only write to the worksheet whose name matches the key
                    if worksheet.name == name:
                        # Number of rows already in the worksheet
                        rows_old = worksheet.nrows
                        # Copy the read-only xlrd workbook into a writable xlwt workbook
                        new_workbook = copy(workbook)
                        # Worksheet i of the writable copy
                        new_worksheet = new_workbook.get_sheet(i)
                        for num in range(0, len(data_dict[name])):
                            new_worksheet.write(rows_old, num, data_dict[name][num])
                        new_workbook.save(os_path)

    def run(self):
        """Entry point: start the crawl."""
        self.parse_start_url()


if __name__ == '__main__':
    d = DFSpider()
    d.run()
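The save method above reopens, copies and re-saves the workbook once for every row. Where an .xls file is not strictly required, writing all rows in a single pass is simpler. The following is a minimal sketch of that alternative, using the standard library's `csv` module in place of xlwt, xlrd and xlutils. The function name `save_rows_csv`, the file name `funds.csv` and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def save_rows_csv(rows, path):
    """Write a header row plus all data rows to `path` in a single pass."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # utf-8-sig adds a byte-order mark so spreadsheet tools detect the encoding
    with open(path, 'w', newline='', encoding='utf-8-sig') as handle:
        writer = csv.writer(handle)
        writer.writerow(['基金名称', '昨日单位净值', '昨日累计净值'])
        writer.writerows(rows)
    return path


# Made-up rows for illustration only.
demo_rows = [('Fund A', '1.2340', '3.4560'), ('Fund B', '0.9870', '1.0020')]
demo_path = os.path.join(tempfile.mkdtemp(), 'data', 'funds.csv')
save_rows_csv(demo_rows, demo_path)

with open(demo_path, newline='', encoding='utf-8-sig') as handle:
    saved = list(csv.reader(handle))
print(saved)
```

Collecting the rows first and writing once keeps the cost proportional to the number of rows, instead of rewriting the whole file for each new row.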

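V. Summary

Through this course project I gained a further understanding of Python and became more practiced with its web crawling techniques. I ran into many problems along the way, but I was able to complete the project with help from classmates and teachers and with material I found online.

The project also showed me that I still have many shortcomings and blind spots in my Python knowledge, which makes me take learning Python all the more seriously.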

 


Original article: http://outofmemory.cn/langs/1159396.html
