Big Data Analysis and Machine Learning

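1. Linear regression

Linear regression is a statistical method that uses regression analysis to determine the quantitative dependence between two or more variables. It is very widely used.

Its form is y = w'x + e, where w is the weight vector, x is the input vector, and e is an error term that follows a normal distribution with mean 0.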
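To make the formula y = w'x + e concrete, here is a minimal sketch that generates data from it and then recovers w by least squares. The weights, sample count and noise level are made-up values chosen only for illustration, and NumPy is used here instead of sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed so the run is repeatable

w_true = np.array([2.0, -1.0])                 # hypothetical weights, for illustration only
X = rng.normal(size=(200, 2))                  # 200 samples, 2 input variables
e = rng.normal(loc=0.0, scale=0.1, size=200)   # error term: normal with mean 0
y = X @ w_true + e                             # y = w'x + e

# Ordinary least squares: the w that minimises the sum of squared errors.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                                   # close to w_true, but not exactly equal
```

The estimate differs slightly from the true weights because of the error term e.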


How do we use linear regression to summarize and predict data? We start with the simplest linear model and move to deeper topics step by step.


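2. Data volume

The amount of training data largely determines how accurate the model can be. With too little training data a machine learning model performs poorly, which is why we need "big data".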
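The effect of data volume can be sketched with a small simulation. The line, the noise level and the sample sizes below are all made-up assumptions; the point is only to compare how far the fitted slope lands from the true slope, on average, with few samples and with many.

```python
import numpy as np

rng = np.random.default_rng(42)
w_true, b_true, noise = 0.5, 3.0, 2.0    # made-up line and noise level

def mean_slope_error(n_samples, n_trials=200):
    """Average |fitted slope - true slope| over repeated random data sets."""
    errors = []
    for _ in range(n_trials):
        x = rng.uniform(0.0, 10.0, size=n_samples)
        y = w_true * x + b_true + rng.normal(0.0, noise, size=n_samples)
        slope, _intercept = np.polyfit(x, y, deg=1)   # least-squares line
        errors.append(abs(slope - w_true))
    return float(np.mean(errors))

err_small = mean_slope_error(10)      # few training samples
err_large = mean_slope_error(1000)    # many training samples
print(err_small, err_large)
```

Averaging over many trials matters here: a single small data set can land close to the true slope by luck, but on average the larger data set gives the smaller error.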


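3. sklearn

Our tools are pandas and sklearn (scikit-learn) in Python. sklearn is a general-purpose machine learning library. It is a separate project from TensorFlow, which is the library normally used to build deep neural networks (DNNs). Here we first use sklearn to fit a linear regression and make predictions, once with an example from abroad and once with a heart-rate analysis of our own.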


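3.1 Life satisfaction analysis

The analysis goes like this: take per-capita GDP statistics, take life-satisfaction (happiness index) data, and ask one question: "Does money make people happy?" The example comes from the book 机器学习实战 (Hands-On Machine Learning with Scikit-Learn and TensorFlow).

Below are the first 25 lines of the file as plain text. The file does not open correctly in WPS.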

"LOCATION","Country","INDICATOR","Indicator","MEASURE","Measure","INEQUALITY","Inequality","Unit Code","Unit","PowerCode Code","PowerCode","Reference Period Code","Reference Period","Value","Flag Codes","Flags"
"AUS","Australia","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,1.1,"E","Estimated value"
"AUT","Austria","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,1,,
"BEL","Belgium","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,2,,
"CAN","Canada","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.2,,
"CZE","Czech Republic","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.9,,
"DNK","Denmark","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.9,,
"FIN","Finland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.6,,
"FRA","France","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.5,,
"DEU","Germany","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.1,,
"GRC","Greece","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.7,,
"HUN","Hungary","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,4.8,,
"ISL","Iceland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.4,,
"IRL","Ireland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.2,,
"ITA","Italy","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,1.1,,
"JPN","Japan","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,6.4,,
"KOR","Korea","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,4.2,,
"LUX","Luxembourg","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.1,,
"MEX","Mexico","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,4.2,,
"NLD","Netherlands","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0,,
"NZL","New Zealand","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.2,"E","Estimated value"
"NOR","Norway","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.3,,
"POL","Poland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,3.2,,
"PRT","Portugal","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.9,,
"SVK","Slovak Republic","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.6,,

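These are the figures for different countries. A note: the data published by the OECD (Organisation for Economic Co-operation and Development) is fairly precise and extensive. The table actually has several thousand rows, and only a few are shown here. It is combined with the data below.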

Country	Subject Descriptor	Units	Scale	Country/Series-specific Notes	2015	Estimates Start After
Afghanistan	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	599.994	2013
Albania	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	3,995.383	2010
Algeria	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	4,318.135	2014
Angola	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	4,100.315	2014
Antigua and Barbuda	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	14,414.302	2011
Argentina	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	13,588.846	2013
Armenia	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	3,534.860	2014
Australia	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	50,961.865	2014
Austria	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	43,724.031	2015
Azerbaijan	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	5,739.433	2014
The Bahamas	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	23,902.805	2013
Bahrain	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	23,509.981	2014
Bangladesh	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for:  Gross domestic product, current prices (National currency) Population (Persons).	1,286.868	2013

Next we use Python to join the two data sets.

import pandas as pd

def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the rows for the total population (no inequality breakdown)
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    # One row per country, one column per indicator
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)

    # Dump the renamed GDP table for inspection (mode="a" appends on every call)
    gdp_per_capita.to_csv("./Test.csv", encoding="utf-8-sig", mode="a", header=True, index=False)

    gdp_per_capita.set_index("Country", inplace=True)
    # Inner join on the country name held in both indexes
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    # Drop 7 of the 36 countries by position and keep the remaining 29
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = sorted(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]
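To see what the pivot and merge steps in prepare_country_stats do, here is a small sketch on toy tables. The countries "A", "B", "C" and all the numbers are invented for illustration and have nothing to do with the real OECD or GDP files.

```python
import pandas as pd

# Toy long-format table in the spirit of the OECD file (values are made up).
bli = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "INEQUALITY": ["TOT", "TOT", "TOT", "TOT"],
    "Indicator": ["Life satisfaction", "Air pollution"] * 2,
    "Value": [7.0, 12.0, 5.5, 20.0],
})
# Toy GDP table, also made up; country "C" has no match in bli.
gdp = pd.DataFrame({"Country": ["A", "B", "C"],
                    "GDP per capita": [40000.0, 12000.0, 900.0]})

# pivot: one row per country, one column per indicator.
wide = bli[bli["INEQUALITY"] == "TOT"].pivot(
    index="Country", columns="Indicator", values="Value")

# merge on the index: the default inner join keeps only countries in both tables.
merged = pd.merge(left=wide, right=gdp.set_index("Country"),
                  left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")
print(merged[["GDP per capita", "Life satisfaction"]])
```

Country "C" disappears from the result because it has no row in the indicator table, which is the same reason the real merge keeps only countries present in both files.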

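The prepare_country_stats function above only prepares the data. It is not part of the fitting itself, so it is fine if you do not follow every step. The data we want in the end looks like this: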

Country          GDP per capita   Life satisfaction
Russia                 9054.914                6.0
Turkey                 9437.372                5.6
Hungary               12239.894                4.9
Poland                12495.334                5.8
Slovak Republic       15991.736                6.1
Estonia               17288.083                5.6
Greece                18064.288                4.8
Portugal              19121.592                5.1
Slovenia              20732.482                5.7
Spain                 25864.721                6.5
Korea                 27195.197                5.8
Italy                 29866.581                6.0
Japan                 32485.545                5.9
Israel                35343.336                7.4
New Zealand           37044.891                7.3
France                37675.006                6.5
Belgium               40106.632                6.9
Germany               40996.511                7.0
Finland               41973.988                7.4
Canada                43331.961                7.3
Netherlands           43603.115                7.3
Austria               43724.031                6.9
United Kingdom        43770.688                6.8
Sweden                49866.266                7.2
Iceland               50854.583                7.5
Australia             50961.865                7.3
Ireland               51350.744                7.0
Denmark               52114.165                7.5
United States         55805.204                7.2
code
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Path to the CSV files
datapath = os.path.join("datasets", "lifesat", "")
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv",thousands=',',delimiter='\t',
                             encoding='latin1', na_values="n/a")
# Prepare the data (prepare_country_stats is the function defined earlier)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
print(country_stats)
# Visualize the data as a scatter plot
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()
# Select a linear model
model = sklearn.linear_model.LinearRegression()
# sklearn expects 2-D arrays: one row per sample
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]
# Train the model
model.fit(X, y)
# Make a prediction
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))
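LinearRegression fits an ordinary least-squares line, so the result can be cross-checked without the CSV files. The sketch below copies the 29 rows from the table above into arrays and solves the same least-squares problem with NumPy alone; it is an illustration of the technique, not a replacement for the sklearn code.

```python
import numpy as np

# GDP per capita and life satisfaction, copied from the 29-row table above.
gdp = np.array([9054.914, 9437.372, 12239.894, 12495.334, 15991.736,
                17288.083, 18064.288, 19121.592, 20732.482, 25864.721,
                27195.197, 29866.581, 32485.545, 35343.336, 37044.891,
                37675.006, 40106.632, 40996.511, 41973.988, 43331.961,
                43603.115, 43724.031, 43770.688, 49866.266, 50854.583,
                50961.865, 51350.744, 52114.165, 55805.204])
sat = np.array([6.0, 5.6, 4.9, 5.8, 6.1, 5.6, 4.8, 5.1, 5.7, 6.5,
                5.8, 6.0, 5.9, 7.4, 7.3, 6.5, 6.9, 7.0, 7.4, 7.3,
                7.3, 6.9, 6.8, 7.2, 7.5, 7.3, 7.0, 7.5, 7.2])

# Design matrix [1, x]: the column of ones carries the intercept.
A = np.column_stack([np.ones_like(gdp), gdp])
(intercept, slope), *_ = np.linalg.lstsq(A, sat, rcond=None)

prediction = intercept + slope * 22587   # same input as X_new in the code above
print(intercept, slope, prediction)
```

If the table matches the data used in the sklearn run, the printed prediction should agree with the value that model.predict returns for 22587.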
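Running the code displays a scatter plot of the data.

For a per-capita GDP of 22587 the predicted life satisfaction is 5.96242338. The data looks roughly linear, but note that we chose the linear model in advance: higher income tends to go with higher life satisfaction, while environment, geography and many other factors are left out. Strictly speaking the model is missing things. That is fine here, because the goal is to learn the method rather than to really predict life satisfaction.

Next we take our own company's elderly-care data, age and heart rate, and run the same linear analysis. In other words, we assume that age and heart rate are linearly related and use the fit to expose that relationship.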

code2
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.linear_model

age =   [1, 2, 3, 10, 12, 15, 16, 18, 20, 25, 28, 30, 34, 40, 50, 60]  # age in years
heart = [90, 87, 85, 78, 76, 77, 75, 74, 65, 64, 67, 70, 72, 76, 64, 65]  # mean heart rate
# Stack the two lists as columns and wrap them in a DataFrame
full = np.c_[age, heart]
data = pd.DataFrame(data=full, columns=['age', 'heart'])
print(data)

# Scatter plot of heart rate against age
data.plot(kind='scatter', x="age", y='heart')
plt.show()

# sklearn expects 2-D arrays: one row per sample
X = np.c_[data["age"]]
Y = np.c_[data["heart"]]
model = sklearn.linear_model.LinearRegression()
model.fit(X, Y)
# Predict the mean heart rate at age 14
X_new = [[14]]
print(model.predict(X_new))

The printed data looks like this:

     age  heart
0     1     90
1     2     87
2     3     85
3    10     78
4    12     76
5    15     77
6    16     75
7    18     74
8    20     65
9    25     64
10   28     67
11   30     70
12   34     72
13   40     76
14   50     64
15   60     65

The script also displays the scatter plot of heart rate against age.

Finally, for the value we want to predict, age 14, we get [[77.27657005]].
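A prediction alone does not say how well a straight line describes the data. As a rough check, the sketch below refits the same age and heart-rate values with NumPy and computes the coefficient of determination (R squared). This check is an addition for illustration and is not part of the original script.

```python
import numpy as np

age = np.array([1, 2, 3, 10, 12, 15, 16, 18, 20, 25, 28, 30, 34, 40, 50, 60], dtype=float)
heart = np.array([90, 87, 85, 78, 76, 77, 75, 74, 65, 64, 67, 70, 72, 76, 64, 65], dtype=float)

slope, intercept = np.polyfit(age, heart, deg=1)   # least-squares line
fitted = intercept + slope * age

ss_res = np.sum((heart - fitted) ** 2)             # variation the line misses
ss_tot = np.sum((heart - heart.mean()) ** 2)       # total variation in heart rate
r_squared = 1.0 - ss_res / ss_tot

prediction_14 = intercept + slope * 14             # prediction for age 14
print(prediction_14)
print(r_squared)                                   # 1.0 would be a perfect fit
```

An R squared well below 1 means the line leaves a good share of the variation unexplained, which is worth keeping in mind when reading the prediction for age 14.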


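Summary

We used existing data and a simple linear model to predict a relationship. In practice the chosen model is not necessarily the right one, and the real relationships are more complex. Later chapters will move on step by step to classifiers, multi-label classification and more models: ridge regression, elastic net, probability estimation, support vector machines, decision trees, random forests, dimensionality reduction, neural networks and so on.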
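One simple way to probe whether a straight line is too simple is to compare it with a slightly richer model on the same data. The sketch below fits a degree-1 and a degree-2 polynomial to the age and heart-rate values with NumPy and compares their squared errors. The quadratic is only an illustrative alternative chosen here, not a model the article proposes.

```python
import numpy as np

age = np.array([1, 2, 3, 10, 12, 15, 16, 18, 20, 25, 28, 30, 34, 40, 50, 60], dtype=float)
heart = np.array([90, 87, 85, 78, 76, 77, 75, 74, 65, 64, 67, 70, 72, 76, 64, 65], dtype=float)

def residual_sum_of_squares(degree):
    """Fit a polynomial of the given degree and return its squared error."""
    coeffs = np.polyfit(age, heart, deg=degree)
    return float(np.sum((heart - np.polyval(coeffs, age)) ** 2))

rss_line = residual_sum_of_squares(1)   # the straight line used in this article
rss_quad = residual_sum_of_squares(2)   # a curve with one extra term
print(rss_line, rss_quad)
```

The richer model can never have a larger error on the data it was fitted to, so a lower number here does not by itself prove the curve is the better model. Judging that requires data held back from training.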


To be continued in the next chapter.

Feel free to share. When reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/langs/589733.html
