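Linear regression is a statistical analysis method, drawn from regression analysis in mathematical statistics, that determines the quantitative relationship between two or more interdependent variables. It is very widely used.
Its form is y = w'x + e, where e is an error term that follows a normal distribution with mean 0.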
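To make the formula concrete, here is a minimal sketch that generates synthetic data from y = w*x + e and then recovers w by least squares. The true weight 2.0, the noise scale, the sample size and the seed are all illustrative assumptions, not values from this article.

```python
import numpy as np

rng = np.random.default_rng(0)                  # fixed seed, chosen arbitrarily
w_true = 2.0                                    # assumed "true" weight
x = rng.uniform(0.0, 10.0, size=1000)           # a single input variable
e = rng.normal(loc=0.0, scale=1.0, size=1000)   # zero-mean normal error
y = w_true * x + e

# Least-squares estimate of w for the model y = w * x + e (no intercept term)
w_hat, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print(w_hat[0])
```

The estimate lands close to the assumed weight because the error averages out over many samples.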
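How can linear regression be used to summarize and predict data? We start by predicting with the simplest linear model and move to deeper levels step by step.
The amount of data largely determines how accurate the model is. With too little training data, machine learning gives poor results, which is why we need "big data".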
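The effect of sample size can be sketched with a small simulation: fit a line to synthetic data many times and compare the average slope error for a small and a large sample. The ground-truth line, the noise level and the sample sizes below are assumptions picked only for illustration.

```python
import numpy as np

def mean_slope_error(n_samples, n_trials=200, seed=0):
    """Average absolute error of the fitted slope over repeated synthetic experiments."""
    rng = np.random.default_rng(seed)
    w_true, b_true = 2.0, 1.0                    # assumed ground-truth line
    errors = []
    for _ in range(n_trials):
        x = rng.uniform(0.0, 10.0, size=n_samples)
        y = w_true * x + b_true + rng.normal(0.0, 3.0, size=n_samples)
        slope, _intercept = np.polyfit(x, y, deg=1)   # least-squares line
        errors.append(abs(slope - w_true))
    return float(np.mean(errors))

small = mean_slope_error(5)      # very little training data
large = mean_slope_error(500)    # a hundred times more
print(small, large)
```

Under these assumptions the slope fitted from 5 points is, on average, much further from the true value than the one fitted from 500 points.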
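The tools we use are pandas and sklearn in Python. sklearn (scikit-learn) is a general-purpose machine learning library; it is separate from TensorFlow, which is the usual choice for building deep neural networks (DNNs). This time we first use sklearn to fit a linear regression for prediction, working through a published example from abroad and then a heart-rate analysis of our own.
The analysis works like this: take per-capita GDP statistics together with happiness-index (life satisfaction) data and ask one question: "Does money make people happy?" The example comes from the book 机器学习实战 (Hands-On Machine Learning).
Below are the first 25 lines of the text data (WPS does not open the file correctly):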
"LOCATION","Country","INDICATOR","Indicator","MEASURE","Measure","INEQUALITY","Inequality","Unit Code","Unit","PowerCode Code","PowerCode","Reference Period Code","Reference Period","Value","Flag Codes","Flags"
"AUS","Australia","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,1.1,"E","Estimated value"
"AUT","Austria","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,1,,
"BEL","Belgium","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,2,,
"CAN","Canada","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.2,,
"CZE","Czech Republic","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.9,,
"DNK","Denmark","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.9,,
"FIN","Finland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.6,,
"FRA","France","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.5,,
"DEU","Germany","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.1,,
"GRC","Greece","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.7,,
"HUN","Hungary","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,4.8,,
"ISL","Iceland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.4,,
"IRL","Ireland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.2,,
"ITA","Italy","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,1.1,,
"JPN","Japan","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,6.4,,
"KOR","Korea","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,4.2,,
"LUX","Luxembourg","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.1,,
"MEX","Mexico","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,4.2,,
"NLD","Netherlands","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0,,
"NZL","New Zealand","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.2,"E","Estimated value"
"NOR","Norway","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.3,,
"POL","Poland","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,3.2,,
"PRT","Portugal","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.9,,
"SVK","Slovak Republic","HO_BASE","Dwellings without basic facilities","L","Value","TOT","Total","PC","Percentage","0","units",,,0.6,,
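These are the figures for the individual countries. A note: the data gathered by researchers abroad (the OECD) is fairly precise and plentiful. The table actually has several thousand rows, of which I took only a few. It is then combined with the following data: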
Country Subject Descriptor Units Scale Country/Series-specific Notes 2015 Estimates Start After
Afghanistan Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 599.994 2013
Albania Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 3,995.383 2010
Algeria Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 4,318.135 2014
Angola Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 4,100.315 2014
Antigua and Barbuda Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 14,414.302 2011
Argentina Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 13,588.846 2013
Armenia Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 3,534.860 2014
Australia Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 50,961.865 2014
Austria Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 43,724.031 2015
Azerbaijan Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 5,739.433 2014
The Bahamas Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 23,902.805 2013
Bahrain Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 23,509.981 2014
Bangladesh Gross domestic product per capita, current prices U.S. dollars Units See notes for: Gross domestic product, current prices (National currency) Population (Persons). 1,286.868 2013
Next, we use Python to merge the two data sets:
import pandas as pd
def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the rows aggregated over the whole population
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    # Reshape to one row per country and one column per indicator
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    # Side effect: append a copy of the renamed GDP table to ./Test.csv for inspection
    gdp_per_capita.to_csv("./Test.csv", encoding="utf-8-sig", mode="a", header=True, index=False)
    gdp_per_capita.set_index("Country", inplace=True)
    # Join the two tables on the country name
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    # Drop seven countries by position in the sorted table and keep the other 29
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = sorted(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", "Life satisfaction"]].iloc[keep_indices]
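For readers who want to see what the filter, pivot and merge steps do, here is a minimal sketch on a tiny hand-built table. The Australia and Austria values are copied from the samples in this article; the extra row with the code "OTHER" is a made-up placeholder, added only to show why the filter is needed (pivot raises an error when the same country and indicator pair appears twice).

```python
import pandas as pd

# Long format: one row per (country, indicator) pair, as in the OECD file
long_rows = pd.DataFrame({
    "Country": ["Australia", "Australia", "Austria", "Austria", "Austria"],
    "INEQUALITY": ["TOT", "TOT", "TOT", "TOT", "OTHER"],   # "OTHER" is a placeholder code
    "Indicator": ["Dwellings without basic facilities", "Life satisfaction",
                  "Dwellings without basic facilities", "Life satisfaction",
                  "Life satisfaction"],
    "Value": [1.1, 7.3, 1.0, 6.9, 0.0],                    # the 0.0 is made up
})

# Keep the totals only, then reshape to one row per country
totals = long_rows[long_rows["INEQUALITY"] == "TOT"]
wide = totals.pivot(index="Country", columns="Indicator", values="Value")

# GDP table with the year column renamed, indexed by country
gdp = pd.DataFrame({"Country": ["Australia", "Austria"],
                    "2015": [50961.865, 43724.031]})
gdp = gdp.rename(columns={"2015": "GDP per capita"}).set_index("Country")

# Join on the country index and sort by GDP, as the function does
merged = pd.merge(left=wide, right=gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")
print(merged[["GDP per capita", "Life satisfaction"]])
```

Unlike the function above, this sketch uses the non-inplace forms of rename and set_index, so the input tables are left unchanged.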
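The prepare_country_stats function above has nothing to do with the fitting itself, so it does not matter if you do not follow it. The data we end up with looks like this:
Country GDP per capita Life satisfaction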
Russia 9054.914 6.0
Turkey 9437.372 5.6
Hungary 12239.894 4.9
Poland 12495.334 5.8
Slovak Republic 15991.736 6.1
Estonia 17288.083 5.6
Greece 18064.288 4.8
Portugal 19121.592 5.1
Slovenia 20732.482 5.7
Spain 25864.721 6.5
Korea 27195.197 5.8
Italy 29866.581 6.0
Japan 32485.545 5.9
Israel 35343.336 7.4
New Zealand 37044.891 7.3
France 37675.006 6.5
Belgium 40106.632 6.9
Germany 40996.511 7.0
Finland 41973.988 7.4
Canada 43331.961 7.3
Netherlands 43603.115 7.3
Austria 43724.031 6.9
United Kingdom 43770.688 6.8
Sweden 49866.266 7.2
Iceland 50854.583 7.5
Australia 50961.865 7.3
Ireland 51350.744 7.0
Denmark 52114.165 7.5
United States 55805.204 7.2
code
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model
# Path to the CSV files
datapath = os.path.join("datasets", "lifesat", "")
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")
# Prepare the data (prepare_country_stats is the function defined earlier)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
print(country_stats)
# Visualize the data as a scatter plot
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()
# Select a linear model
model = sklearn.linear_model.LinearRegression()
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]
# Train the model
model.fit(X, y)
# Make a prediction
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))
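The fitted line itself can also be inspected. The sketch below does not read the CSV files; it assumes the 29 rows of the table printed earlier in this article are exactly the training data, and copies them in by hand. The names rows, intercept, slope and prediction are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# (GDP per capita, life satisfaction) pairs copied from the table in this article
rows = [
    (9054.914, 6.0), (9437.372, 5.6), (12239.894, 4.9), (12495.334, 5.8),
    (15991.736, 6.1), (17288.083, 5.6), (18064.288, 4.8), (19121.592, 5.1),
    (20732.482, 5.7), (25864.721, 6.5), (27195.197, 5.8), (29866.581, 6.0),
    (32485.545, 5.9), (35343.336, 7.4), (37044.891, 7.3), (37675.006, 6.5),
    (40106.632, 6.9), (40996.511, 7.0), (41973.988, 7.4), (43331.961, 7.3),
    (43603.115, 7.3), (43724.031, 6.9), (43770.688, 6.8), (49866.266, 7.2),
    (50854.583, 7.5), (50961.865, 7.3), (51350.744, 7.0), (52114.165, 7.5),
    (55805.204, 7.2),
]
X = np.array([[gdp] for gdp, _ in rows])     # shape (29, 1)
y = np.array([sat for _, sat in rows])       # shape (29,)

model = LinearRegression().fit(X, y)
intercept = float(model.intercept_)          # satisfaction predicted at GDP = 0
slope = float(model.coef_[0])                # change in satisfaction per dollar
prediction = float(model.predict([[22587]])[0])
print(intercept, slope, prediction)
```

The prediction is simply intercept + slope * 22587, which is the whole of what this model has learned.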
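The scatter plot looks like this: [figure: scatter plot of Life satisfaction against GDP per capita]
For a GDP per capita of 22587, the predicted life satisfaction is 5.96242338. The data follow a roughly linear trend, although in truth we chose this model in advance: higher income tends to go with higher life satisfaction. Satisfaction also depends on environment, geography and many other factors, so strictly speaking this model is missing something. That is fine; the goal here is to learn the method, not to actually predict our life satisfaction.
Next we use our own company's elder-care data, age and heart rate, for a linear analysis. In other words, we assume that age and heart rate are linearly related, in order to reveal the relationship between the two.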
code2
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.linear_model
age = [1, 2, 3, 10, 12, 15, 16, 18, 20, 25, 28, 30, 34, 40, 50, 60]  # age
heart = [90, 87, 85, 78, 76, 77, 75, 74, 65, 64, 67, 70, 72, 76, 64, 65]  # mean heart rate
# Stack the two lists as columns of one table
full = np.c_[age, heart]
data = pd.DataFrame(data=full, columns=['age', 'heart'])
print(data)
# Visualize the data as a scatter plot
data.plot(kind='scatter', x="age", y='heart')
plt.show()
X = np.c_[data["age"]]
Y = np.c_[data["heart"]]
# Fit the linear model and predict the heart rate at age 14
model = sklearn.linear_model.LinearRegression()
model.fit(X, Y)
X_new = [[14]]
print(model.predict(X_new))
The resulting data is shown below:
age heart
0 1 90
1 2 87
2 3 85
3 10 78
4 12 76
5 15 77
6 16 75
7 18 74
8 20 65
9 25 64
10 28 67
11 30 70
12 34 72
13 40 76
14 50 64
15 60 65
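The resulting scatter plot: [figure: scatter plot of heart against age]
Finally, entering the value we want to predict, age 14, gives [[77.27657005]].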
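The same number can be cross-checked without sklearn, using the textbook closed-form formulas for a one-variable least-squares line: the slope is the covariance of age and heart rate divided by the variance of age, and the intercept makes the line pass through the two means. This is a sketch of that check, not a description of how sklearn computes the fit internally.

```python
import numpy as np

age = np.array([1, 2, 3, 10, 12, 15, 16, 18, 20, 25, 28, 30, 34, 40, 50, 60], dtype=float)
heart = np.array([90, 87, 85, 78, 76, 77, 75, 74, 65, 64, 67, 70, 72, 76, 64, 65], dtype=float)

x_mean, y_mean = age.mean(), heart.mean()
# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
slope = np.sum((age - x_mean) * (heart - y_mean)) / np.sum((age - x_mean) ** 2)
# The least-squares line always passes through (x_mean, y_mean)
intercept = y_mean - slope * x_mean

prediction = intercept + slope * 14   # predicted mean heart rate at age 14
print(slope, intercept, prediction)
```

The slope is negative: under the linear assumption, the mean heart rate in this data drops by roughly a third of a beat per year of age.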
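We used the data at hand and a simple linear model to predict the relationship. In practice the chosen model is not necessarily the right one, and the real relationship is more complicated. Later we will gradually move on to more machine learning topics: classifiers, multilabel classification, and further models and techniques such as ridge regression, elastic net, probability estimation, support vector machines, decision trees, random forests, dimensionality reduction and neural networks.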
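One hedged way to probe whether a straight line suits the age and heart-rate data is to compute R squared, the share of the variance in heart rate that the fit explains, and compare it with a slightly more flexible model. The quadratic fit below is an illustrative choice, not a model from this article, and the helper r_squared is defined here for the sketch.

```python
import numpy as np

age = np.array([1, 2, 3, 10, 12, 15, 16, 18, 20, 25, 28, 30, 34, 40, 50, 60], dtype=float)
heart = np.array([90, 87, 85, 78, 76, 77, 75, 74, 65, 64, 67, 70, 72, 76, 64, 65], dtype=float)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

linear_coeffs = np.polyfit(age, heart, deg=1)      # straight line
quadratic_coeffs = np.polyfit(age, heart, deg=2)   # adds an age**2 term

r2_linear = r_squared(heart, np.polyval(linear_coeffs, age))
r2_quadratic = r_squared(heart, np.polyval(quadratic_coeffs, age))
print(r2_linear, r2_quadratic)
```

A more flexible model never scores lower on the data it was fitted to, so a higher training R squared alone does not prove it is the better model; judging that properly needs data held out from the fit.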
To be continued in the next chapter.
Sharing is welcome; when reposting, please credit the source: 内存溢出