1 - Linear Regression: A Python Implementation of the Basic Principles of Univariate Linear Regression

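Contents

A Python Implementation of the Basic Principles of Univariate Linear Regression
- 1 Environment setup
- 2 Changing the Jupyter working directory
- 3 Univariate linear regression
  - 2.1 Reading the data
  - 2.2 Feature construction
  - 2.3 Other preparation
  - 2.4 Linear regression body
    - 2.4.1 Computing the cost function
    - 2.4.2 Gradient descent and fitting
    - 2.4.3 Calling the code and fitting the regression
- References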

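A Python Implementation of the Basic Principles of Univariate Linear Regression

1. I watched the linear regression lectures in Andrew Ng's machine learning course; this article works through the corresponding linear regression exercise.
2. The code is based on a Python implementation of linear regression that someone shared online; this article explains the ideas alongside that code.
3. The related attachments have been uploaded to my personal downloads section.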

1 Environment setup

First, create a virtual environment:

# Create the virtual environment
conda create -n exec_py36 pip python=3.6

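The next steps make the newly created virtual environment available in Jupyter Notebook:

# 1. Open Anaconda Prompt and activate the virtual environment
conda activate exec_py36

# 2. Install ipykernel, which is used to manage Jupyter kernels
pip install ipykernel -i https://pypi.douban.com/simple  # uses the Douban mirror

# 3. Register the virtual environment with Jupyter Notebook
python -m ipykernel install --user --name exec_py36 --display-name "Python [conda env:exec_py36]"

Reopen Jupyter Notebook and the virtual environment appears in the kernel list; click it to switch.

After switching, the kernel name shown in the notebook changes accordingly.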

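2 Changing the Jupyter working directory

First create a new ipynb file and run the following code to see the current default location:

import os
print(os.path.abspath('.'))
# Output:
# C:\Users\yan

The default location is my user directory. Changing it to a dedicated folder makes files easier to manage:

# Open Anaconda Prompt and run the following command
jupyter notebook --generate-config
# Output:
# Writing default config to: C:\Users\yan\.jupyter\jupyter_notebook_config.py

Open the configuration file above and find the line # c.NotebookApp.notebook_dir = ''. Uncomment it (remove the leading #), put the path of your own newly created folder between the single quotes, and save.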
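For illustration, the edited line might look roughly like the following. The folder path is a hypothetical placeholder, not taken from the original setup, so substitute the folder you actually created.

```python
# In jupyter_notebook_config.py
# NOTE: the path below is a made-up example; use your own folder.
c.NotebookApp.notebook_dir = r'D:\jupyter_workspace'
```

Writing the path as a raw string (the r prefix) keeps the backslashes in a Windows path from being read as escape sequences.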

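In the Start menu, find the "Jupyter Notebook" shortcut, then right-click it > More > Open file location.

Then locate the "Jupyter Notebook" shortcut icon, right-click it > Properties > Target, remove the trailing "%USERPROFILE%/", click "Apply" and then "OK", and finally restart Jupyter Notebook.
Now create a new ipynb file, write any code and save it; the new file shows up in the working directory you set up earlier.

This makes files easier to manage, and it is also convenient to drop data files into that directory so they can be used from Jupyter Notebook.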

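3 Univariate linear regression

About the dataset: the first column is the population of each city, and the second column is the profit of a food truck in that city.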

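2.1 Reading the data

Install pandas, numpy, matplotlib, and seaborn (a higher-level visualization library built on matplotlib) in the virtual environment:

# I installed these from Anaconda Prompt
conda install pandas
conda install matplotlib
conda install seaborn

Installing pandas also installs numpy as a dependency, so numpy does not need to be installed separately (matplotlib likewise depends on numpy).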

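Read the data:

import pandas as pd
df = pd.read_csv('ex1data1.txt', names=['population', 'profit'])  # read the data and assign column names
df.head()  # show the first five rows
df.info()  # show summary information about the data
# (the original post shows the output as a screenshot)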

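Visualize the data:

import seaborn as sns
sns.set(context="notebook", style="whitegrid", palette="dark")  # basic plot settings
import matplotlib.pyplot as plt
# The data has only two columns, so a scatter plot is enough to see what it looks like
sns.lmplot(x='population', y='profit', data=df, height=6, fit_reg=False)  # keyword arguments; recent seaborn rejects positional ones
plt.show()
# The points lie roughly along a straight line, so linear regression is used to fit them next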

2.2 Feature construction

The hypothesis function $h_\theta(x)$ of multivariate linear regression is given by formula (1):

$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n \tag{1}$$

To allow vectorization, introduce $x_0=1$, so that $h_\theta(x)$ becomes formula (2):

$$h_\theta(x)=\theta_0x_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n \tag{2}$$

The parameter vector then has dimension $\theta \in R^{n+1}$, and the feature vector of any training example also has dimension $x\in R^{n+1}$, so the vectorized form of $h_\theta(x)$ is formula (3):

$$h_\theta(x)=\theta^TX \tag{3}$$
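As a small illustration of why adding $x_0=1$ helps, the sketch below evaluates the hypothesis both as the explicit sum and as an inner product. The parameter and feature values are made up for the example and do not come from the dataset.

```python
import numpy as np

# Hypothetical parameter values and a single feature value, for illustration only
theta_demo = np.array([1.0, 2.0])   # [theta_0, theta_1]
x1 = 3.0

# Explicit sum: theta_0 + theta_1 * x_1
h_explicit = theta_demo[0] + theta_demo[1] * x1

# Vectorized form: prepend x_0 = 1 and take the inner product theta^T x
x_demo = np.array([1.0, x1])
h_vectorized = theta_demo @ x_demo

print(h_explicit, h_vectorized)
```

Both expressions give the same number; the second one generalizes to any number of features without changing the code.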

Based on the above, construct $x_0=1$ in the dataset that was read in:

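import numpy as np

# Extract the features
def get_X(df):
    """
    Use concat to add the intercept term and avoid side effects on df.
    Not efficient for big datasets, though.
    """
    ones = pd.DataFrame({'ones': np.ones(len(df))})  # a DataFrame with m rows and 1 column
    data = pd.concat([ones, df], axis=1)  # concatenate column-wise
    # Take all feature columns (everything except the last) as an ndarray, not a matrix.
    # as_matrix() was removed in pandas 1.0; to_numpy() needs pandas 0.24 or later.
    return data.iloc[:, :-1].to_numpy()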
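To see what the resulting feature matrix looks like, here is a minimal sketch on a made-up three-row DataFrame. It builds the same kind of matrix with np.column_stack instead of pd.concat; that is an alternative chosen for brevity here, not the approach used in get_X.

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as the tutorial (values are made up)
toy = pd.DataFrame({'population': [1.0, 2.0, 3.0],
                    'profit': [10.0, 20.0, 30.0]})

# Take every column except the last as features, then prepend a column of ones
features = toy.iloc[:, :-1].to_numpy()
X_toy = np.column_stack([np.ones(len(toy)), features])

print(X_toy.shape)
print(X_toy)
```

The first column is the constant $x_0=1$ and the second column is the original feature, so each row is one training example in $R^{n+1}$.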

2.3 Other preparation

A function is also needed to get the labels (the last column, i.e. the regression target):

# Extract the labels
def get_y(df):
    """Assume the last column is the target."""
    return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column of df

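2.4 Linear regression body

Use the functions defined above to get the features and labels:

X = get_X(df)
print(X.shape, type(X))  # check the data dimensions

y = get_y(df)
print(y.shape, type(y))
# Output:
# (97, 2) <class 'numpy.ndarray'>
# (97,) <class 'numpy.ndarray'>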

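Construct the parameter vector $\theta$:

# From the hypothesis function, the length of the parameter vector is the number of
# original features plus one for the intercept term.
# For the univariate regression in this example that is 1 + 1 = 2.
theta = np.zeros(X.shape[1])  # X.shape[1] = 2, i.e. n + 1 (n features plus the intercept)
print(theta)
# Output:
# [ 0.  0.]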

2.4.1 Computing the cost function

The cost function $J(\theta)$ of univariate linear regression is computed as in formula (4):

$$J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m{({h_\theta(x^{(i)})-y^{(i)}})^2} \tag{4}$$

where $h_\theta(x)=\theta^TX=\theta_0x_0+\theta_1x_1$.

The cost function for linear regression can then be defined directly from this formula:

# Define the cost function
def lr_cost(theta, X: np.ndarray, y: np.ndarray):
    '''
    :param theta: shape (n+1,), the linear regression parameters
    :param X: shape (m, n+1), m samples, n features plus the intercept column
    :param y: shape (m,)
    :return: the scalar cost J(theta)
    '''
    m = X.shape[0]  # m is the number of samples
    # Residual of each sample: prediction minus label
    inner = X.dot(theta) - y  # X.dot(theta) is equivalent to np.dot(X, theta); inner has shape (m,)
    # Square and sum the residuals. In matrix terms this is (1, m) @ (m, 1) = (1, 1),
    # but .T does nothing to a 1-D numpy array, so this is simply the inner
    # product of the vector with itself.
    square_sum = np.dot(inner.T, inner)  # a scalar
    cost = square_sum / (2 * m)
    return cost

Then try the function to see the cost for the initial parameters:

lr_cost(theta, X, y)  # cost for the initial parameters
# Output:
# 32.072733877455669

The dimensions involved in the computation are shown in formulas (5) and (6):

$$inner_{(m,1)}=X_{(m,n+1)}.dot(\theta_{(n+1,1)})-y_{(m,1)} \tag{5}$$

$$square\_sum_{(1,1)}=(inner.T)_{(1,m)}.dot(inner_{(m,1)}) \tag{6}$$
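Formulas (5) and (6) describe the shapes in matrix terms, while the code works with 1-D numpy arrays. The sketch below, on made-up toy data, checks the actual shapes and confirms that the vectorized cost matches the explicit sum in formula (4). All variable names here are illustrative.

```python
import numpy as np

# Made-up toy data: 3 samples, an intercept column plus one feature
X_toy = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
theta0 = np.zeros(2)
m_toy = len(y_toy)

inner = X_toy.dot(theta0) - y_toy      # 1-D array of residuals, shape (3,)
square_sum = np.dot(inner.T, inner)    # .T is a no-op on 1-D arrays; result is a scalar
cost_vectorized = square_sum / (2 * m_toy)

# The same quantity written as the explicit sum over samples
cost_loop = sum((X_toy[i].dot(theta0) - y_toy[i]) ** 2 for i in range(m_toy)) / (2 * m_toy)

print(inner.shape, np.ndim(square_sum), cost_vectorized, cost_loop)
```

With all parameters at zero the residuals are just the negated labels, so the cost reduces to the sum of squared labels divided by 2m.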

2.4.2 Gradient descent and fitting

The gradient descent update rule for multivariate linear regression is given by formula (7):

$$\theta_j=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta) \tag{7}$$

Working out the derivative gives the form that is actually computed, formula (8):

$$\theta_j=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j) \tag{8}$$
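The step from formula (7) to formula (8) is the chain rule applied to the cost function in formula (4). Spelled out, using only the definitions already given:

```latex
\begin{aligned}
\frac{\partial}{\partial\theta_j}J(\theta)
&=\frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \\
&=\frac{1}{2m}\sum_{i=1}^m 2\left(h_\theta(x^{(i)})-y^{(i)}\right)\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) \\
&=\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}
\end{aligned}
```

The last line uses $\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})=x_j^{(i)}$, which follows from formula (2). The factor 2 from the square cancels the 2 in $\frac{1}{2m}$, which is why the cost function is defined with that factor.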

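First define a function that computes the averaged sum in formula (8), that is, the term multiplied by $\alpha$:

# Gradient term of the update rule: the sum in formula (8) divided by m
def gradient(theta, X, y):
    '''
    :param theta: shape (n+1,), the linear regression parameters
    :param X: shape (m, n+1), m samples, n features plus the intercept column
    :param y: shape (m,)
    :return: shape (n+1,), the same shape as the parameter vector theta
    '''
    m = X.shape[0]
    inner = np.dot(X.T, (np.dot(X, theta) - y))
    return inner / m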

The dimensions involved in the computation are shown in formula (9):

$$inner_{(n+1,1)}=(X_{(m,n+1)})^T.dot((X_{(m,n+1)}.dot(\theta_{(n+1,1)})-y_{(m,1)})) \tag{9}$$

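Why this one-line expression is enough takes a moment of thought, but it is not hard:
- First, the term $h_\theta(x^{(i)})-y^{(i)}$ involves every sample, regardless of which element of $\theta$ is being updated. Each row of the matrix $X$ is one sample, and by the rules of matrix multiplication every row is multiplied element by element with $\theta$ and summed, so a single matrix product gives the prediction for all samples at once. Subtracting the label vector $y$ then gives the difference vector (m-dimensional), where each element is the difference between the predicted and actual value for one sample.
- How is the matching $x_j$ used when updating $\theta_j$? After transposing, the first row of $X^T$ holds the first feature of all samples, and so on. Multiplying the first row of $X^T$ by the difference vector gives the sum, over all samples, of the first feature times the corresponding difference, and this first feature corresponds to the first element of the parameter vector. Multiplying the whole of $X^T$ by the difference vector therefore gives the "gradient" for every element of the parameter vector (the summation is done automatically by the matrix product).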
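The argument above can be checked numerically. The following sketch, on made-up toy data, computes the gradient once with the matrix expression and once with explicit loops over parameters and samples, following formula (8) literally. Variable names are illustrative.

```python
import numpy as np

# Made-up toy data: 4 samples, an intercept column plus one feature
X_toy = np.array([[1.0, 0.5],
                  [1.0, 1.5],
                  [1.0, 2.5],
                  [1.0, 4.0]])
y_toy = np.array([1.0, 2.0, 2.5, 5.0])
theta_toy = np.array([0.3, -0.2])   # arbitrary point at which to evaluate the gradient
m_toy = X_toy.shape[0]

# Vectorized form: one matrix product for the residuals, one for the sums
diff = X_toy.dot(theta_toy) - y_toy           # shape (m,): one residual per sample
grad_vectorized = X_toy.T.dot(diff) / m_toy   # shape (n+1,): one entry per parameter

# Element-by-element form: loop over parameters j and samples i
grad_loop = np.zeros_like(theta_toy)
for j in range(len(theta_toy)):
    for i in range(m_toy):
        grad_loop[j] += (X_toy[i].dot(theta_toy) - y_toy[i]) * X_toy[i, j]
grad_loop /= m_toy

print(grad_vectorized, grad_loop)
```

The two results agree up to floating-point rounding, which is the sense in which the matrix product "does the summation automatically".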

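Next define the full gradient descent procedure and fit the parameters (the only stopping criterion is the number of epochs):

# Batch gradient descent
def batch_gradient_decent(theta, X, y, epoch, alpha=0.01):
    '''
    :param theta: shape (n+1,), the linear regression parameters
    :param X: shape (m, n+1), m samples, n features plus the intercept column
    :param y: shape (m,)
    :param epoch: number of passes over the full training set
    :param alpha: learning rate, the alpha in the gradient descent update rule
    :return: the fitted parameters and the list of costs (initial cost plus one per epoch)
    '''
    cost_data = [lr_cost(theta, X, y)]
    _theta = theta.copy()  # work on a copy so the caller's theta is not modified

    for _ in range(epoch):
        _theta = _theta - alpha * gradient(_theta, X, y)
        cost_data.append(lr_cost(_theta, X, y))

    return _theta, cost_data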
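A quick way to gain confidence in this loop is to run the same update on data whose true parameters are known. The sketch below is self-contained, so it redefines compact cost and gradient helpers under hypothetical names (cost, grad). The data is synthetic and noise-free, generated from y = 1 + 2x, and the learning rate and iteration count are chosen only for this toy case.

```python
import numpy as np

def cost(theta, X, y):
    """Half the mean squared residual, as in the cost function above."""
    r = X.dot(theta) - y
    return r.dot(r) / (2 * len(y))

def grad(theta, X, y):
    """Gradient of the cost with respect to theta."""
    return X.T.dot(X.dot(theta) - y) / len(y)

# Synthetic, noise-free data generated from y = 1 + 2x (made up for this check)
x = np.linspace(0.0, 2.0, 20)
X_toy = np.column_stack([np.ones_like(x), x])
y_toy = 1.0 + 2.0 * x

theta_run = np.zeros(2)
costs = [cost(theta_run, X_toy, y_toy)]
for _ in range(2000):
    theta_run = theta_run - 0.1 * grad(theta_run, X_toy, y_toy)
    costs.append(cost(theta_run, X_toy, y_toy))

print(theta_run)
print(costs[0], costs[-1])
```

If the update is implemented correctly, the recovered parameters should approach the intercept 1 and slope 2 used to generate the data, and the recorded cost should fall toward zero.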

2.4.3 Calling the code and fitting the regression

Fit the univariate linear regression on the actual dataset:

epoch = 500
final_theta, cost_data = batch_gradient_decent(theta, X, y, epoch)
print(final_theta)
# Output:
# [-2.28286727  1.03099898]

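After fitting, look at how the cost changes over the epochs: after about 5 epochs the cost function starts to level off.

ax = sns.lineplot(x=np.arange(epoch + 1), y=cost_data)  # epoch on the x-axis, cost on the y-axis
ax.set_xlabel('epoch')
ax.set_ylabel('cost')
plt.show()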

Use the fitted parameters to draw the fitted line:

# Plot the final fitted line
b = final_theta[0]  # intercept on the y-axis
m = final_theta[1]  # slope

plt.scatter(df.population, df.profit, label="Training data")
plt.plot(df.population, df.population*m + b, label="Prediction")
plt.legend(loc=2)
plt.show()
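Once the parameters are fitted, predicting for a new input is just the hypothesis function again. The sketch below copies the parameter values printed in section 2.4.3 and wraps the formula in a hypothetical helper named predict. The query points are arbitrary examples in the same units as the population column.

```python
import numpy as np

# Parameters copied from the fitted result printed in section 2.4.3
final_theta = np.array([-2.28286727, 1.03099898])

def predict(population, theta):
    """Return theta_0 + theta_1 * population for a scalar or array input."""
    population = np.asarray(population, dtype=float)
    return theta[0] + theta[1] * population

# Hypothetical query points, in the same units as the 'population' column
predictions = predict([5.0, 10.0], final_theta)
print(predictions)
```

Because the model is a straight line, predictions far outside the range of the training data should be treated with caution.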