Binomial Logistic Regression (Logistic Regression)


1 Definition

The binomial logistic regression model is the following conditional probability distribution:
$$P(Y=1 \mid x)=\frac{\exp(\omega \cdot x+b)}{1+\exp(\omega \cdot x+b)}$$
$$P(Y=0 \mid x)=\frac{1}{1+\exp(\omega \cdot x+b)}$$
Here, $x \in R^n$ is the input, $Y \in \{0, 1\}$ is the output, $\omega \in R^n$ and $b \in R$ are the parameters, $\omega$ is called the weight vector, $b$ is called the bias, and $\omega \cdot x$ is the inner product of $\omega$ and $x$.

The weight vector and the input vector can be extended so that the bias is absorbed; we still write them as $\omega$ and $x$, i.e.
$$\omega = \begin{pmatrix} \omega^{(1)} & \omega^{(2)} & \omega^{(3)} & \cdots & \omega^{(n)} & b \end{pmatrix}$$
$$x = \begin{pmatrix} x^{(1)} & x^{(2)} & x^{(3)} & \cdots & x^{(n)} & 1 \end{pmatrix}$$
The logistic regression model then becomes:
$$P(Y=1 \mid x)=\frac{\exp(\omega \cdot x)}{1+\exp(\omega \cdot x)}$$
$$P(Y=0 \mid x)=\frac{1}{1+\exp(\omega \cdot x)}$$
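To make the definition concrete, here is a minimal numeric sketch (the weights and the input below are made-up toy values, not from the article):

import numpy as np

# toy extended weight vector (w^(1), w^(2), b) and extended input (x^(1), x^(2), 1)
w = np.array([0.5, -1.2, 0.3])
x = np.array([2.0, 1.0, 1.0])

z = np.dot(w, x)                  # inner product w . x
p1 = np.exp(z) / (1 + np.exp(z))  # P(Y=1|x)
p0 = 1 / (1 + np.exp(z))          # P(Y=0|x), equal to 1 - p1
print(p1, p0)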

2 Parameter Estimation

The model parameters are estimated by maximum likelihood, which yields the fitted logistic regression model.

The log-likelihood function is
$$L(\omega)=\sum_{i=1}^N \left[y_i (\omega \cdot x_i)-\log\left(1+\exp(\omega \cdot x_i)\right)\right]$$
In theory, maximizing $L(\omega)$ directly yields the estimate of $\omega$. In practice, however, the maximum is found iteratively by gradient ascent on $L(\omega)$ (equivalently, gradient descent on the negative log-likelihood).
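For reference, the log-likelihood itself is easy to evaluate with NumPy. The helper below is a minimal sketch (the name log_likelihood and the array shapes are assumptions made for illustration): X is the extended data matrix of shape (N, n+1), y holds the 0/1 labels of shape (N,), and w is the extended weight vector of shape (n+1,).

import numpy as np

def log_likelihood(w, X, y):
    z = X @ w                                    # z_i = w . x_i for every sample
    return np.sum(y * z - np.log(1 + np.exp(z)))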

Taking the partial derivative of $L(\omega)$ with respect to each component of $\omega$ gives
$$\frac{\partial L(\omega)}{\partial \omega^{(j)}} = \sum_{i=1}^N\left[x_i^{(j)} y_i-\frac{\exp(\omega \cdot x_i)\, x_i^{(j)}}{1+\exp(\omega \cdot x_i)}\right]$$
Stacking all of these partial derivatives into one vector gives the gradient; a quick numerical check of this formula is sketched below.
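As a sanity check, the analytic gradient can be compared with a finite-difference approximation of $L(\omega)$. The sketch below assumes the log_likelihood helper from the previous snippet and uses made-up toy data:

import numpy as np

def gradient(w, X, y):
    z = X @ w
    # sum_i x_i^(j) * (y_i - exp(z_i)/(1+exp(z_i))), computed for all j at once
    return X.T @ (y - np.exp(z) / (1 + np.exp(z)))

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])  # 5 toy samples, extended with ones
y = rng.integers(0, 2, size=5).astype(float)
w = rng.normal(size=3)

eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e, X, y) - log_likelihood(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numeric, gradient(w, X, y), atol=1e-5))  # should print True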

3 Learning Algorithm [1]

Input: training data set $T=\{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$, where $x_i \in \chi = R^n$, $y_i \in \Upsilon = \{0, 1\}$, $i = 1, 2, \cdots, N$; learning rate $\eta$ with $0 < \eta \le 1$; threshold $\delta > 0$;

Output: $\omega$;

  1. Choose an initial value $\omega_0$;

  2. Select a sample $(x_i, y_i)$ from the training set;

  3. If there exists $j \in [1, n]$ such that
    $$\left| x_i^{(j)} y_i-\frac{\exp(\omega \cdot x_i)\, x_i^{(j)}}{1+\exp(\omega \cdot x_i)} \right| > \delta$$
    then update $\omega$ with the gradient contributed by this single sample:
    $$\omega \leftarrow \omega + \eta\, x_i \left( y_i - \frac{\exp(\omega \cdot x_i)}{1+\exp(\omega \cdot x_i)} \right)$$

  4. Go back to step 2, until for all $j \in [1, n]$
    $$\left| x_i^{(j)} y_i-\frac{\exp(\omega \cdot x_i)\, x_i^{(j)}}{1+\exp(\omega \cdot x_i)} \right| \leq \delta$$
    (a code sketch of this procedure follows below).
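A minimal code sketch of the procedure above might look as follows (the function name fit_sgd and its default values are assumptions for illustration; it is essentially the per-sample update that reappears as the 'SGD' branch of the class in Section 5). It expects the extended data matrix X of shape (N, n+1) and 0/1 labels y:

import random
import numpy as np

def fit_sgd(X, y, eta=1.0, delta=1e-2, max_iter=10000):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        i = random.randint(0, X.shape[0] - 1)                # step 2: pick a sample
        z = np.dot(w, X[i])
        grad = X[i] * (y[i] - np.exp(z) / (1 + np.exp(z)))   # per-sample gradient
        if (np.abs(grad) <= delta).all():                    # step 4: every |component| <= delta
            break
        w += eta * grad                                      # step 3: update w
    return w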

4 Code Implementation
  1. Understand the data

    • The input matrix $\chi$ and the labels $\Upsilon$
      $$\chi = \begin{pmatrix} x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & \cdots & x_1^{(n)} \\ x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & \cdots & x_2^{(n)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_N^{(1)} & x_N^{(2)} & x_N^{(3)} & \cdots & x_N^{(n)} \end{pmatrix} \qquad \Upsilon = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$
    • The extended input matrix
      $$\chi = \begin{pmatrix} x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & \cdots & x_1^{(n)} & 1 \\ x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & \cdots & x_2^{(n)} & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ x_N^{(1)} & x_N^{(2)} & x_N^{(3)} & \cdots & x_N^{(n)} & 1 \end{pmatrix}$$
      The corresponding code is:
    import numpy as np

    def preprocessing(X):
        """Append a column of ones to X so the bias b is folded into w."""
        X_plus = np.ones(X.shape[0]).reshape(-1, 1)
        X_new = np.hstack([X, X_plus])
        return X_new
    X = preprocessing(X)
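      For example (hypothetical toy data), a (3, 2) input matrix gains a column of ones and becomes (3, 3):

    X_demo = np.array([[1.0, 2.0],
                       [3.0, 4.0],
                       [5.0, 6.0]])
    print(preprocessing(X_demo))   # each row now ends with a trailing 1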
    
  2. Initialize the parameters.

    • The weight vector $\omega$ is the vector whose inner product is taken with each sample $x_i$; it matches the extended input vector:

      $$\omega = \begin{pmatrix} \omega^{(1)} & \omega^{(2)} & \omega^{(3)} & \cdots & \omega^{(n)} & b \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \end{pmatrix}$$
      The corresponding code is:

      w = np.zeros(X.shape[1]).reshape(1, -1)   # row vector of shape (1, n+1)
      
    • The hyperparameter (learning rate) $\eta$, given the default value 1:
      $\eta = 1$

      eta = 1
      
    • The threshold $\delta$ on the gradient, given the default value 0.01:

      delta = 1e-2
      
  3. Compute the gradient

    • Definition
      $$\nabla L(\omega) = \left( \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right)$$

    • Breaking the formula down
      $$\frac{\partial L(\omega)}{\partial \omega^{(j)}} = \sum_{i=1}^N\left[x_i^{(j)} y_i-\frac{\exp(\omega \cdot x_i)\, x_i^{(j)}}{1+\exp(\omega \cdot x_i)}\right] = \sum_{i=1}^N\left[x_i^{(j)}\left(y_i-\frac{\exp(\omega \cdot x_i)}{1+\exp(\omega \cdot x_i)}\right)\right]$$
      The left-hand side is the partial derivative of $L(\omega)$ with respect to the $j$-th component of $\omega$, i.e. the $j$-th component of the gradient.

      The right-hand side is a sum. The index $i \in [1, N]$ runs over all samples. Inside the sum, $x_i$ is the feature vector of the $i$-th sample and $x_i^{(j)}$ is its $j$-th feature, i.e. the $j$-th component of that feature vector; the two are related as follows:
      $$x_i = \begin{pmatrix} x_i^{(1)} & x_i^{(2)} & \cdots & x_i^{(j)} & \cdots & x_i^{(n)} & 1 \end{pmatrix}$$
      So the right-hand side is the sum, over all samples, of a value computed from the $j$-th feature.

    • Evaluating the formula

      Working from the inside out: for a sample $x_i$, the innermost inner product is a scalar:
      $$z_i = \omega \cdot x_i = \begin{pmatrix} \omega^{(1)} & \omega^{(2)} & \omega^{(3)} & \cdots & \omega^{(n)} & b \end{pmatrix} \begin{pmatrix} x_i^{(1)} \\ x_i^{(2)} \\ \vdots \\ x_i^{(j)} \\ \vdots \\ x_i^{(n)} \\ 1 \end{pmatrix} = \omega^{(1)}x_i^{(1)} + \omega^{(2)}x_i^{(2)} + \cdots + \omega^{(n)}x_i^{(n)} + b$$
      Next comes a sigmoid transformation, whose output is again a scalar:
      $$\hat z_i = \frac{\exp(z_i)}{1+\exp(z_i)}$$

      Then $\hat z_i$ is subtracted from $y_i$ and the difference is multiplied by $x_i^{(j)}$; both are scalars, so the final value for the $i$-th sample and the $j$-th feature is:
      $$x_i^{(j)}\left(y_i - \hat z_i\right)$$
      Finally, summing this value over all samples gives the partial derivative with respect to the $j$-th component of $\omega$.

    • Vectorizing the formula

      The formula above, as given in the textbook, computes only one component of the gradient at a time. How can the whole gradient be computed in one go?

      Replace $x_i$ in the computation above with the full matrix $\chi$; the value of $z$ is then:
      $$z = \chi \cdot \omega = \begin{pmatrix} x_1^{(1)} & x_1^{(2)} & x_1^{(3)} & \cdots & x_1^{(n)} & 1 \\ x_2^{(1)} & x_2^{(2)} & x_2^{(3)} & \cdots & x_2^{(n)} & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ x_N^{(1)} & x_N^{(2)} & x_N^{(3)} & \cdots & x_N^{(n)} & 1 \end{pmatrix} \begin{pmatrix} \omega^{(1)} \\ \omega^{(2)} \\ \vdots \\ \omega^{(n)} \\ b \end{pmatrix} = \begin{pmatrix} \omega^{(1)}x_1^{(1)} + \omega^{(2)}x_1^{(2)} + \cdots + \omega^{(n)}x_1^{(n)} + b \\ \omega^{(1)}x_2^{(1)} + \omega^{(2)}x_2^{(2)} + \cdots + \omega^{(n)}x_2^{(n)} + b \\ \vdots \\ \omega^{(1)}x_N^{(1)} + \omega^{(2)}x_N^{(2)} + \cdots + \omega^{(n)}x_N^{(n)} + b \end{pmatrix} = \begin{pmatrix} x_1 \cdot \omega \\ x_2 \cdot \omega \\ \vdots \\ x_N \cdot \omega \end{pmatrix} = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{pmatrix}$$

      z = np.dot(X, w.T)  # the N inner products become a single matrix multiplication
      

      Apply the sigmoid function to $z$:

      $$\hat z = \frac{\exp(z)}{1+\exp(z)}$$

      def sigmoid(x):
          # algebraically identical to exp(x) / (1 + exp(x))
          return 1/(1 + np.exp(-x))
      z = sigmoid(z)
      

      Subtract from $\Upsilon$:

      $$\Upsilon - \hat z = \begin{pmatrix} y_1 - \hat z_1 \\ y_2 - \hat z_2 \\ \vdots \\ y_i - \hat z_i \\ \vdots \\ y_N - \hat z_N \end{pmatrix}$$

      y - z  # y must be a column vector of shape (N, 1), like z, so the difference is element-wise
      

      Then multiply with $\chi$:
      $$\begin{pmatrix} x_1^{(1)} & \cdots & x_1^{(j)} & \cdots & x_1^{(n)} & 1 \\ x_2^{(1)} & \cdots & x_2^{(j)} & \cdots & x_2^{(n)} & 1 \\ \vdots & & \vdots & & \vdots & \vdots \\ x_i^{(1)} & \cdots & x_i^{(j)} & \cdots & x_i^{(n)} & 1 \\ \vdots & & \vdots & & \vdots & \vdots \\ x_N^{(1)} & \cdots & x_N^{(j)} & \cdots & x_N^{(n)} & 1 \end{pmatrix} \times \begin{pmatrix} y_1 - \hat z_1 \\ y_2 - \hat z_2 \\ \vdots \\ y_i - \hat z_i \\ \vdots \\ y_N - \hat z_N \end{pmatrix} = \begin{pmatrix} x_1^{(1)}(y_1-\hat z_1) & \cdots & x_1^{(j)}(y_1-\hat z_1) & \cdots & x_1^{(n)}(y_1-\hat z_1) & (y_1-\hat z_1) \\ x_2^{(1)}(y_2-\hat z_2) & \cdots & x_2^{(j)}(y_2-\hat z_2) & \cdots & x_2^{(n)}(y_2-\hat z_2) & (y_2-\hat z_2) \\ \vdots & & \vdots & & \vdots & \vdots \\ x_i^{(1)}(y_i-\hat z_i) & \cdots & x_i^{(j)}(y_i-\hat z_i) & \cdots & x_i^{(n)}(y_i-\hat z_i) & (y_i-\hat z_i) \\ \vdots & & \vdots & & \vdots & \vdots \\ x_N^{(1)}(y_N-\hat z_N) & \cdots & x_N^{(j)}(y_N-\hat z_N) & \cdots & x_N^{(n)}(y_N-\hat z_N) & (y_N-\hat z_N) \end{pmatrix}$$
      Note that this is neither an inner product nor a matrix multiplication: each element of $\Upsilon - \hat z$ is multiplied with the corresponding element of every column of $\chi$. We do it this way because the value we ultimately want is:
      $$x_i^{(j)}\left(y_i - \hat z_i\right)$$
      Thanks to NumPy broadcasting, this can be written directly as:

      X * (y - z)  # broadcasting: each column of X is multiplied element-wise by (y - z)
      

      Finally, for each feature we sum the computed values over all samples; the result is the vector of partial derivatives, i.e. the gradient:
      $$\left( \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right) = \begin{pmatrix} \sum_{i=1}^N\left[x_i^{(1)}(y_i-\hat z_i)\right] & \sum_{i=1}^N\left[x_i^{(2)}(y_i-\hat z_i)\right] & \cdots & \sum_{i=1}^N\left[x_i^{(n)}(y_i-\hat z_i)\right] & \sum_{i=1}^N\left[(y_i-\hat z_i)\right] \end{pmatrix}$$
      In code:

      (X * (y - z)).sum(axis=0)  # sum over the samples (rows), one value per extended feature
      

      That completes the core gradient computation; the pieces fit together as follows:

      def sigmoid(x):
          return 1/(1 + np.exp(-x))

      z = np.dot(X, w.T)                          # shape (N, 1)
      # gradient: one partial derivative per extended feature
      grad = (X * (y - sigmoid(z))).sum(axis=0)   # y must be a column vector (N, 1)
      
  4. Update the parameters

    Use the computed gradient to update the parameter $\omega$:
    $$\hat\omega = \begin{pmatrix} \omega^{(1)} & \omega^{(2)} & \cdots & \omega^{(n)} & b \end{pmatrix} + \eta \left( \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right) = \left( \omega^{(1)}+\eta \frac{\partial L(\omega)}{\partial \omega^{(1)}} \quad \cdots \quad \omega^{(n)}+\eta \frac{\partial L(\omega)}{\partial \omega^{(n)}} \quad b+\eta \frac{\partial L(\omega)}{\partial \omega^{(n+1)}} \right)$$

    w += eta * grad  # gradient ascent: step in the direction that increases L(w)
    
  5. Stopping conditions

    The model uses two stopping conditions; iteration stops as soon as either one is satisfied.

    • When the gradient falls below the given threshold, further updates can bring only marginal gains, i.e. the parameters have reached a state we accept as (near-)optimal. Here we stop updating only when every element of the gradient is within the threshold in absolute value:

      if (np.abs(grad) <= delta).all():
          print('stop iterating')
      
    • A maximum number of iterations is set so that training is guaranteed to terminate (and, as a form of early stopping, it can also help limit overfitting). When the iteration count reaches the preset maximum, updating stops as well:

      loop = 1
      while loop <= max_iter:

          pass  # the model-training step goes here

          loop += 1
      
  6. Prediction

    Once the parameter $\omega$ has been estimated, the model definition lets us predict the label of a new sample:
    $$P(Y=1 \mid x)=\frac{\exp(\omega \cdot x)}{1+\exp(\omega \cdot x)} \qquad P(Y=0 \mid x)=\frac{1}{1+\exp(\omega \cdot x)}$$

    X_test = preprocessing(X_test)
    p = sigmoid(np.dot(X_test, w.T))   # P(Y=1|x) for every test sample
    p[np.where(p>=0.5)] = 1
    p[np.where(p<0.5)] = 0
    
5 Wrapping the Code in a Class

import numpy as np
import random


class LogisticRegression(object):
    """
    Binomial Logistic Regression classifier.
    
    Parameters
    ----------
    eta : float, default=0.1
        Step size for each iteration.
    
    max_iter : int, default=10000
        Maximum number of iterations taken for the solvers to converge.
        
    delta : float, default=1e-2
        Tolerance for the stopping criterion.
    
    method : {'BGD', 'SGD'}, default='BGD'
        The gradient method to use: 'BGD' is batch gradient descent and
        'SGD' is stochastic gradient descent.
        
    Attributes
    ----------
    w : ndarray of shape (1, n_features)
        Coefficient of the features in the model.
        
    """
    def __init__(self, 
                 eta=0.1, 
                 max_iter=10000, 
                 delta=1e-2, 
                 method='BGD'):
        self.eta = eta
        self.max_iter = max_iter
        self.delta = delta
        self.method = method
        self.w = None
        
    # the model's probability for Y = 1
    def sigmoid(self, x):
        """
        Sigmoid function.
        
        Parameters
        ----------
        x : float or ndarray of shape (n_samples, )
            The independent variable of the Sigmoid function.
            
        Returns
        -------
        y : float or ndarray of shape (n_samples, )
            Function value
        """
        return 1/(1 + np.exp(-x))
    
    def preprocessing(self, X):
        """
        Extend input vector.
        
        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Input vector.
            
        Returns
        -------
        X_new : ndarray of shape (n_samples, n_features + 1)
            Extend input vector.    
        """
        X_plus = np.ones(X.shape[0]).reshape(-1, 1)
        X_new = np.hstack([X, X_plus])
        return X_new
    
    def fit(self, X, y):
        """
        Fit the model according to the given training data.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.

        y : ndarray of shape (n_samples, )
            Target vector relative to X.

        Returns
        -------
        self
            Fitted estimator.
        """
        # initialize w; reshape y into a column vector so it broadcasts
        # element-wise against sigmoid(z) below
        X = self.preprocessing(X)
        y = np.asarray(y).reshape(-1, 1)
        self.w = np.zeros(X.shape[1]).reshape(1, -1)
        
        if self.method == 'BGD':
            loop = 1
            while loop <= self.max_iter:
                # z = X w^T: one inner product per sample
                z = np.dot(X, self.w.T)
                # gradient: one partial derivative per extended feature
                grad = (X * (y - self.sigmoid(z))).sum(axis=0)
                # size of the ascent step; stop when every component is within the threshold
                if (np.abs(grad) <= self.delta).all():
                    break
                else:
                    self.w += self.eta * grad
                loop += 1
        elif self.method == 'SGD':
            loop = 1
            while loop <= self.max_iter:
                # randomly pick one sample for this iteration
                index = random.randint(0, X.shape[0] - 1)
                xi = X[index]
                yi = y[index]
                z = np.dot(xi, self.w.T)
                grad = xi * (yi - self.sigmoid(z))
                if (np.abs(grad) <= self.delta).all():
                    break
                else:
                    self.w += self.eta * grad
                loop += 1
        else:
            raise ValueError("method must be 'BGD' or 'SGD'")

        return self
    
    def predict(self, X):
        """
        Predict class labels for samples in X.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            Samples.

        Returns
        -------
        p : ndarray of shape (n_samples, 1)
            Predicted class label per sample.
        """
        X = self.preprocessing(X)
        p = self.sigmoid(np.dot(X, self.w.T))
        p[np.where(p>=0.5)] = 1
        p[np.where(p<0.5)] = 0
        
        return p
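
A quick usage sketch (the toy data below is invented for illustration and is not from the article):

# toy, roughly separable data: two features per sample, labels in {0, 1}
X_train = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y_train = np.array([[1.0], [1.0], [0.0], [0.0]])   # labels as a column vector

clf = LogisticRegression(eta=0.1, max_iter=1000, delta=1e-2, method='BGD')
clf.fit(X_train, y_train)
print(clf.w)                                              # learned (w, b) in extended form
print(clf.predict(np.array([[3.5, 3.0], [1.0, 1.5]])))    # likely [[1.], [0.]] on this toy data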

  [1] This section was compiled by the author of this article; it does not appear in the original book.

