Python Machine Learning from Beginner to Advanced: Hyperparameter Tuning (with Detailed Code)

Hyperparameter Tuning for Machine Learning in Python

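🌸Author homepage: JoJo的数据分析历险记 (JoJo's Data Analysis Adventures)
📝About the author: a fourth-year statistics undergraduate who has been admitted by recommendation to a graduate statistics program at a top-3 university for statistics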

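Contents

Hyperparameter Tuning for Machine Learning in Python
💮1 Using GridSearchCV
🍁2. Using randomized search to select a model
🏵️3. Selecting the best model from multiple learning algorithms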

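Once we have chosen a model, the next step is to improve its accuracy, which calls for hyperparameter tuning. One option is to adjust the hyperparameters by hand until we find the best combination of values, but that is tedious work. Instead, we can let scikit-learn do the searching. All we need to do is tell it which hyperparameters we want to experiment with and which values to try, and it will use cross-validation to evaluate every possible combination of those values.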
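To make "evaluate every combination with cross-validation" concrete, here is a minimal sketch of the same idea written as an explicit loop. The small grid of values and the choice of the saga solver (picked because it can fit both L1 and L2 penalties) are assumptions made for illustration, not settings taken from the article.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, cross_val_score

features, target = datasets.load_iris(return_X_y=True)

# Hypothetical small grid: ParameterGrid enumerates every combination of these values.
grid = ParameterGrid({"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]})

results = []
for params in grid:
    # Assumption: saga is used because it supports both penalties;
    # max_iter is raised above the default of 100 to help it converge.
    model = LogisticRegression(solver="saga", max_iter=2000, **params)
    scores = cross_val_score(model, features, target, cv=5)
    results.append((scores.mean(), params))

# Keep the combination with the highest mean cross-validated accuracy.
best_score, best_params = max(results, key=lambda item: item[0])
print(len(results), best_params, round(best_score, 3))
```

GridSearchCV automates this loop and additionally refits the best combination on the full dataset.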

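💮1 Using GridSearchCV

This method exhaustively tries every combination of the candidate values and picks the hyperparameters with the best cross-validated score.

The code is as follows: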

# Import the required libraries
import numpy as np
from sklearn import linear_model, datasets
from sklearn.model_selection import GridSearchCV

# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create the model
logistic = linear_model.LogisticRegression()

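We tune two hyperparameters of logistic regression here: the type of regularization penalty (L1 or L2)
and C, the inverse of the regularization strength (smaller values mean stronger regularization).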

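# Candidate penalty types
penalty = ['l1', 'l2']

# Candidate values of C: 10 values spaced logarithmically between 1 and 10^4
C = np.logspace(0, 4, 10)

# Hyperparameter grid to search
hyperparameters = dict(C=C, penalty=penalty)

# Create the grid search object with 5-fold cross-validation
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5)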

By default, once the best hyperparameters have been found, GridSearchCV retrains the model on the entire dataset using those hyperparameters.

best_model = gridsearch.fit(features, target)

Now let's look at the best hyperparameters that were found:

best_model.best_estimator_.get_params()

{'C': 7.742636826811269,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

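The selected value of C is about 7.74, and the selected penalty is L2. Note that the lbfgs solver shown in this output cannot fit an L1 penalty in recent scikit-learn versions, so the 'l1' candidates fail during the search and only the L2 candidates are effectively compared. To compare both penalties, choose a solver that supports L1, such as saga.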
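Rather than reading the full parameter dictionary, the search results can also be inspected directly. The sketch below is self-contained and uses the best_params_, best_score_, and cv_results_ attributes of the fitted search object. The saga solver and the raised max_iter are assumptions made here so that both penalties can actually be fitted; the selected values may therefore differ from the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

features, target = datasets.load_iris(return_X_y=True)

# Assumption: saga supports both L1 and L2; max_iter is raised above the default of 100.
logistic = LogisticRegression(solver="saga", max_iter=2000)
hyperparameters = {"C": np.logspace(0, 4, 10), "penalty": ["l1", "l2"]}

search = GridSearchCV(logistic, hyperparameters, cv=5).fit(features, target)

# Only the tuned hyperparameters, without the estimator's other settings.
print(search.best_params_)

# Mean cross-validated accuracy of the best combination.
print(search.best_score_)

# One entry per candidate combination: 10 values of C x 2 penalties.
n_candidates = len(search.cv_results_["params"])
print(n_candidates)
```

The solver may emit convergence warnings for large values of C on unscaled data; these do not stop the search.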

best_model.predict(features)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

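🍁2. Using randomized search to select a model

Grid search works well when you are exploring relatively few combinations, as in the previous example. When the hyperparameter search space is large, it is usually better to use RandomizedSearchCV. This class is used in much the same way as GridSearchCV, but instead of trying every possible combination, it evaluates a fixed number of random combinations, drawing a random value for each hyperparameter at every iteration. This approach has two main benefits:

If you let the randomized search run for 1,000 iterations, it explores 1,000 different values for each hyperparameter sampled from a continuous distribution (rather than the handful of values per hyperparameter used in grid search).
Simply by setting the number of iterations, you get finer control over the computing budget you allocate to the hyperparameter search.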
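The first benefit can be checked without fitting any model. The following sketch uses scikit-learn's ParameterGrid and ParameterSampler to count how many distinct values of C each strategy would try. The grid and the distribution mirror the ones used in this article; the 1,000 iterations are only for illustration.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: 10 values of C x 2 penalties = 20 candidates, but only 10 distinct C values.
grid = ParameterGrid({"C": np.logspace(0, 4, 10), "penalty": ["l1", "l2"]})
grid_c_values = {params["C"] for params in grid}

# Randomized search: each iteration draws a fresh C from the uniform distribution on [0, 4].
sampler = ParameterSampler(
    {"C": uniform(loc=0, scale=4), "penalty": ["l1", "l2"]},
    n_iter=1000,
    random_state=1,
)
sampled = list(sampler)
sampled_c_values = {params["C"] for params in sampled}

print(len(grid), len(grid_c_values))
print(len(sampled), len(sampled_c_values))
```

The penalty is still drawn from only two options, so the gain in coverage applies to the continuous hyperparameter C.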

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# C is drawn from a uniform distribution on [0, 4]
c = uniform(loc=0, scale=4)

hyperparameters = dict(C=c, penalty=penalty)

randomizedsearchCV = RandomizedSearchCV(logistic, hyperparameters, random_state=1, n_iter=100, cv=5)

best_model = randomizedsearchCV.fit(features, target)

best_model.best_estimator_.get_params()

{'C': 1.668088018810296,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'warn',
 'n_jobs': None,
 'penalty': 'l1',
 'random_state': None,
 'solver': 'warn',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

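This time the best hyperparameters are C ≈ 1.67 with L1 regularization. The 'warn' values shown for solver and multi_class suggest that this output was produced with an older scikit-learn release, whose default solver could fit an L1 penalty.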

🏵️3. Selecting the best model from multiple learning algorithms

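from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Fix the random seed for reproducibility
np.random.seed(10)

# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create a pipeline with a placeholder 'classifier' step; the search swaps in each candidate
pipe = Pipeline([('classifier', RandomForestClassifier())])

# Candidate learning algorithms and their hyperparameters
search_space = [{'classifier':[LogisticRegression()],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators':[10, 100, 1000],
                 'classifier__max_features':[1, 2, 3]}]

# Create the grid search object
gridsearch = GridSearchCV(pipe, search_space, cv=5)

# Run the search
best_model = gridsearch.fit(features, target)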

best_model.best_estimator_.get_params()

{'memory': None,
 'steps': [('classifier', LogisticRegression(C=7.742636826811269))],
 'verbose': False,
 'classifier': LogisticRegression(C=7.742636826811269),
 'classifier__C': 7.742636826811269,
 'classifier__class_weight': None,
 'classifier__dual': False,
 'classifier__fit_intercept': True,
 'classifier__intercept_scaling': 1,
 'classifier__l1_ratio': None,
 'classifier__max_iter': 100,
 'classifier__multi_class': 'auto',
 'classifier__n_jobs': None,
 'classifier__penalty': 'l2',
 'classifier__random_state': None,
 'classifier__solver': 'lbfgs',
 'classifier__tol': 0.0001,
 'classifier__verbose': 0,
 'classifier__warm_start': False}

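The best estimator selected is a logistic regression, so for this dataset logistic regression achieved a better cross-validated score than the random forest.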
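To see how close the two algorithms actually were, the best cross-validated score of each classifier type can be read from cv_results_. This is a sketch under stated assumptions: the grids are smaller than the ones above to keep it quick, the saga solver is used so both penalties can be fitted, and the random forest is given a fixed random_state. The scores it prints are not the article's results.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

features, target = datasets.load_iris(return_X_y=True)

# Placeholder classifier step; the search replaces it with each candidate.
pipe = Pipeline([("classifier", RandomForestClassifier())])

# Assumption: reduced grids, chosen only to keep the sketch fast.
search_space = [
    {"classifier": [LogisticRegression(solver="saga", max_iter=2000)],
     "classifier__penalty": ["l1", "l2"],
     "classifier__C": np.logspace(0, 4, 5)},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 100],
     "classifier__max_features": [1, 2, 3]},
]

search = GridSearchCV(pipe, search_space, cv=5).fit(features, target)

# Group candidates by classifier class and keep the best mean score for each.
best_by_model = {}
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    name = type(params["classifier"]).__name__
    best_by_model[name] = max(best_by_model.get(name, 0.0), score)

for name, score in best_by_model.items():
    print(name, round(score, 3))
```

On a small dataset such as iris the gap between the two can be narrow, so it is worth looking at both scores rather than only at the winner.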


Feel free to share. When reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/langs/869481.html
