Configuring and using auto-sklearn (Python)

1. Installation pitfalls

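1) Installing swig: on macOS, installing it with brew is enough.

2) Memory limit: auto-sklearn's model fitting needs a lot of memory. I raised memory_limit (given in MB) to 300000; otherwise fit fails with an error.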
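Both pitfalls can be checked before starting a search. The following is a minimal sketch, not part of the original setup: the helper names find_swig and per_job_memory_limit_mb are hypothetical, and the 16 GB figure is only an example. It assumes, as the auto-sklearn documentation describes, that memory_limit applies per job, so total usage can reach n_jobs times memory_limit.

```python
import shutil


def find_swig():
    """Return the path of the swig executable, or None if it is not on PATH."""
    return shutil.which("swig")


def per_job_memory_limit_mb(total_memory_gb, n_jobs):
    """Split a total memory budget (in GB) evenly into a per-job limit in MB."""
    if total_memory_gb <= 0 or n_jobs < 1:
        raise ValueError("total_memory_gb must be positive and n_jobs must be >= 1")
    return int(total_memory_gb * 1024 // n_jobs)


swig_path = find_swig()
if swig_path is None:
    print("swig not found on PATH; install it first (for example with brew on macOS)")
else:
    print("swig found at", swig_path)

# Example budget (hypothetical): 16 GB shared by 4 parallel jobs
print(per_job_memory_limit_mb(16, 4))
```

The second helper is only arithmetic; the value it returns is what would be passed as memory_limit when n_jobs is greater than 1.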

2. Usage and exploration

1) Loading the data and importing the libraries, again using the iris dataset

"""
Created on Sat Apr 16 15:26:21 2022

@author: johnny
"""
import autosklearn
from autosklearn.classification import AutoSklearnClassifier
from sklearn.metrics import accuracy_score


# Load the data
from sklearn.datasets import load_iris
data = load_iris()
x = data.data
y = data.target

from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x,y,test_size=0.3,random_state=0)

2) Model configuration

# define search
model = AutoSklearnClassifier(
    memory_limit=300000,             # memory limit in MB
    time_left_for_this_task=5 * 60,  # total search budget in seconds
    per_run_time_limit=50,           # limit for a single model evaluation in seconds
    n_jobs=4,                        # number of parallel jobs
    tmp_folder='/Users/johnny/Downloads/CreditMaster/temp/autosklearn_classification_example_tmp',
)
# perform the search
import time
start = time.time()
model.fit(train_x, train_y)
end = time.time()

# elapsed wall-clock time in seconds
print(str(end - start))

time_left_for_this_task is the total time budget for the task, in seconds; here it is given 5 minutes. If this parameter is not specified, the search runs for one hour.

per_run_time_limit sets the time allotted to each single model evaluation, here 50 seconds.
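To get a feel for how the two limits interact, here is a back-of-the-envelope sketch. The function name runs_if_all_hit_limit is hypothetical, and the estimate is idealized: it ignores scheduling overhead and the time auto-sklearn spends on anything other than model evaluation, so it is only a rough floor on how many evaluations fit into the budget.

```python
def runs_if_all_hit_limit(total_budget_s, per_run_limit_s, n_jobs=1):
    """Evaluations that fit in the budget if every run uses its full time limit."""
    if total_budget_s <= 0 or per_run_limit_s <= 0 or n_jobs < 1:
        raise ValueError("budgets must be positive and n_jobs must be >= 1")
    if per_run_limit_s > total_budget_s:
        raise ValueError("per_run_limit_s should not exceed total_budget_s")
    # Each job can run this many full-length evaluations one after another
    sequential_slots = total_budget_s // per_run_limit_s
    return sequential_slots * n_jobs


# Settings from the configuration above: 5 minutes total, 50 s per run, 4 jobs
print(runs_if_all_hit_limit(5 * 60, 50, n_jobs=4))
```

In practice most evaluations on a small dataset like iris finish well before the per-run limit, so the real number of runs is usually higher than this estimate.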

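ensemble_size and initial_configurations_via_metalearning can be used to fine-tune the classifier. By default, the search above builds an ensemble of the best-performing models, warm-started via meta-learning. To avoid overfitting, both can be disabled by setting ensemble_size = 1 and initial_configurations_via_metalearning = 0.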
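A minimal sketch of that configuration is shown below. The parameter names come from the text above; the names single_model_kwargs and build_classifier are hypothetical. Depending on the installed auto-sklearn version, ensemble_size may be deprecated or need to be passed differently, so treat this as an assumption to check against the documentation of your release.

```python
# Keyword arguments for a search that keeps a single model and skips
# the meta-learning warm start (values taken from the text above)
single_model_kwargs = {
    "time_left_for_this_task": 5 * 60,             # total budget in seconds
    "per_run_time_limit": 50,                      # per-evaluation limit in seconds
    "ensemble_size": 1,                            # keep only one model
    "initial_configurations_via_metalearning": 0,  # no meta-learning warm start
}


def build_classifier(kwargs):
    """Create the classifier; auto-sklearn is imported lazily so that the
    dictionary above can be inspected without the package installed."""
    from autosklearn.classification import AutoSklearnClassifier
    return AutoSklearnClassifier(**kwargs)


print(single_model_kwargs)
```

Calling build_classifier(single_model_kwargs) and then fit on the training data would run the restricted search.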

3) Results

Run time: 311.1847469806671 seconds

# Summary of the search results
print(model.sprint_statistics())

# Model leaderboard
print(model.leaderboard())

Output of print(model.sprint_statistics()):
auto-sklearn results:
  Dataset name: 2b87dd22-be00-11ec-83d9-acde48001122
  Metric: accuracy
  Best validation score: 0.971429
  Number of target algorithm runs: 57
  Number of successful target algorithm runs: 57
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

Output of print(model.leaderboard()):
          rank  ensemble_weight                 type      cost  duration
model_id                                                                
56           1             0.12                  lda  0.028571  6.529774
55           2             0.06                  qda  0.028571  6.576634
43           3             0.06        liblinear_svc  0.057143  4.191880
27           4             0.04                  qda  0.057143  2.138941
53           5             0.04  k_nearest_neighbors  0.085714  5.523627
52           6             0.12                  qda  0.085714  4.785708
6            7             0.02        random_forest  0.085714  6.293557
9            8             0.04           libsvm_svc  0.085714  3.043304
48           9             0.06        liblinear_svc  0.114286  4.556017
45          10             0.04        liblinear_svc  0.114286  5.430604
35          11             0.02                  lda  0.114286  0.922487
3           12             0.02        random_forest  0.114286  5.901496
23          13             0.02   passive_aggressive  0.114286  1.694792
20          14             0.02          extra_trees  0.114286  2.763696
13          15             0.04                  mlp  0.114286  4.000428
5           16             0.08   passive_aggressive  0.114286  2.807672
4           17             0.02        random_forest  0.114286  5.275651
28          18             0.04        liblinear_svc  0.114286  2.893949
34          19             0.02          extra_trees  0.114286  5.547182
49          20             0.04        liblinear_svc  0.142857  5.844000
46          21             0.08             adaboost  0.171429  5.250963

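This shows that 57 models were run in total, with a best validation score of 0.97. The leaderboard has only 21 rows because, by default, it lists just the models that ended up in the final ensemble. The cost column is 1 minus the validation accuracy, so the best cost of 0.028571 corresponds to the score of 0.971429.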
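The numbers in the leaderboard can be sanity-checked with a few lines of arithmetic. The values below are copied from the output above. The reading of the costs as multiples of 1/35 is an inference from the numbers (it would match a validation set of 35 samples), not something the output states.

```python
# ensemble_weight column, in leaderboard order
weights = [0.12, 0.06, 0.06, 0.04, 0.04, 0.12, 0.02, 0.04, 0.06, 0.04, 0.02,
           0.02, 0.02, 0.02, 0.04, 0.08, 0.02, 0.04, 0.02, 0.04, 0.08]

# distinct values of the cost column
distinct_costs = [0.028571, 0.057143, 0.085714, 0.114286, 0.142857, 0.171429]

# The ensemble weights should add up to 1
print("sum of weights:", round(sum(weights), 2))

for cost in distinct_costs:
    accuracy = round(1 - cost, 6)      # cost is 1 - validation accuracy
    errors_of_35 = round(cost * 35)    # assumed validation set of 35 samples
    print(cost, "-> accuracy", accuracy, "-> errors out of 35:", errors_of_35)
```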

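from sklearn.metrics import confusion_matrix, accuracy_score
import seaborn as sns

# Predict on the held-out test set first, then score the predictions
y_pred = model.predict(test_x)
m1_acc_score = accuracy_score(test_y, y_pred)
print(m1_acc_score)

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf_matrix = confusion_matrix(test_y, y_pred)
sns.heatmap(conf_matrix, annot=True)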
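To make the row and column convention of the confusion matrix concrete, here is a small hand-rolled sketch. The function confusion_counts and the toy labels are made up for illustration; it is not how sklearn implements confusion_matrix, only the same counting idea with true labels as rows and predictions as columns.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count (true label, predicted label) pairs; rows are true labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Toy labels (hypothetical): one sample of class 1 is predicted as class 2
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]

toy_matrix = confusion_counts(toy_true, toy_pred, n_classes=3)
for row in toy_matrix:
    print(row)

# Accuracy is the diagonal (correct predictions) divided by the sample count
toy_accuracy = sum(toy_matrix[i][i] for i in range(3)) / len(toy_true)
print(toy_accuracy)
```

Swapping the two arguments transposes the matrix, which changes how off-diagonal cells should be read but not the accuracy.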

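3. Conclusion

auto-sklearn is an automated machine learning method that combines model search with ensembling, and it is quite handy when you want to save yourself some work.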

Original article: http://outofmemory.cn/langs/715858.html
