autogluon秒杀机器学习分类问题_python

1.数据预处理

autogluon竟都不需要对str类数据进行处理，为了简便只对yes/no做了一个简单处理，同时，预测的label列也不用专门摘出来，在后续训练中直接指定即可。

from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd

df1=pd.read_csv('/Users/johnny/Downloads/CreditMaster/heart_2020_cleaned.csv')

healthy = df1[df1['HeartDisease']=='No']
unhealthy = df1[df1['HeartDisease']=='Yes']
up_sampled = resample(unhealthy, replace=True, n_samples=len(healthy))
df_new = pd.concat([healthy,up_sampled])
df_new = df_new.replace({'No': 0, 'Yes': 1})
df_new = df_new.replace({'Male': 0, 'Female': 1})
train, test = train_test_split(df_new,test_size=0.1,random_state=0)

2.模型训练

指定要预测的心脏病列尾预测label，然后调用fit即可。

label = 'HeartDisease'
predictor = TabularPredictor(label = label,presets='best_quality', ).fit(train)
TabularPredictor()
#训练过程中发生了什么
results = predictor.fit_summary()

No path specified. Models will be saved in: "AutogluonModels/ag-20220418_130233/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220418_130233/"
AutoGluon Version:  0.2.0
Train Data Rows:    526359
Train Data Columns: 17
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [1, 0]
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    3317.47 MB
	Train Data (Original)  Memory Usage: 172.08 MB (5.2% of available memory)
	Warning: Data size prior to feature transformation consumes 5.2% of available memory. Consider increasing memory or subsampling the data to avoid instability.
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('float', [])  : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
		('int', [])    : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
		('object', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
	Types of features in processed data (raw dtype, special dtypes):
		('category', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
		('float', [])    : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
		('int', [])      : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
	2.5s = Fit runtime
	17 features in original data used to generate 17 features in processed data.
	Train Data (Processed) Memory Usage: 56.85 MB (1.7% of available memory)
Data preprocessing and feature engineering runtime = 2.87s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric argument of fit()
Automatically generating train/validation split with holdout_frac=0.01, Train Rows: 521095, Val Rows: 5264
Fitting model: KNeighborsUnif ...
	0.8575	 = Validation accuracy score
	210.89s	 = Training runtime
	0.8s	 = Validation runtime
Fitting model: KNeighborsDist ...
	0.8733	 = Validation accuracy score
	222.96s	 = Training runtime
	0.54s	 = Validation runtime
Fitting model: LightGBMXT ...
	0.762	 = Validation accuracy score
	2.54s	 = Training runtime
	0.03s	 = Validation runtime
Fitting model: LightGBM ...
[1000]	train_set's binary_error: 0.199513	valid_set's binary_error: 0.206117
[2000]	train_set's binary_error: 0.177611	valid_set's binary_error: 0.18712
[3000]	train_set's binary_error: 0.159547	valid_set's binary_error: 0.171353
[4000]	train_set's binary_error: 0.144434	valid_set's binary_error: 0.158245
[5000]	train_set's binary_error: 0.131868	valid_set's binary_error: 0.143427
[6000]	train_set's binary_error: 0.120433	valid_set's binary_error: 0.137538
[7000]	train_set's binary_error: 0.111007	valid_set's binary_error: 0.129939
[8000]	train_set's binary_error: 0.102417	valid_set's binary_error: 0.120631
[9000]	train_set's binary_error: 0.0965006	valid_set's binary_error: 0.117781
[10000]	train_set's binary_error: 0.0900143	valid_set's binary_error: 0.111512
	0.8889	 = Validation accuracy score
	200.23s	 = Training runtime
	2.84s	 = Validation runtime
Fitting model: RandomForestGini ...
	Warning: Reducing model 'n_estimators' from 300 -> 69 due to low memory. Expected memory usage reduced from 65.11% -> 15.0% of available memory...
	0.9745	 = Validation accuracy score
	33.25s	 = Training runtime
	0.24s	 = Validation runtime
Fitting model: RandomForestEntr ...
	Warning: Reducing model 'n_estimators' from 300 -> 79 due to low memory. Expected memory usage reduced from 56.4% -> 15.0% of available memory...
	0.9723	 = Validation accuracy score
	37.25s	 = Training runtime
	0.23s	 = Validation runtime
Fitting model: CatBoost ...
	0.7625	 = Validation accuracy score
	13.05s	 = Training runtime
	0.03s	 = Validation runtime
Fitting model: ExtraTreesGini ...
	Warning: Reducing model 'n_estimators' from 300 -> 55 due to low memory. Expected memory usage reduced from 80.61% -> 15.0% of available memory...
	0.9675	 = Validation accuracy score
	18.75s	 = Training runtime
	0.23s	 = Validation runtime
Fitting model: ExtraTreesEntr ...
	Warning: Reducing model 'n_estimators' from 300 -> 55 due to low memory. Expected memory usage reduced from 81.73% -> 15.0% of available memory...
	0.966	 = Validation accuracy score
	20.15s	 = Training runtime
	0.23s	 = Validation runtime
Fitting model: NeuralNetFastAI ...
	0.7924	 = Validation accuracy score
	1075.78s	 = Training runtime
	0.19s	 = Validation runtime
Fitting model: XGBoost ...
	0.7642	 = Validation accuracy score
	21.55s	 = Training runtime
	0.09s	 = Validation runtime
Fitting model: NeuralNetMXNet ...
	0.8539	 = Validation accuracy score
	3770.0s	 = Training runtime
	0.54s	 = Validation runtime
Fitting model: LightGBMLarge ...
	0.7667	 = Validation accuracy score
	2.05s	 = Training runtime
	0.04s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	0.9753	 = Validation accuracy score
	4.05s	 = Training runtime
	0.03s	 = Validation runtime
AutoGluon training complete, total runtime = 5663.74s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220418_130233/")

fit summary摘要

results = predictor.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val     fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.975304       0.801397   260.260579                0.025225           4.047571            2       True         14
1      RandomForestGini   0.974544       0.235665    33.249032                0.235665          33.249032            1       True          5
2      RandomForestEntr   0.972264       0.234039    37.253645                0.234039          37.253645            1       True          6
3        ExtraTreesGini   0.967515       0.227900    18.753269                0.227900          18.753269            1       True          8
4        ExtraTreesEntr   0.965995       0.230008    20.152724                0.230008          20.152724            1       True          9
5              LightGBM   0.888868       2.842197   200.232388                2.842197         200.232388            1       True          4
6        KNeighborsDist   0.873290       0.540507   222.963976                0.540507         222.963976            1       True          2
7        KNeighborsUnif   0.857523       0.798026   210.889044                0.798026         210.889044            1       True          1
8        NeuralNetMXNet   0.853913       0.544325  3769.998433                0.544325        3769.998433            1       True         12
9       NeuralNetFastAI   0.792363       0.191066  1075.781896                0.191066        1075.781896            1       True         10
10        LightGBMLarge   0.766717       0.037551     2.054227                0.037551           2.054227            1       True         13
11              XGBoost   0.764248       0.086684    21.547890                0.086684          21.547890            1       True         11
12             CatBoost   0.762538       0.025575    13.046758                0.025575          13.046758            1       True          7
13           LightGBMXT   0.761968       0.034197     2.544285                0.034197           2.544285            1       True          3
Number of models trained: 14
Types of models trained:
{'XGBoostModel', 'WeightedEnsembleModel', 'RFModel', 'NNFastAiTabularModel', 'KNNModel', 'XTModel', 'LGBModel', 'CatBoostModel', 'TabularNeuralNetModel'}
Bagging used: False 
Multi-layer stack-ensembling used: False 
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
('float', [])    : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', [])      : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
*** End of fit() summary ***

3.测试效果

#测试
y_test=test[label]
test_nolab = test.drop(columns=[label])
y_pred = predictor.predict(test_nolab)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
#看一下各模型测试效果
predictor_leaderboard=predictor.leaderboard(test, silent=True)

准确率97.4····直接秒杀了


Evaluation: accuracy on test data: 0.9745917756689749
Evaluations on test data:
{
    "accuracy": 0.9745917756689749,
    "balanced_accuracy": 0.9744652085677961,
    "mcc": 0.950355347895357,
    "f1": 0.975332005312085,
    "precision": 0.9522528363047001,
    "recall": 0.9995576726777815
}

	model	score_test	score_val	pred_time_test	pred_time_val	fit_time	pred_time_test_marginal	pred_time_val_marginal	fit_time_marginal	stack_level	can_infer	fit_order
0	WeightedEnsemble_L2	0.9745917756689749	0.9753039513677811	7.646605968475342	0.8013973236083984	260.2605793476105	0.25141096115112305	0.025225162506103516	4.047571182250977	2	True	14
1	RandomForestGini	0.9734974779858083	0.9745440729483282	2.6908559799194336	0.23566508293151855	33.24903202056885	2.6908559799194336	0.23566508293151855	33.24903202056885	1	True	5
2	RandomForestEntr	0.9724202787039412	0.9722644376899696	1.8741211891174316	0.2340388298034668	37.253644943237305	1.8741211891174316	0.2340388298034668	37.253644943237305	1	True	6
3	ExtraTreesGini	0.9699410105155168	0.9675151975683891	2.3864328861236572	0.22789978981018066	18.75326895713806	2.3864328861236572	0.22789978981018066	18.75326895713806	1	True	8
4	ExtraTreesEntr	0.9686073352141574	0.9659954407294833	2.554309844970703	0.23000812530517578	20.152723789215088	2.554309844970703	0.23000812530517578	20.152723789215088	1	True	9
5	LightGBM	0.887082157818244	0.8888677811550152	31.935580015182495	2.8421969413757324	200.23238825798035	31.935580015182495	2.8421969413757324	200.23238825798035	1	True	4
6	KNeighborsDist	0.8715910062409165	0.873290273556231	4.704339027404785	0.5405070781707764	222.96397614479065	4.704339027404785	0.5405070781707764	222.96397614479065	1	True	2
7	KNeighborsUnif	0.8559117722492947	0.8575227963525835	4.5526628494262695	0.7980260848999023	210.8890438079834	4.5526628494262695	0.7980260848999023	210.8890438079834	1	True	1
8	NeuralNetMXNet	0.8486449516970164	0.8539133738601824	4.86820387840271	0.5443248748779297	3769.9984328746796	4.86820387840271	0.5443248748779297	3769.9984328746796	1	True	12
9	NeuralNetFastAI	0.7926476874412243	0.7923632218844985	1.275090217590332	0.19106578826904297	1075.7818961143494	1.275090217590332	0.19106578826904297	1075.7818961143494	1	True	10
10	LightGBMLarge	0.7692741728648371	0.7667173252279635	0.08189797401428223	0.037551164627075195	2.054226875305176	0.08189797401428223	0.037551164627075195	2.054226875305176	1	True	13
11	XGBoost	0.7683679575959648	0.7642477203647416	0.5143089294433594	0.08668398857116699	21.547889709472656	0.5143089294433594	0.08668398857116699	21.547889709472656	1	True	11
12	LightGBMXT	0.7679404975634778	0.761968085106383	0.09745597839355469	0.034197092056274414	2.5442848205566406	0.09745597839355469	0.034197092056274414	2.5442848205566406	1	True	3
13	CatBoost	0.766948790288108	0.7625379939209727	0.05634880065917969	0.02557516098022461	13.046757936477661	0.05634880065917969	0.02557516098022461	13.046757936477661	1	True	7

4.欠采样下的效果

healthy = df1[df1['HeartDisease']=='No']
unhealthy = df1[df1['HeartDisease']=='Yes']
down_sample = resample(healthy,replace=True, n_samples=len(unhealthy))
#up_sampled = resample(unhealthy, replace=True, n_samples=len(healthy))
#df_new = pd.concat([healthy,up_sampled])
df_new = pd.concat([unhealthy,down_sample])
df_new = df_new.replace({'No': 0, 'Yes': 1})
df_new = df_new.replace({'Male': 0, 'Female': 1})


train, test = train_test_split(df_new,test_size=0.1,random_state=0)

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2     0.7768       0.625114  101.197595                0.005337           1.812670            2       True         14
1       NeuralNetFastAI     0.7720       0.115848   85.281701                0.115848          85.281701            1       True         10
2              LightGBM     0.7676       0.025734    0.591603                0.025734           0.591603            1       True          4
3            LightGBMXT     0.7664       0.019911    0.341896                0.019911           0.341896            1       True          3
4              CatBoost     0.7648       0.014946    1.833666                0.014946           1.833666            1       True          7
5        NeuralNetMXNet     0.7640       0.222282  117.917249                0.222282         117.917249            1       True         12
6         LightGBMLarge     0.7632       0.028388    0.849337                0.028388           0.849337            1       True         13
7               XGBoost     0.7620       0.043253    4.062333                0.043253           4.062333            1       True         11
8      RandomForestGini     0.7600       0.327819    5.914564                0.327819           5.914564            1       True          5
9      RandomForestEntr     0.7508       0.220787    7.310740                0.220787           7.310740            1       True          6
10       ExtraTreesGini     0.7468       0.219916    3.687547                0.219916           3.687547            1       True          8
11       ExtraTreesEntr     0.7468       0.222551    4.025320                0.222551           4.025320            1       True          9
12       KNeighborsUnif     0.6416       0.117654    0.555420                0.117654           0.555420            1       True          1
13       KNeighborsDist     0.6360       0.117145    0.582664                0.117145           0.582664            1       True          2
Number of models trained: 14
Types of models trained:
{'XGBoostModel', 'WeightedEnsembleModel', 'RFModel', 'NNFastAiTabularModel', 'KNNModel', 'XTModel', 'LGBModel', 'CatBoostModel', 'TabularNeuralNetModel'}
Bagging used: False 
Multi-layer stack-ensembling used: False 
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
('float', [])    : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', [])      : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
*** End of fit() summary ***

Evaluation: accuracy on test data: 0.7616438356164383
Evaluations on test data:
{
    "accuracy": 0.7616438356164383,
    "balanced_accuracy": 0.761491105442422,
    "mcc": 0.524730138929046,
    "f1": 0.7714135575407252,
    "precision": 0.7436676798378926,
    "recall": 0.8013100436681223
}

5.用SMOTE方法做采样增强后测试

1）SMOTE方法resample，注意非数字型的无法做，所以又删掉一些特征

from imblearn.over_sampling import SMOTE


df_new = df1.replace({'No': 0, 'Yes': 1})
df_new = df_new.replace({'Male': 0, 'Female': 1})

xx_value=df_new[['BMI','Smoking','AlcoholDrinking','Stroke','PhysicalHealth',
           'MentalHealth','DiffWalking','Sex','PhysicalActivity',
           'SleepTime','Asthma','KidneyDisease','SkinCancer',]]

xx_value = xx_value.values
yy_value = df_new['HeartDisease'].values

smo = SMOTE(random_state=42)
X_smo, y_smo = smo.fit_resample(xx_value, yy_value)

2）训练

data = pd.DataFrame(X_smo)
data['HeartDisease']=y_smo


train, test = train_test_split(data,test_size=0.1,random_state=0)

label = 'HeartDisease'
predictor = TabularPredictor(label = label).fit(train)

3）训练摘要

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val     fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.901216       1.002287   145.400241                0.011244           3.338790            2       True         14
1        ExtraTreesGini   0.897036       0.224937    24.353068                0.224937          24.353068            1       True          8
2      RandomForestGini   0.896087       0.226434    35.604268                0.226434          35.604268            1       True          5
3      RandomForestEntr   0.895897       0.226969    37.558002                0.226969          37.558002            1       True          6
4        ExtraTreesEntr   0.895517       0.222005    18.165524                0.222005          18.165524            1       True          9
5        KNeighborsDist   0.852204       0.539366   235.674579                0.539366         235.674579            1       True          2
6            LightGBMXT   0.843465       1.915129   140.687293                1.915129         140.687293            1       True          3
7         LightGBMLarge   0.843275       0.187716    24.997420                0.187716          24.997420            1       True         13
8              LightGBM   0.840995       0.124987    19.548694                0.124987          19.548694            1       True          4
9        KNeighborsUnif   0.832067       0.530559   221.375748                0.530559         221.375748            1       True          1
10              XGBoost   0.831497       0.060697    65.653685                0.060697          65.653685            1       True         11
11             CatBoost   0.826368       0.015359    15.212961                0.015359          15.212961            1       True          7
12       NeuralNetMXNet   0.821619       0.185644  5761.783737                0.185644        5761.783737            1       True         12
13      NeuralNetFastAI   0.799962       0.136735   956.822347                0.136735         956.822347            1       True         10
Number of models trained: 14
Types of models trained:
{'XGBoostModel', 'WeightedEnsembleModel', 'RFModel', 'NNFastAiTabularModel', 'KNNModel', 'XTModel', 'LGBModel', 'CatBoostModel', 'TabularNeuralNetModel'}
Bagging used: False 
Multi-layer stack-ensembling used: False 
Feature Metadata (Processed):
(raw dtype, special dtypes):
('float', []) : 13 | ['0', '1', '2', '3', '4', ...]
*** End of fit() summary ***

4）测试

#测试
y_test=test[label]
test_nolab = test.drop(columns=[label])
y_pred = predictor.predict(test_nolab)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
#看一下各模型测试效果
predictor_leaderboard=predictor.leaderboard(test, silent=True)

结果：

Evaluation: accuracy on test data: 0.8909634949132256
Evaluations on test data:
{
    "accuracy": 0.8909634949132256,
    "balanced_accuracy": 0.891186628285348,
    "mcc": 0.785394718318957,
    "f1": 0.8862489074401099,
    "precision": 0.9312490628280102,
    "recall": 0.8453973115535137
}

predictor_leaderboard
Out[124]: 
                  model  score_test  ...  can_infer  fit_order
0   WeightedEnsemble_L2    0.890963  ...       True         14
1        ExtraTreesEntr    0.888518  ...       True          9
2        ExtraTreesGini    0.888262  ...       True          8
3      RandomForestGini    0.886638  ...       True          5
4      RandomForestEntr    0.886621  ...       True          6
5        KNeighborsDist    0.847568  ...       True          2
6         LightGBMLarge    0.842370  ...       True         13
7            LightGBMXT    0.838420  ...       True          3
8              LightGBM    0.836762  ...       True          4
9               XGBoost    0.830760  ...       True         11
10       KNeighborsUnif    0.827785  ...       True          1
11             CatBoost    0.823852  ...       True          7
12       NeuralNetMXNet    0.810840  ...       True         12
13      NeuralNetFastAI    0.791143  ...       True         10

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/langs/718963.html

autogluon秒杀机器学习分类问题

发表评论

评论列表（0条）