autogluon竟都不需要对str类数据进行处理,为了简便只对yes/no做了一个简单处理,同时,预测的label列也不用专门摘出来,在后续训练中直接指定即可。
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd
df1=pd.read_csv('/Users/johnny/Downloads/CreditMaster/heart_2020_cleaned.csv')
healthy = df1[df1['HeartDisease']=='No']
unhealthy = df1[df1['HeartDisease']=='Yes']
up_sampled = resample(unhealthy, replace=True, n_samples=len(healthy))
df_new = pd.concat([healthy,up_sampled])
df_new = df_new.replace({'No': 0, 'Yes': 1})
df_new = df_new.replace({'Male': 0, 'Female': 1})
train, test = train_test_split(df_new,test_size=0.1,random_state=0)
2.模型训练
指定要预测的心脏病列尾预测label,然后调用fit即可。
label = 'HeartDisease'
predictor = TabularPredictor(label = label,presets='best_quality', ).fit(train)
TabularPredictor()
#训练过程中发生了什么
results = predictor.fit_summary()
No path specified. Models will be saved in: "AutogluonModels/ag-20220418_130233/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220418_130233/"
AutoGluon Version: 0.2.0
Train Data Rows: 526359
Train Data Columns: 17
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [1, 0]
If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 3317.47 MB
Train Data (Original) Memory Usage: 172.08 MB (5.2% of available memory)
Warning: Data size prior to feature transformation consumes 5.2% of available memory. Consider increasing memory or subsampling the data to avoid instability.
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', []) : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
('object', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
('float', []) : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', []) : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
2.5s = Fit runtime
17 features in original data used to generate 17 features in processed data.
Train Data (Processed) Memory Usage: 56.85 MB (1.7% of available memory)
Data preprocessing and feature engineering runtime = 2.87s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric argument of fit()
Automatically generating train/validation split with holdout_frac=0.01, Train Rows: 521095, Val Rows: 5264
Fitting model: KNeighborsUnif ...
0.8575 = Validation accuracy score
210.89s = Training runtime
0.8s = Validation runtime
Fitting model: KNeighborsDist ...
0.8733 = Validation accuracy score
222.96s = Training runtime
0.54s = Validation runtime
Fitting model: LightGBMXT ...
0.762 = Validation accuracy score
2.54s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBM ...
[1000] train_set's binary_error: 0.199513 valid_set's binary_error: 0.206117
[2000] train_set's binary_error: 0.177611 valid_set's binary_error: 0.18712
[3000] train_set's binary_error: 0.159547 valid_set's binary_error: 0.171353
[4000] train_set's binary_error: 0.144434 valid_set's binary_error: 0.158245
[5000] train_set's binary_error: 0.131868 valid_set's binary_error: 0.143427
[6000] train_set's binary_error: 0.120433 valid_set's binary_error: 0.137538
[7000] train_set's binary_error: 0.111007 valid_set's binary_error: 0.129939
[8000] train_set's binary_error: 0.102417 valid_set's binary_error: 0.120631
[9000] train_set's binary_error: 0.0965006 valid_set's binary_error: 0.117781
[10000] train_set's binary_error: 0.0900143 valid_set's binary_error: 0.111512
0.8889 = Validation accuracy score
200.23s = Training runtime
2.84s = Validation runtime
Fitting model: RandomForestGini ...
Warning: Reducing model 'n_estimators' from 300 -> 69 due to low memory. Expected memory usage reduced from 65.11% -> 15.0% of available memory...
0.9745 = Validation accuracy score
33.25s = Training runtime
0.24s = Validation runtime
Fitting model: RandomForestEntr ...
Warning: Reducing model 'n_estimators' from 300 -> 79 due to low memory. Expected memory usage reduced from 56.4% -> 15.0% of available memory...
0.9723 = Validation accuracy score
37.25s = Training runtime
0.23s = Validation runtime
Fitting model: CatBoost ...
0.7625 = Validation accuracy score
13.05s = Training runtime
0.03s = Validation runtime
Fitting model: ExtraTreesGini ...
Warning: Reducing model 'n_estimators' from 300 -> 55 due to low memory. Expected memory usage reduced from 80.61% -> 15.0% of available memory...
0.9675 = Validation accuracy score
18.75s = Training runtime
0.23s = Validation runtime
Fitting model: ExtraTreesEntr ...
Warning: Reducing model 'n_estimators' from 300 -> 55 due to low memory. Expected memory usage reduced from 81.73% -> 15.0% of available memory...
0.966 = Validation accuracy score
20.15s = Training runtime
0.23s = Validation runtime
Fitting model: NeuralNetFastAI ...
0.7924 = Validation accuracy score
1075.78s = Training runtime
0.19s = Validation runtime
Fitting model: XGBoost ...
0.7642 = Validation accuracy score
21.55s = Training runtime
0.09s = Validation runtime
Fitting model: NeuralNetMXNet ...
0.8539 = Validation accuracy score
3770.0s = Training runtime
0.54s = Validation runtime
Fitting model: LightGBMLarge ...
0.7667 = Validation accuracy score
2.05s = Training runtime
0.04s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.9753 = Validation accuracy score
4.05s = Training runtime
0.03s = Validation runtime
AutoGluon training complete, total runtime = 5663.74s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220418_130233/")
fit summary摘要
results = predictor.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L2 0.975304 0.801397 260.260579 0.025225 4.047571 2 True 14
1 RandomForestGini 0.974544 0.235665 33.249032 0.235665 33.249032 1 True 5
2 RandomForestEntr 0.972264 0.234039 37.253645 0.234039 37.253645 1 True 6
3 ExtraTreesGini 0.967515 0.227900 18.753269 0.227900 18.753269 1 True 8
4 ExtraTreesEntr 0.965995 0.230008 20.152724 0.230008 20.152724 1 True 9
5 LightGBM 0.888868 2.842197 200.232388 2.842197 200.232388 1 True 4
6 KNeighborsDist 0.873290 0.540507 222.963976 0.540507 222.963976 1 True 2
7 KNeighborsUnif 0.857523 0.798026 210.889044 0.798026 210.889044 1 True 1
8 NeuralNetMXNet 0.853913 0.544325 3769.998433 0.544325 3769.998433 1 True 12
9 NeuralNetFastAI 0.792363 0.191066 1075.781896 0.191066 1075.781896 1 True 10
10 LightGBMLarge 0.766717 0.037551 2.054227 0.037551 2.054227 1 True 13
11 XGBoost 0.764248 0.086684 21.547890 0.086684 21.547890 1 True 11
12 CatBoost 0.762538 0.025575 13.046758 0.025575 13.046758 1 True 7
13 LightGBMXT 0.761968 0.034197 2.544285 0.034197 2.544285 1 True 3
Number of models trained: 14
Types of models trained:
{'XGBoostModel', 'WeightedEnsembleModel', 'RFModel', 'NNFastAiTabularModel', 'KNNModel', 'XTModel', 'LGBModel', 'CatBoostModel', 'TabularNeuralNetModel'}
Bagging used: False
Multi-layer stack-ensembling used: False
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
('float', []) : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', []) : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
*** End of fit() summary ***
3.测试效果
#测试
y_test=test[label]
test_nolab = test.drop(columns=[label])
y_pred = predictor.predict(test_nolab)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
#看一下各模型测试效果
predictor_leaderboard=predictor.leaderboard(test, silent=True)
准确率97.4····直接秒杀了
Evaluation: accuracy on test data: 0.9745917756689749
Evaluations on test data:
{
"accuracy": 0.9745917756689749,
"balanced_accuracy": 0.9744652085677961,
"mcc": 0.950355347895357,
"f1": 0.975332005312085,
"precision": 0.9522528363047001,
"recall": 0.9995576726777815
}
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L2 0.9745917756689749 0.9753039513677811 7.646605968475342 0.8013973236083984 260.2605793476105 0.25141096115112305 0.025225162506103516 4.047571182250977 2 True 14
1 RandomForestGini 0.9734974779858083 0.9745440729483282 2.6908559799194336 0.23566508293151855 33.24903202056885 2.6908559799194336 0.23566508293151855 33.24903202056885 1 True 5
2 RandomForestEntr 0.9724202787039412 0.9722644376899696 1.8741211891174316 0.2340388298034668 37.253644943237305 1.8741211891174316 0.2340388298034668 37.253644943237305 1 True 6
3 ExtraTreesGini 0.9699410105155168 0.9675151975683891 2.3864328861236572 0.22789978981018066 18.75326895713806 2.3864328861236572 0.22789978981018066 18.75326895713806 1 True 8
4 ExtraTreesEntr 0.9686073352141574 0.9659954407294833 2.554309844970703 0.23000812530517578 20.152723789215088 2.554309844970703 0.23000812530517578 20.152723789215088 1 True 9
5 LightGBM 0.887082157818244 0.8888677811550152 31.935580015182495 2.8421969413757324 200.23238825798035 31.935580015182495 2.8421969413757324 200.23238825798035 1 True 4
6 KNeighborsDist 0.8715910062409165 0.873290273556231 4.704339027404785 0.5405070781707764 222.96397614479065 4.704339027404785 0.5405070781707764 222.96397614479065 1 True 2
7 KNeighborsUnif 0.8559117722492947 0.8575227963525835 4.5526628494262695 0.7980260848999023 210.8890438079834 4.5526628494262695 0.7980260848999023 210.8890438079834 1 True 1
8 NeuralNetMXNet 0.8486449516970164 0.8539133738601824 4.86820387840271 0.5443248748779297 3769.9984328746796 4.86820387840271 0.5443248748779297 3769.9984328746796 1 True 12
9 NeuralNetFastAI 0.7926476874412243 0.7923632218844985 1.275090217590332 0.19106578826904297 1075.7818961143494 1.275090217590332 0.19106578826904297 1075.7818961143494 1 True 10
10 LightGBMLarge 0.7692741728648371 0.7667173252279635 0.08189797401428223 0.037551164627075195 2.054226875305176 0.08189797401428223 0.037551164627075195 2.054226875305176 1 True 13
11 XGBoost 0.7683679575959648 0.7642477203647416 0.5143089294433594 0.08668398857116699 21.547889709472656 0.5143089294433594 0.08668398857116699 21.547889709472656 1 True 11
12 LightGBMXT 0.7679404975634778 0.761968085106383 0.09745597839355469 0.034197092056274414 2.5442848205566406 0.09745597839355469 0.034197092056274414 2.5442848205566406 1 True 3
13 CatBoost 0.766948790288108 0.7625379939209727 0.05634880065917969 0.02557516098022461 13.046757936477661 0.05634880065917969 0.02557516098022461 13.046757936477661 1 True 7
4.欠采样下的效果
healthy = df1[df1['HeartDisease']=='No']
unhealthy = df1[df1['HeartDisease']=='Yes']
down_sample = resample(healthy,replace=True, n_samples=len(unhealthy))
#up_sampled = resample(unhealthy, replace=True, n_samples=len(healthy))
#df_new = pd.concat([healthy,up_sampled])
df_new = pd.concat([unhealthy,down_sample])
df_new = df_new.replace({'No': 0, 'Yes': 1})
df_new = df_new.replace({'Male': 0, 'Female': 1})
train, test = train_test_split(df_new,test_size=0.1,random_state=0)
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L2 0.7768 0.625114 101.197595 0.005337 1.812670 2 True 14
1 NeuralNetFastAI 0.7720 0.115848 85.281701 0.115848 85.281701 1 True 10
2 LightGBM 0.7676 0.025734 0.591603 0.025734 0.591603 1 True 4
3 LightGBMXT 0.7664 0.019911 0.341896 0.019911 0.341896 1 True 3
4 CatBoost 0.7648 0.014946 1.833666 0.014946 1.833666 1 True 7
5 NeuralNetMXNet 0.7640 0.222282 117.917249 0.222282 117.917249 1 True 12
6 LightGBMLarge 0.7632 0.028388 0.849337 0.028388 0.849337 1 True 13
7 XGBoost 0.7620 0.043253 4.062333 0.043253 4.062333 1 True 11
8 RandomForestGini 0.7600 0.327819 5.914564 0.327819 5.914564 1 True 5
9 RandomForestEntr 0.7508 0.220787 7.310740 0.220787 7.310740 1 True 6
10 ExtraTreesGini 0.7468 0.219916 3.687547 0.219916 3.687547 1 True 8
11 ExtraTreesEntr 0.7468 0.222551 4.025320 0.222551 4.025320 1 True 9
12 KNeighborsUnif 0.6416 0.117654 0.555420 0.117654 0.555420 1 True 1
13 KNeighborsDist 0.6360 0.117145 0.582664 0.117145 0.582664 1 True 2
Number of models trained: 14
Types of models trained:
{'XGBoostModel', 'WeightedEnsembleModel', 'RFModel', 'NNFastAiTabularModel', 'KNNModel', 'XTModel', 'LGBModel', 'CatBoostModel', 'TabularNeuralNetModel'}
Bagging used: False
Multi-layer stack-ensembling used: False
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 4 | ['AgeCategory', 'Race', 'Diabetic', 'GenHealth']
('float', []) : 4 | ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']
('int', []) : 9 | ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', ...]
*** End of fit() summary ***
Evaluation: accuracy on test data: 0.7616438356164383
Evaluations on test data:
{
"accuracy": 0.7616438356164383,
"balanced_accuracy": 0.761491105442422,
"mcc": 0.524730138929046,
"f1": 0.7714135575407252,
"precision": 0.7436676798378926,
"recall": 0.8013100436681223
}
5.用SMOTE方法做采样增强后测试
1)SMOTE方法resample,注意非数字型的无法做,所以又删掉一些特征
from imblearn.over_sampling import SMOTE
df_new = df1.replace({'No': 0, 'Yes': 1})
df_new = df_new.replace({'Male': 0, 'Female': 1})
xx_value=df_new[['BMI','Smoking','AlcoholDrinking','Stroke','PhysicalHealth',
'MentalHealth','DiffWalking','Sex','PhysicalActivity',
'SleepTime','Asthma','KidneyDisease','SkinCancer',]]
xx_value = xx_value.values
yy_value = df_new['HeartDisease'].values
smo = SMOTE(random_state=42)
X_smo, y_smo = smo.fit_resample(xx_value, yy_value)
2)训练
data = pd.DataFrame(X_smo)
data['HeartDisease']=y_smo
train, test = train_test_split(data,test_size=0.1,random_state=0)
label = 'HeartDisease'
predictor = TabularPredictor(label = label).fit(train)
3)训练摘要
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L2 0.901216 1.002287 145.400241 0.011244 3.338790 2 True 14
1 ExtraTreesGini 0.897036 0.224937 24.353068 0.224937 24.353068 1 True 8
2 RandomForestGini 0.896087 0.226434 35.604268 0.226434 35.604268 1 True 5
3 RandomForestEntr 0.895897 0.226969 37.558002 0.226969 37.558002 1 True 6
4 ExtraTreesEntr 0.895517 0.222005 18.165524 0.222005 18.165524 1 True 9
5 KNeighborsDist 0.852204 0.539366 235.674579 0.539366 235.674579 1 True 2
6 LightGBMXT 0.843465 1.915129 140.687293 1.915129 140.687293 1 True 3
7 LightGBMLarge 0.843275 0.187716 24.997420 0.187716 24.997420 1 True 13
8 LightGBM 0.840995 0.124987 19.548694 0.124987 19.548694 1 True 4
9 KNeighborsUnif 0.832067 0.530559 221.375748 0.530559 221.375748 1 True 1
10 XGBoost 0.831497 0.060697 65.653685 0.060697 65.653685 1 True 11
11 CatBoost 0.826368 0.015359 15.212961 0.015359 15.212961 1 True 7
12 NeuralNetMXNet 0.821619 0.185644 5761.783737 0.185644 5761.783737 1 True 12
13 NeuralNetFastAI 0.799962 0.136735 956.822347 0.136735 956.822347 1 True 10
Number of models trained: 14
Types of models trained:
{'XGBoostModel', 'WeightedEnsembleModel', 'RFModel', 'NNFastAiTabularModel', 'KNNModel', 'XTModel', 'LGBModel', 'CatBoostModel', 'TabularNeuralNetModel'}
Bagging used: False
Multi-layer stack-ensembling used: False
Feature Metadata (Processed):
(raw dtype, special dtypes):
('float', []) : 13 | ['0', '1', '2', '3', '4', ...]
*** End of fit() summary ***
4)测试
#测试
y_test=test[label]
test_nolab = test.drop(columns=[label])
y_pred = predictor.predict(test_nolab)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
#看一下各模型测试效果
predictor_leaderboard=predictor.leaderboard(test, silent=True)
结果:
Evaluation: accuracy on test data: 0.8909634949132256
Evaluations on test data:
{
"accuracy": 0.8909634949132256,
"balanced_accuracy": 0.891186628285348,
"mcc": 0.785394718318957,
"f1": 0.8862489074401099,
"precision": 0.9312490628280102,
"recall": 0.8453973115535137
}
predictor_leaderboard
Out[124]:
model score_test ... can_infer fit_order
0 WeightedEnsemble_L2 0.890963 ... True 14
1 ExtraTreesEntr 0.888518 ... True 9
2 ExtraTreesGini 0.888262 ... True 8
3 RandomForestGini 0.886638 ... True 5
4 RandomForestEntr 0.886621 ... True 6
5 KNeighborsDist 0.847568 ... True 2
6 LightGBMLarge 0.842370 ... True 13
7 LightGBMXT 0.838420 ... True 3
8 LightGBM 0.836762 ... True 4
9 XGBoost 0.830760 ... True 11
10 KNeighborsUnif 0.827785 ... True 1
11 CatBoost 0.823852 ... True 7
12 NeuralNetMXNet 0.810840 ... True 12
13 NeuralNetFastAI 0.791143 ... True 10
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)