The Relationship and Differences Between Bagging and Boosting


The training data used by the Bagging algorithm is obtained with the Bootstrap method. Bootstrap is a non-parametric statistical resampling technique: we repeatedly resample the observed data and use the resampled sets to draw inferences about the underlying population distribution. For example, in the familiar Random Forest algorithm, the different classification and regression trees are each trained on a dataset obtained by Bootstrap resampling. One benefit of using Bootstrap is that it avoids the small-sample problem that arises when splitting data for cross-validation. In addition, the resampled datasets tend to contain fewer of the noisy observations than the original data, which makes it easier to obtain a good classifier.

The Bootstrap procedure, in outline: draw n observations with replacement from the original sample of size n; compute the statistic of interest on the resampled set; repeat this B times and use the empirical distribution of the B statistics for inference, as sketched below.
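As an illustration only (not the original post's code), here is a minimal NumPy sketch of the procedure; the data, the number of replications B, and the statistic (the sample mean) are all arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # the observed sample
n, B = len(data), 1000                            # sample size, bootstrap replications

boot_means = np.empty(B)
for b in range(B):
    # draw n observations with replacement from the original sample
    resample = rng.choice(data, size=n, replace=True)
    boot_means[b] = resample.mean()

# use the empirical distribution of the statistic for inference,
# e.g. a 95% percentile confidence interval for the population mean
print("bootstrap 95% CI:", np.percentile(boot_means, [2.5, 97.5]))
```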

Of course, the Bootstrap method is suited to small samples and is particularly useful when the data are hard to split into meaningful training and test sets. In ensemble learning, the training sets for the base learners are often obtained by Bootstrap; if the sample is large enough, however, cross-validation is usually preferable to Bootstrap.

With Bootstrap sampling understood, Bagging is simply this: build one classifier on each of the resampled datasets and train these models in parallel. Because the classifiers are independent of one another, training can be fully parallelized, so the wall-clock cost of Bagging is of the same order as training a single weak classifier, which makes it an efficient ensemble algorithm. The benefit of Bagging is that it improves accuracy and stability while reducing the variance of the result, which helps avoid overfitting; and because it relies on Bootstrap sampling, it also reduces the influence of noise and better reflects the true distribution of the sample.

The Bagging algorithm flow is: (1) draw B bootstrap samples from the training set; (2) train one base learner on each sample, in parallel; (3) aggregate the B predictions, by majority vote for classification or by averaging for regression. A minimal sketch of this flow follows.
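The following is a hedged, from-scratch sketch of that flow for binary classification, using scikit-learn decision trees as the base learners; the dataset and all parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
n, B = X.shape[0], 25                    # training-set size, number of base learners

learners = []
for b in range(B):
    idx = rng.integers(0, n, size=n)     # bootstrap sample: draw n indices with replacement
    learners.append(DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx]))

# aggregate by majority vote over the B trees (labels are 0/1 here)
votes = np.stack([tree.predict(X) for tree in learners])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the bagged ensemble:", (y_pred == y).mean())
```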

From this flow we can see that, because the final prediction is decided by voting, Bagging can achieve high accuracy and a lower generalization error. The drawback is that if, in some region of the input space, the majority of the classifiers make the same wrong prediction, the voted result will be wrong as well. In other words, Bagging makes no adjustment for the places where the base classifiers misclassify or perform poorly.

So when should we use Bagging? When the learning algorithm is unstable, for example neural networks, kNN, or subset selection in linear regression; these are unstable (weak) learners, and Bagging can strengthen them. If the original algorithm is already very stable, applying Bagging may be counterproductive.

Random Forest is a good example of a model that exploits Bagging: its weak learners are decision trees, and on top of Bagging it introduces random attribute (feature) selection, which increases the diversity among the individual classifiers and thus improves the generalization ability of the ensemble. Random Forest is not discussed further here.

Like Bagging, Boosting is an important family of ensemble algorithms. The difference is that Bagging trains its models in parallel, whereas Boosting builds them sequentially, and the final result is obtained by summing (combining) the predictions of the successive models.

As noted above, Bagging does not adapt to the places where a base learner performs poorly, and Boosting improves on exactly this point in its overall mechanism. Concretely: first train a base learner on the initial training set; then, based on its performance, assign larger weights to the samples it predicted incorrectly, so that these samples receive more attention when the next learner is trained. After the sample distribution has been adjusted according to the current base learner, the next base learner is trained on it, and this process is iterated until the specified number of learners is reached. Boosting is therefore a weight-based ensemble of weak classifiers.

The Boosting algorithm flow: initialize the sample weights uniformly; train a base learner on the weighted training set; increase the weights of the samples it misclassifies (and decrease the weights of those it classifies correctly); train the next base learner on the re-weighted data; repeat for a fixed number of rounds; and finally combine the base learners, each weighted according to its accuracy.

On top of the Boosting framework, algorithms such as AdaBoost (Adaptive Boosting), GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting), and LightGBM (Light Gradient Boosting Machine) have been proposed. The most representative of these is AdaBoost. Following the Boosting flow above, AdaBoost takes as its final classifier a linear combination of the base classifiers obtained over the iterations, each weighted by its classifier weight. The key question is how to obtain the weight-update formula, which is derived by minimizing the loss function of AdaBoost's base classifiers.

Derivation of the weight update:
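In summary (stating the standard AdaBoost result rather than reproducing the full derivation): with training data $(x_i, y_i)$, $y_i \in \{-1, +1\}$, base classifier $G_m$, weighted error rate $e_m$, and normalizing constant $Z_m$, minimizing the exponential loss gives

$$
\alpha_m = \frac{1}{2}\ln\frac{1-e_m}{e_m}, \qquad
w_{m+1,i} = \frac{w_{m,i}}{Z_m}\exp\bigl(-\alpha_m\, y_i\, G_m(x_i)\bigr), \qquad
Z_m = \sum_{i=1}^{N} w_{m,i}\exp\bigl(-\alpha_m\, y_i\, G_m(x_i)\bigr),
$$

and the final classifier is $G(x)=\operatorname{sign}\bigl(\sum_{m=1}^{M}\alpha_m G_m(x)\bigr)$. Misclassified samples thus have their weights increased, while correctly classified samples are down-weighted.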

The AdaBoost algorithm flow: initialize the sample weights uniformly; for m = 1, ..., M, fit the base classifier $G_m$ on the weighted data, compute its weighted error $e_m$ and its weight $\alpha_m$, and update the sample weights as above; finally output $\operatorname{sign}\bigl(\sum_m \alpha_m G_m(x)\bigr)$. A minimal sketch follows.
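The following is a minimal sketch of this loop (not the original post's code), using scikit-learn decision stumps as base classifiers and assuming labels in {-1, +1}; the dataset and parameter choices are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y - 1                                    # relabel {0, 1} -> {-1, +1}
n, M = X.shape[0], 50                            # sample size, boosting rounds

w = np.full(n, 1.0 / n)                          # start with uniform sample weights
stumps, alphas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)  # weighted error e_m
    alpha = 0.5 * np.log((1 - err) / err)        # classifier weight alpha_m
    w = w * np.exp(-alpha * y * pred)            # up-weight the misclassified samples
    w /= w.sum()                                 # normalize (divide by Z_m)
    stumps.append(stump)
    alphas.append(alpha)

# final classifier: sign of the weighted sum of the base classifiers
F = np.sum([a * s.predict(X) for a, s in zip(alphas, stumps)], axis=0)
print("training accuracy:", (np.sign(F) == y).mean())
```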

The AdaBoost family mainly addresses two-class problems, multi-class single-label problems, multi-class multi-label problems, single-label problems with a large number of classes, and regression problems. It is simple and efficient to implement and requires almost no hyperparameter tuning, but AdaBoost is very sensitive to noisy data and outliers: such abnormal samples may receive very large weights during the iterations and distort the final prediction. In addition, when the base classifiers are classification and regression trees, the method becomes a boosting tree, which is not covered here.

Bagging and Boosting are the two mainstream approaches to ensemble learning; both combine weak classifiers into a strong classifier.

In the previous passage, I talked about the concept of the Decision Tree and its use. Although it is a very powerful model that can handle both regression and classification tasks, a decision tree usually suffers from high variance. This means that if we split the dataset into two halves at random and fit a decision tree to each half, we will get quite different results. We therefore need an approach that reduces variance, possibly at the expense of a little bias.

Bagging, which is designed for exactly this purpose, is a procedure for reducing the variance of weak models. In bagging, a random sample of the training set is selected with replacement, which means individual data points can be chosen more than once, and a weak learner such as a decision tree is fit to each of these samples. Finally, we aggregate the predictions of the base learners to obtain a more accurate estimate.

We build B distinct bootstrapped training sets from the original training data, compute the prediction $\hat{f}^{b}(x)$ on each of the B training sets, and average them in order to obtain a single low-variance statistical model: $\hat{f}_{\mathrm{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{b}(x)$.

While bagging can reduce the variance of many models, it is particularly useful for decision trees. To apply it, we simply construct B separate bootstrapped training sets and train B individual decision trees on them. Each tree is grown deep and left unpruned, so it has high variance but low bias; averaging these trees then reduces the variance.

Bagging has three steps: bootstrapping, parallel training, and aggregating.

There are a number of key benefits of bagging, including a reduction in variance (and hence in overfitting), more stable predictions, training that is easy to parallelize, and a built-in estimate of the generalization error via the out-of-bag samples.

The key disadvantages of bagging are the loss of interpretability compared with a single model, the extra computational cost of training many base learners, and the fact that it does little to reduce bias when the base learner itself is strongly biased.

Now let's practice using bagging to improve the performance of a model. The scikit-learn Python machine learning library provides easy access to the bagging method.

First, we use the make_classification function to construct a classification dataset to practice bagging on.

Here, we make a binary classification dataset with 3000 observations and 30 input features, and hold out 25% of it as a test set.
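A plausible sketch of this step is shown below; the exact make_classification arguments and random seeds are assumptions, chosen so that the shapes match the printed output:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 3000 observations, 30 input features, binary target
X, y = make_classification(n_samples=3000, n_features=30, n_informative=15,
                           n_redundant=5, random_state=1)

# hold out 25% as a test set -> 2250 training rows and 750 test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```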

(2250, 30) (750, 30) (2250,) (750,)

To demonstrate the benefit of the bagging model, we first build a single decision tree and compare it with the bagging model.
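A sketch of the single-tree baseline, evaluated with repeated stratified k-fold cross-validation on the training split from above; the evaluation settings are assumptions, so the exact scores will differ from those quoted later:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

tree = DecisionTreeClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(tree, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
print('Decision tree accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
```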

Now we construct an ensemble model using the bagging technique.
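A matching sketch for the bagged ensemble; BaggingClassifier uses a decision tree as its default base estimator, and the choice of 100 estimators here is an assumption:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

bag = BaggingClassifier(n_estimators=100, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(bag, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
print('Bagging accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
```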

Based on the results, we can see that the ensemble model reduces both bias (higher accuracy) and variance (lower standard deviation): the bagging model's accuracy is 0.066 higher than that of a single decision tree.

Making Predictions

A BaggingClassifier can make predictions for new cases using the predict function.
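For example (a sketch continuing with the variables defined above; the specific test row is arbitrary):

```python
from sklearn.ensemble import BaggingClassifier

model = BaggingClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# predict the class of a new case (here simply the first row of the test set)
print(model.predict(X_test[[0]]))
```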

Next we build a bagging model for a regression problem. Similarly, we use the make_regression function to create a regression dataset.
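A sketch of this step; the dataset size, noise level, and seeds are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X_reg, y_reg = make_regression(n_samples=3000, n_features=30, n_informative=15,
                               noise=0.1, random_state=1)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg,
                                                        test_size=0.25, random_state=1)
```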

As before, we use repeated k-fold cross-validation to evaluate the model. One thing, however, differs from the classification case: scikit-learn's cross-validation scoring expects a utility function rather than a cost function, i.e. it assumes that greater is better, not smaller.

For this reason scikit-learn reports metrics such as neg_mean_squared_error as negative values, so that they can be maximized instead of minimized. A negative MSE closer to zero is therefore better, and we can simply negate the reported score to recover the usual MSE.
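A sketch of the evaluation for both the single regression tree and the bagging regressor, reusing the regression split from above; the cross-validation settings are assumptions, and the scores are negated when printed so that the usual (positive) MSE is reported:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score, RepeatedKFold

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
for name, model in [('Decision tree', DecisionTreeRegressor()),
                    ('Bagging', BaggingRegressor(n_estimators=100, random_state=1))]:
    scores = cross_val_score(model, Xr_train, yr_train,
                             scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
    # negate the (negative) scores to recover the usual mean squared error
    print('%s MSE: %.3f (%.3f)' % (name, -np.mean(scores), np.std(scores)))
```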

The mean squared error and its variance for the single decision tree are reported first.

The bagging regressor, on the other hand, performs much better than a single decision tree, with both a lower mean squared error and a lower variance: bagging reduces both bias and variance.

In this section, we explore how to tune the hyperparameters for the bagging model.

We demonstrate this by performing a classification task.

Recall that bagging works by building a number of bootstrapped samples and then fitting a weak learner to each of them. The number of models we build corresponds to the parameter n_estimators.

Generally, the number of estimators can be increased until the performance of the ensemble model converges, and it is worth noting that using a very large n_estimators does not lead to overfitting.

Now let's try different numbers of trees and examine the change in the performance of the ensemble model.
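A sketch of the kind of loop that produces the output below; the grid of values matches the printed results, while the cross-validation setup and seeds are assumptions, so the exact numbers may differ:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
results = {}
for n in [10, 50, 100, 200, 300, 500, 1000, 2000]:
    model = BaggingClassifier(n_estimators=n, random_state=1)
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy',
                             cv=cv, n_jobs=-1)
    results[n] = scores
    print('Number of Trees %d: %.3f %.3f' % (n, np.mean(scores), np.std(scores)))
```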

Number of Trees 10: 0.862 0.038

Number of Trees 50: 0.887 0.025

Number of Trees 100: 0.888 0.027

Number of Trees 200: 0.89 0.027

Number of Trees 300: 0.888 0.027

Number of Trees 500: 0.888 0.028

Number of Trees 1000: 0.892 0.027

Number of Trees 2000: 0.889 0.029

Let's look at the distribution of scores.
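For instance, a small matplotlib sketch (assuming the results dictionary collected in the loop above) compares the score distributions as box plots:

```python
import matplotlib.pyplot as plt

plt.boxplot(list(results.values()), showmeans=True)
plt.xticks(range(1, len(results) + 1), [str(n) for n in results])
plt.xlabel('n_estimators')
plt.ylabel('cross-validated accuracy')
plt.show()
```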

In this case, we can see that the performance of the bagging model converges to roughly 0.888 once about 100 trees are grown; the accuracy stays essentially flat beyond that point.

Now let's explore the number of samples drawn for each bootstrapped dataset. The default is to draw the same number of samples as there are in the original training set.
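A sketch of this experiment, varying max_samples as a fraction of the training-set size; the other settings are the same assumptions as before, so the exact numbers below may differ:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for ratio in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    model = BaggingClassifier(n_estimators=100, max_samples=ratio, random_state=1)
    scores = cross_val_score(model, X_train, y_train, scoring='accuracy',
                             cv=cv, n_jobs=-1)
    print('Sample ratio %.1f: %.3f %.3f' % (ratio, np.mean(scores), np.std(scores)))
```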

Sample ratio 0.1: 0.801 0.04

Sample ratio 0.2: 0.83 0.039

Sample ratio 0.3: 0.849 0.029

Sample ratio 0.4: 0.842 0.031

Sample ratio 0.5: 0.856 0.039

Sample ratio 0.6: 0.866 0.037

Sample ratio 0.7: 0.856 0.033

Sample ratio 0.8: 0.868 0.036

Sample ratio 0.9: 0.866 0.025

Sample ratio 1.0: 0.865 0.035

Similarly, let's look at the distribution of scores.

The rule of thumb is to set max_samples to 1.0, but this does not mean that every training observation ends up in a given bootstrapped sample. Since bootstrapping selects data from the training set at random with replacement, only about 63% of the training instances are sampled, on average, for each predictor, while the remaining 37% are not sampled and are therefore called out-of-bag instances.

Since a given predictor never sees its out-of-bag (oob) samples during training, it can be evaluated on these instances without the need for a separate validation set or cross-validation after training. We can enable out-of-bag evaluation in scikit-learn by setting oob_score=True.

Let's try using the out-of-bag score to evaluate a bagging model.
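A sketch, reusing the training split from above and again assuming 100 estimators:

```python
from sklearn.ensemble import BaggingClassifier

bag_oob = BaggingClassifier(n_estimators=100, oob_score=True, random_state=1)
bag_oob.fit(X_train, y_train)
# out-of-bag estimate of the generalization accuracy
print(bag_oob.oob_score_)
```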

According to this oob evaluation, the BaggingClassifier is likely to achieve about 87.6% accuracy on the test set. Let's verify this:
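A sketch of the check on the held-out test set, reusing the fitted model and the split from above; the resulting number will depend on the data and seeds:

```python
from sklearn.metrics import accuracy_score

y_pred = bag_oob.predict(X_test)
print(accuracy_score(y_test, y_pred))
```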

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features. They work in the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Each predictor is therefore trained on a random subset of the input features.

Random sampling of the features is particularly useful for high-dimensional inputs, such as images. Randomly sampling both features and instances is called Random Patches. Keeping all instances (bootstrap=False, max_samples=1.0) while sampling the features (bootstrap_features=True, max_features smaller than 1.0) is called Random Subspaces.

A random subspace ensemble is an extension of the bagging ensemble that is built on random subsets of the features in the training set. It is very similar to Random Forest, differing mainly in that each base learner is trained on one fixed random subset of the features (rather than drawing a new random feature subset at every split) and the training instances are not bootstrapped.

Sampling the features results in even more predictor diversity, trading a bit more bias for a lower variance. A minimal random-subspaces sketch with BaggingClassifier follows.
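This sketch keeps all training instances and samples half of the features for each predictor; the 0.5 fraction and the other settings are assumptions:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

subspace = BaggingClassifier(n_estimators=100,
                             bootstrap=False, max_samples=1.0,           # keep all training instances
                             bootstrap_features=True, max_features=0.5,  # sample the features
                             random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(subspace, X_train, y_train, scoring='accuracy',
                         cv=cv, n_jobs=-1)
print('Random subspaces accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
```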


