Contents
Introduction
Model
  knn
  naive bayes
Data
Implementation with Python
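Introduction
kNN is one of the most common and simplest nonparametric machine learning methods. It makes no assumption about the data generating process (DGP), so it applies to most settings. In my experience, however, kNN does not do well on high-dimensional data and overfits easily. Naive Bayes is a parametric method and comes with model assumptions. I put the two together because both are among the most classic and simplest methods in machine learning.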
Model
Suppose we have data $(x_i, y_i)_{i=1}^n$, where $x_i$ is the explanatory variable and $y_i$ is the true discrete label, taking one of $p$ classes.
knn
$$\operatorname{pr}(y=r \mid x=x_0)=\frac{\#\{i: y_i=r,\ x_i\in \operatorname{knn}(x_0,d)\}}{k}.$$
Here $\operatorname{knn}(x_0,d)$ is the set of the $k$ points $x_i$ in the data $(x_i)_{i=1}^n$ that are closest to $x_0$ under the distance $d(x_0,x_i)$. The algorithm is simple: for any point, compute the distance from every training point to that point, pick the $k$ nearest ones, and compute the class shares among them.
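The formula above can be sketched in a few lines of NumPy. This is only an illustration: the function name `knn_class_probs`, the toy data, and the choice of Euclidean distance for $d$ are assumptions, not part of the original post.

```python
import numpy as np

def knn_class_probs(x_train, y_train, x0, k, classes):
    """Estimate pr(y = r | x = x0) as the share of class r among the k nearest points."""
    # Euclidean distance from x0 to every training point (the choice of d is an assumption)
    dists = np.linalg.norm(x_train - x0, axis=1)
    # Indices of the k nearest training points; a stable sort breaks ties by index
    nearest = np.argsort(dists, kind="stable")[:k]
    # Share of each label among the k neighbours, i.e. count / k
    return {r: float(np.mean(y_train[nearest] == r)) for r in classes}

# Hypothetical toy data: two points of class "a" near the origin, three of class "b" near (1, 1)
x_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.2]])
y_train = np.array(["a", "a", "b", "b", "b"])

probs = knn_class_probs(x_train, y_train, np.array([0.2, 0.1]), k=3, classes=["a", "b"])
print(probs)
```

With $k=3$, the three nearest neighbours of the query point are the two "a" points and one "b" point, so the estimated probabilities are 2/3 and 1/3.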
naive bayes
$$\operatorname{pr}(y=r \mid x=x_0)=\frac{\pi_r f_r(x_0)}{\sum_{k=1}^{p}\pi_k f_k(x_0)}.$$
This is Bayes' formula, so $\pi_k=\operatorname{pr}(y=k)$, which can be estimated by $\widehat{\pi}_k=\frac{\#\{i: y_i=k\}}{n}$. $f_k(x)$ is the density of $x \mid y=k$, and its form has to be specified. The "naive" part is the assumption that the coordinates of $x$ are conditionally independent given the class, so a univariate density is applied to each coordinate and the results are multiplied. If we take the Gaussian density
$$f_k(x)=f_k(x,\mu_k,\sigma_k)=\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right),$$
the model is called Gaussian naive Bayes. If we take $f_k(x)$ to be the multinomial distribution, i.e.
$$f_k(x)=f_k(x,\theta_k)=\frac{\left(\sum_{i=1}^{K}x_i\right)!}{\prod_{i=1}^{K}x_i!}\prod_{i=1}^{K}(\theta_{ki})^{x_i},$$
where $K=\operatorname{dim}(x)$, the model is called multinomial naive Bayes. Once the parametric distribution is specified, we can train the model on the training set.
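As a rough sketch of how the multinomial case works on count data, the snippet below estimates $\pi_k$ and $\theta_k$ and then applies Bayes' formula in log space. The function names, the toy counts, and the additive smoothing constant `alpha` (used here only to avoid zero probabilities) are assumptions made for illustration.

```python
import numpy as np

def fit_multinomial_nb(x, y, alpha=1.0):
    """Estimate class priors pi_k and multinomial parameters theta_k from count data."""
    classes = np.unique(y)
    # pi_k: fraction of observations with label k
    priors = np.array([np.mean(y == k) for k in classes])
    # Per-class feature totals, smoothed by alpha (an assumption, avoids log(0))
    counts = np.array([x[y == k].sum(axis=0) for k in classes]) + alpha
    # Normalise each row so that theta_k sums to one over the K features
    theta = counts / counts.sum(axis=1, keepdims=True)
    return classes, priors, theta

def posterior(x0, priors, theta):
    """Bayes formula pi_r f_r(x0) / sum_k pi_k f_k(x0), computed in log space."""
    # The multinomial coefficient is identical for every class, so it cancels
    log_joint = np.log(priors) + x0 @ np.log(theta).T
    log_joint -= log_joint.max()  # stabilise before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()

# Hypothetical word counts: 4 documents, 3 vocabulary terms, 2 classes
x = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 1]])
y = np.array([0, 0, 1, 1])

classes, priors, theta = fit_multinomial_nb(x, y, alpha=1.0)
post = posterior(np.array([1, 0, 0]), priors, theta)
print(dict(zip(classes.tolist(), post.round(4).tolist())))
```

A document containing only the first term is assigned to class 0 with high probability, because that term appears only in class 0 documents in the toy data.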
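Data
We study a text dataset, the smoker's helpline dataset. It comes from the smokers' helpline centre at the University of Waterloo. The centre collected information on callers from Canada who intended to quit smoking, and called them back six months later to ask "What helped you the most in trying to quit [smoking]?". Each response (the text data) was recorded and assigned to one of more than 20 categories. Here, our goal is to classify the texts based on these responses.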
Implementation with Python
Import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB
import itertools
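Load the data. Here we only use the observations from 2006 for our goal, which is to train classifiers and evaluate them on a test set. There are 1175 observations in total: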
# load data
df_smoker = pd.read_csv("smokerhelpline.csv", sep=',', engine='python')
df_2006 = df_smoker[df_smoker.year_start == 2006]
df_2006.head(10)
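Of course, we first need to process the text and turn it into features and corresponding values that a model can use. The features are n-grams, and the value of each can be the number of times that gram occurs in the text, or it can be computed with tf-idf. For more customized processing you can use the nltk library, which contains many text-processing functions for tokenizing, stemming, removing stopwords, lemmatization, removing punctuation, and so on. Here we simply use CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text to convert the text into the form we want: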
docs_2006 = df_2006["text"]
# vectorization
vectorizer_count = CountVectorizer(stop_words = "english", max_features = 600)
vectorizer_tfidf = TfidfVectorizer(stop_words = "english", max_features = 600)
# fit text
vectorizer_count.fit(docs_2006)
vectorizer_tfidf.fit(docs_2006)
# encode document
x = vectorizer_count.transform(docs_2006).toarray()
y = df_2006["code"]
# x = vectorizer_tfidf.transform(docs_2006)
# summarize encoded vector
print('shape: ', x.shape)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
results = pd.DataFrame(x, columns=vectorizer_count.get_feature_names_out())
results.head(10)
CountVectorizer has many parameters. For example, you can pass your own tokenizer (which may include a stemmer), and you can set max_features, which I set to 600 here. Now that we have both $x$ and $y$, we split them into train data and test data:
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
We consider three models: 3-NN, GaussianNB and MultinomialNB:
n_neighbors = 3
# knn
model_knn = neighbors.KNeighborsClassifier(n_neighbors, weights = "distance", metric = 'hamming')
# naive Bayes
model_gaussnb = GaussianNB()
model_multinb = MultinomialNB(alpha = 0.2)
# fit model
model_knn.fit(x_train, y_train)
model_gaussnb.fit(x_train, y_train)
model_multinb.fit(x_train, y_train)
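The kNN model above uses `metric = 'hamming'`. The Hamming distance between two vectors is the fraction of coordinates in which they differ, so on count vectors it only looks at whether the counts are equal, not at how far apart they are. A minimal sketch, with a hypothetical helper `hamming` and made-up count vectors:

```python
import numpy as np

def hamming(u, v):
    """Fraction of coordinates at which the two vectors differ."""
    return float(np.mean(np.asarray(u) != np.asarray(v)))

# Hypothetical count vectors over a vocabulary of four terms
a = np.array([1, 0, 0, 2])
b = np.array([3, 0, 0, 2])  # differs from a in the first term (count 1 vs 3)
c = np.array([1, 1, 0, 2])  # differs from a in the second term (count 0 vs 1)

d_ab = hamming(a, b)
d_ac = hamming(a, c)
print(d_ab, d_ac)
```

Both distances equal 0.25: a count of 1 versus 3 is treated the same as 0 versus 1, which is worth keeping in mind when judging the kNN results.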
MultinomialNB estimates the $j$-th element $\theta_{kj}$ of $\theta_k$ with the formula
$$\hat{\theta}_{kj}(\alpha)=\frac{\sum_{y_i=k}x_{ji}+\alpha}{\sum_{j=1}^{K}\sum_{y_i=k}x_{ji}+K\alpha},$$
where $x_{ji}$ is the $j$-th feature of observation $i$ and $K$ is the number of features, so $\alpha$ has to be specified by the user. After training the models on the train data, we evaluate their performance on the test data:
# model evaluation
models = [("knn", model_knn), ("gaussiannb", model_gaussnb), ("multinomialnb", model_multinb)]
for name, model in models:
    # score() returns the mean accuracy; report it as a percentage
    print(f"Accuracy of {name} on train data:", np.round(model.score(x_train, y_train)*100, 2), "%")
    print(f"Accuracy of {name} on test data:", np.round(model.score(x_test, y_test)*100, 2), "%")
Accuracy of knn on train data: 98.09 %
Accuracy of knn on test data: 62.55 %
Accuracy of gaussiannb on train data: 56.91 %
Accuracy of gaussiannb on test data: 34.89 %
Accuracy of multinomialnb on train data: 87.55 %
Accuracy of multinomialnb on test data: 70.21 %
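We can see that, with the parameters we chose, kNN overfits severely and performs only moderately. (Part of its high train accuracy is by construction: with weights = "distance", each training point is its own nearest neighbour at distance zero.) GaussianNB performs very poorly, and MultinomialNB performs best, reaching about 70% accuracy on the test data. We can plot the confusion matrix: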
# predict y_test and calculate confusion matrix
y_pred_knn = model_knn.predict(x_test)
cm_knn = confusion_matrix(y_test, y_pred_knn)
print("confusion matrix for knn prediction:")
print(cm_knn)
# multinomial nb
y_pred_multinb = model_multinb.predict(x_test)
cm_multinb = confusion_matrix(y_test, y_pred_multinb)
# define cm-heatmap function
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Oranges, font_size = 5):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=font_size)
    plt.colorbar(fraction=0.046, pad=0.04)
    tick_marks = np.arange(cm.shape[1])
    plt.xticks(tick_marks, fontsize=font_size)
    # ax = plt.gca()
    # ax.set_xticklabels((ax.get_xticks() +1).astype(str))
    plt.yticks(tick_marks, fontsize=font_size)
    thresh = cm.max() / 2.
    # write each count into its cell, in white on dark cells and black on light ones
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j]),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black",
                 fontsize=font_size)
    plt.tight_layout()
    plt.ylabel('True label', fontsize=font_size)
    plt.xlabel('Predicted label', fontsize=font_size)
plt.subplots(figsize = (10,10))
print('Confusion matrix for multinomialnb:')
plot_confusion_matrix(cm_multinb, font_size = 15)
confusion matrix for knn prediction:
[[ 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 1 0 0 3 0 0 0 0 5 0 0 0 0 0 0 0 0 0]
 [ 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 5 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0]
 [ 0 0 1 0 0 9 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 2 0]
 [ 0 0 1 0 0 1 9 0 0 0 0 0 0 0 4 0 0 0 0 0 0 1 0 0]
 [ 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 2 0 0 0 0 0 6 0 0 0 0 1 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 4 1 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 1 0 0 0 0 0 0 1 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 11 0 0 0 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 23 0 0 0 0 0 0 0 0 0 0]
 [ 1 0 0 0 0 1 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 1 1 0 0 0 0 0 0 0 0 5 5 0 0 0 0 0 2 2 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [ 0 0 1 0 0 1 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 1 0 0 0 6 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 3 0 0 0 11 0 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 1 0 0 0 0 0 0 32 0]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 8]]
Confusion matrix for multinomialnb:
[Figure: heatmap of the confusion matrix for multinomialnb]
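A confusion matrix can be read back into the summary numbers reported earlier. In the scikit-learn convention rows are true labels and columns are predicted labels, so accuracy is the diagonal sum divided by the total. For the kNN matrix printed above, the diagonal sums to 147 out of 235 test observations, and 147/235 ≈ 62.55%, which matches the reported test accuracy. The sketch below shows the same computation on a small made-up matrix; the helper name `accuracy_and_recall` and the toy values are assumptions.

```python
import numpy as np

def accuracy_and_recall(cm):
    """Overall accuracy and per-class recall from a confusion matrix (rows = true labels)."""
    cm = np.asarray(cm, dtype=float)
    # Correct predictions sit on the diagonal
    accuracy = np.trace(cm) / cm.sum()
    row_totals = cm.sum(axis=1)
    # Classes with no test observations get NaN recall instead of a division error
    recall = np.divide(np.diag(cm), row_totals,
                       out=np.full(len(cm), np.nan), where=row_totals > 0)
    return accuracy, recall

# Hypothetical 3-class confusion matrix; the third class never occurs in the test set
cm_toy = np.array([[5, 1, 0],
                   [2, 3, 1],
                   [0, 0, 0]])

acc, rec = accuracy_and_recall(cm_toy)
print(round(acc, 4), rec)
```

Per-class recall is useful here because several rows of the kNN matrix have a zero on the diagonal, meaning that class was never predicted correctly even though overall accuracy looks moderate.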
Of course, none of these methods performs particularly well. Later we will try SVM, tree-based methods and deep learning on this dataset.