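scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes for this situation. A support vector machine (SVM) would probably work better, though.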
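To make the SVM remark concrete, here is a minimal sketch of what such a classifier could look like. It uses scikit-learn directly (`DictVectorizer` plus `LinearSVC`) rather than any NLTK wrapper, and the four tiny training documents are made up purely for illustration, so treat it as a shape to adapt, not a tuned model.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: word-count dicts with a sentiment label.
train = [
    ({"great": 2, "fun": 1}, "pos"),
    ({"loved": 1, "great": 1}, "pos"),
    ({"boring": 2, "bad": 1}, "neg"),
    ({"awful": 1, "bad": 2}, "neg"),
]
X = [feats for feats, _ in train]
y = [label for _, label in train]

svm_pipeline = Pipeline([
    ("vec", DictVectorizer()),       # word-count dicts -> sparse matrix
    ("tfidf", TfidfTransformer()),   # reweight counts by TF-IDF
    ("svm", LinearSVC()),            # linear SVM in place of naive Bayes
])
svm_pipeline.fit(X, y)

# Classify two unseen (also made-up) documents.
predictions = svm_pipeline.predict([
    {"great": 1, "fun": 2},
    {"bad": 1, "boring": 1},
])
print(list(predictions))
```

Whether the SVM actually beats naive Bayes on real reviews is something you would have to measure on held-out data; I haven't done that here.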
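As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here's a somewhat involved one that does TF-IDF weighting, chooses the 1000 best features based on a chi2 statistic, and then passes the result to a multinomial naive Bayes classifier. (I bet this is somewhat clumsy, as I'm not super familiar with either NLTK or scikit-learn.)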
    import numpy as np
    from nltk.probability import FreqDist
    from nltk.classify import SklearnClassifier
    from nltk.corpus import movie_reviews  # needs nltk.download('movie_reviews')
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # TF-IDF weighting -> keep the 1000 best features by chi2 -> naive Bayes
    pipeline = Pipeline([('tfidf', TfidfTransformer()),
                         ('chi2', SelectKBest(chi2, k=1000)),
                         ('nb', MultinomialNB())])
    classif = SklearnClassifier(pipeline)

    # One word-frequency distribution per review
    pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
    neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]

    def add_label(lst, lab):
        return [(x, lab) for x in lst]

    # Train on the first 100 reviews of each class
    classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))

    # Classify the remaining reviews
    l_pos = np.array(classif.classify_many(pos[100:]))
    l_neg = np.array(classif.classify_many(neg[100:]))
    print("Confusion matrix:\n%d\t%d\n%d\t%d" % (
        (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
        (l_neg == 'pos').sum(), (l_neg == 'neg').sum()))
This printed for me:
    Confusion matrix:
    524     376
    202     698
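Not perfect, but decent, considering it's not a super easy problem and the classifier was only trained on 100 positive and 100 negative reviews.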
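To put a number on "decent", the four counts can be turned into accuracy and per-class recall. This is just arithmetic on the figures printed above, with rows taken as the true class and columns as the predicted class (which is how the print statement lays them out); the variable names are my own.

```python
# Counts copied from the confusion matrix above.
# Rows: true class (pos, neg). Columns: predicted class (pos, neg).
pos_as_pos, pos_as_neg = 524, 376
neg_as_pos, neg_as_neg = 202, 698

total = pos_as_pos + pos_as_neg + neg_as_pos + neg_as_neg

# Fraction of all test reviews classified correctly.
accuracy = (pos_as_pos + neg_as_neg) / total

# Fraction of each true class that was recovered.
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)
recall_neg = neg_as_neg / (neg_as_pos + neg_as_neg)

print(f"test reviews: {total}")           # 1800
print(f"accuracy:     {accuracy:.3f}")    # 0.679
print(f"recall (pos): {recall_pos:.3f}")  # 0.582
print(f"recall (neg): {recall_neg:.3f}")  # 0.776
```

So roughly 68% of the 1800 held-out reviews are classified correctly, and the classifier is noticeably better at recognizing negative reviews than positive ones.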