How can I fix my Naive Bayes method returning extremely small conditional probabilities? (Python)

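Overview

I'm trying to compute the probability that an email is spam using Naive Bayes. I have a document class that creates the documents (provided by the course website) and another class that trains on and classifies documents. My train function counts all the unique terms across all documents, all the documents in the spam class and all the documents in the non-spam class, and computes the prior probabilities (one for spam, one for ham). I then store each term's conditional probability in a dict, using the following formula:

condprob(t | c) = (Tct + 1) / (sum of Tct' + B')

Tct = number of occurrences of the term in the given class
sum of Tct' = number of terms in the given class
B' = number of unique terms in all documents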

classes = spam or ham
spam = spam, ham = not spam

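The problem is that when I use this formula in my code, it gives me very small conditional probabilities, such as 2.461114392596968e-05. I'm fairly sure this is because the values of Tct are very small (like 5 or 8) compared with the denominator values: the number of terms per class (64878 for ham, 308930 for spam) and B' (16386). I can't figure out how to get the condprob scores to something like .00034155, because I can only assume my condprob scores shouldn't be as exponentially small as they are. Is my calculation wrong? Should these values really be this small?
If it helps, my goal is to score a set of test documents with the scoring formula and get results such as 327.82, 758.80 or 138.66.

However, with my small condprob values I only get negative numbers.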
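As a sanity check on the numbers quoted above, here is a minimal sketch of the add-one smoothed estimate that the posted training code computes, (Tct + 1) / (TctTotal + LB). The function name `smoothed_condprob` is made up for illustration, and the counts are the ones reported in the question. A term that occurs once in ham reproduces the reported value, which suggests the small probabilities are simply what the formula yields for a vocabulary this large. Scientific notation is only a display format for the same number.

```python
import math

def smoothed_condprob(term_count, class_token_total, vocab_size):
    """Add-one smoothed estimate of P(term | class)."""
    return (term_count + 1) / (class_token_total + vocab_size)

# Counts reported in the question: 64878 ham tokens, 16386 unique terms.
p = smoothed_condprob(1, 64878, 16386)
print(p)            # shown in scientific notation
print(f"{p:.10f}")  # the same number in fixed-point notation
print(math.log(p))  # negative, because p is below 1
```

Since every such probability is below 1, every log term added to a score is negative, so negative scores are what this kind of sum produces.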

Code

- Creating documents

class document(object):
    """
    The instance variables are:
    filename....The path of the file for this document.
    label.......The true class label ('spam' or 'ham'), determined by whether
                the filename contains the string 'spmsg'
    tokens......A list of token strings.
    """
    def __init__(self, filename=None, label=None, tokens=None):
        """ Initialize a document either from a file, in which case the label
        comes from the file name, or from specified label and tokens, but not
        both.
        """
        if label:  # specify from label/tokens, for testing.
            self.label = label
            self.tokens = tokens
        else:  # specify from file.
            self.filename = filename
            self.label = 'spam' if 'spmsg' in filename else 'ham'
            self.tokenize()

    def tokenize(self):
        self.tokens = ' '.join(open(self.filename).readlines()).split()

- NaiveBayes

import math
from collections import defaultdict

class NaiveBayes(object):
    def train(self, documents):
        """
        Given a list of labeled document objects, compute the class priors and
        word conditional probabilities, following figure 13.2 of your
        book. Store these as instance variables, to be used by the classify
        method subsequently.
        Params:
          documents...A list of training documents.
        Returns:
          nothing.
        """
        ###TODO
        unique = []     # vocabulary: every distinct token seen
        proxy2 = []     # all ham tokens, with repeats
        proxy3 = []     # all spam tokens, with repeats
        condprob = [{}, {}]   # condprob[0] is ham, condprob[1] is spam
        Tct = defaultdict()   # term counts in ham
        Tc_t = defaultdict()  # term counts in spam
        prior = {}
        count = 0       # number of ham documents
        oldterms = []
        old_terms = []
        for a in range(len(documents)):
            done = False
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
                if documents[a].label == "ham":
                    proxy2.append(item)
                    if not done:
                        count += 1   # count each ham document once
                elif documents[a].label == "spam":
                    proxy3.append(item)
                done = True
        V = unique
        N = len(documents)
        print("N:", N)
        LB = len(unique)
        print("THIS IS LB:", LB)
        self.V = V
        print("THIS IS COUNT/NC", count)
        Nc = count
        prior["ham"] = Nc / N
        Nc = len(documents) - count
        print("THIS IS SPAM COUNT/NC", Nc)
        prior["spam"] = Nc / N
        self.prior = prior
        text2 = proxy2
        text3 = proxy3
        TctTotal = len(text2)
        Tc_tTotal = len(text3)
        print("THIS IS TCTOTAL", TctTotal)
        print("THIS IS TC_TTOTAL", Tc_tTotal)
        for term in text2:
            if term not in oldterms:
                Tct[term] = text2.count(term)
                oldterms.append(term)
        for term in text3:
            if term not in old_terms:
                Tc_t[term] = text3.count(term)
                old_terms.append(term)
        # Add-one smoothing; a term is stored only for classes it occurs in.
        for term in V:
            if term in text2:
                condprob[0].update({term: (Tct[term] + 1) / (TctTotal + LB)})
            if term in text3:
                condprob[1].update({term: (Tc_t[term] + 1) / (Tc_tTotal + LB)})
        print("This is condprob", condprob)
        self.condprob = condprob

    def classify(self, documents):
        """ Return a list of strings, either 'spam' or 'ham', for each document.
        Params:
          documents....A list of document objects to be classified.
        Returns:
          A list of label strings corresponding to the predictions for each document.
        """
        ###TODO
        # return list["string1", "string2", "stringn"]
        # docs2 = ham, condprob[0] is ham
        # docs3 = spam, condprob[1] is spam
        unique = []
        ans = []
        for a in range(len(documents)):
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
            W = unique
            # Scores are sums of log probabilities, so they are never positive.
            hscore = math.log(float(self.prior['ham']))
            sscore = math.log(float(self.prior['spam']))
            for t in W:
                try:
                    hscore += math.log(self.condprob[0][t])
                except KeyError:
                    continue
                try:
                    sscore += math.log(self.condprob[1][t])
                except KeyError:
                    continue
            print("THIS IS Sscore", sscore)
            print("THIS IS Hscore", hscore)
            unique = []
            # The higher score wins. The posted code had the two labels
            # swapped and stored the result in a variable named str.
            label = "ham" if hscore > sscore else "spam"
            ans.append(label)
        return ans

- Testing

import glob
import os

if not os.path.exists('train'):  # download data
    from urllib.request import urlretrieve
    import tarfile
    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()

train_docs = [document(filename=f) for f in glob.glob("train/*.txt")]
test_docs = [document(filename=f) for f in glob.glob("test/*.txt")]
test = train_docs
nb = NaiveBayes()
nb.train(train_docs[1500:])
# uncomment when testing classify()
# predictions = nb.classify(test_docs[:200])
# print("PREDICTIONS", predictions)

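The end goal is to classify documents as spam or ham, but I want to sort out the conditional probability issue first.

Question
Should the conditional probability values be this small? If so, why do I get strange scores from classification? If not, how do I fix my code so that it gives me correct condprob values?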

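Values
The condprob values I currently get look like this:
'Tradition': 2.461114392596968e-05, 'fillmore': 2.461114392596968e-05, '796': 2.461114392596968e-05, 'zann': 2.461114392596968e-05
condprob is a list containing two dictionaries, the first for ham and the second for spam. Each dictionary maps a term to its conditional probability. I would like to have "normal" small values such as .00031235 rather than 3.1235e-05.
The reason is that when I run the condprob values and some test documents through the classify method, I get scores like
THIS IS Hscore -2634.5292392650663, THIS IS Sscore -1707.983339196181
when they should look like
THIS IS Hscore 327.82, THIS IS Sscore 758.80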

Running time
~1 minute 30 seconds
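The running time is most likely dominated by the `item not in unique` list lookups and the repeated `list.count` calls in the training code, each of which scans an entire list. Below is a minimal sketch of the same counting done with `collections.Counter`. The function `train_counts` and its `(label, tokens)` input format are hypothetical and not part of the posted code. Unlike the posted code, this sketch also assigns a smoothed probability to every vocabulary term in both classes, which is an assumption about the intended behaviour.

```python
from collections import Counter

def train_counts(labeled_docs):
    """labeled_docs: iterable of (label, tokens) pairs, label is 'ham' or 'spam'."""
    term_counts = {"ham": Counter(), "spam": Counter()}
    doc_counts = Counter()
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        term_counts[label].update(tokens)   # one pass, no list scans

    vocab = set(term_counts["ham"]) | set(term_counts["spam"])
    n_docs = sum(doc_counts.values())
    prior = {c: doc_counts[c] / n_docs for c in term_counts}
    condprob = {}
    for c, counts in term_counts.items():
        denom = sum(counts.values()) + len(vocab)
        # A Counter returns 0 for unseen terms, so smoothing covers them too.
        condprob[c] = {t: (counts[t] + 1) / denom for t in vocab}
    return prior, condprob

# Tiny made-up corpus, for illustration only.
docs = [("ham", ["hello", "meeting", "hello"]),
        ("spam", ["free", "money", "hello"])]
prior, condprob = train_counts(docs)
print(prior)
print(condprob["spam"]["meeting"])
```

With smoothing applied over the whole vocabulary, the conditional probabilities of each class sum to 1.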

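Solution

(You seem to be working with log probabilities, which is very sensible, but I will write most of what follows in terms of raw probabilities, which you can get by taking the exponential of the log probabilities. This makes the algebra easier, even though in practice you would probably get numerical underflow if you did not use logs.)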
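To illustrate the underflow mentioned above, here is a minimal sketch. Multiplying many probabilities of the size reported in the question collapses to 0.0 in double precision, while the sum of their logs stays finite. The figure of 500 terms is an arbitrary choice made for illustration.

```python
import math

p = 2.461114392596968e-05   # a condprob value from the question
n_terms = 500               # arbitrary document length, for illustration only

raw_product = 1.0
log_sum = 0.0
for _ in range(n_terms):
    raw_product *= p         # underflows to 0.0 well before the loop ends
    log_sum += math.log(p)   # stays a finite negative number

print(raw_product)
print(log_sum)
```

The log of a probability is never positive, so a score built by summing logs is expected to be negative.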

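As far as I can tell from your code, you start with the prior probabilities p(Ham) and p(Spam), and then use probabilities estimated from previous data to work out p(Ham) * p(Obs | Ham) and p(Spam) * p(Obs | Spam), where Obs is the observed data.

Bayes' theorem rearranges p(Obs | Spam) = p(Obs & Spam) / p(Spam) = p(Obs) p(Spam | Obs) / p(Spam) to give p(Spam | Obs) = p(Spam) p(Obs | Spam) / p(Obs). You seem to have calculated p(Spam) p(Obs | Spam) = p(Obs & Spam) but not divided by p(Obs). Since Ham and Spam are the only two possibilities, the easiest thing is probably to note that p(Obs) = p(Obs & Spam) + p(Obs & Ham), and so just divide each of your two calculated values by their sum, essentially scaling the values so that they do sum to 1.0.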
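The division by p(Obs) can be sketched in a few lines for raw probabilities. The function name `normalize` and the two joint values are made up for illustration.

```python
def normalize(joint_spam, joint_ham):
    """Turn p(Obs & Spam) and p(Obs & Ham) into posteriors that sum to 1."""
    evidence = joint_spam + joint_ham   # p(Obs), as only two classes exist
    return joint_spam / evidence, joint_ham / evidence

# Made-up joint probabilities, for illustration only.
post_spam, post_ham = normalize(3e-7, 1e-7)
print(post_spam, post_ham)
```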

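This scaling is trickier if you start from log probabilities lA and lB. To scale these, I would first bring them into range by scaling both by a rough value in log space, that is, by subtracting the larger of the two from each. Compute the maximum once, before either value is changed:

m = max(lA,lB)

lA = lA - m

lB = lB - m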

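Now at least the larger of the two will not overflow. The smaller one may still underflow, but I would rather deal with underflow than overflow. Now turn them into not-quite-scaled probabilities: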

pA = exp(lA)

pB = exp(lB)

and scale them properly so that they add up to one:

truePA = pA / (pA + pB)

truePB = pB / (pA + pB)
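The steps above can be put together as a minimal sketch, applied here to the two scores printed in the question. The function name `posteriors_from_logs` is made up for illustration.

```python
import math

def posteriors_from_logs(l_a, l_b):
    """Normalize two log joint scores into probabilities that sum to 1."""
    m = max(l_a, l_b)          # compute the maximum once, before shifting
    p_a = math.exp(l_a - m)    # the larger score becomes exp(0) == 1.0
    p_b = math.exp(l_b - m)    # the smaller may underflow to 0.0, harmlessly
    total = p_a + p_b
    return p_a / total, p_b / total

# Scores printed in the question (Hscore, Sscore).
p_ham, p_spam = posteriors_from_logs(-2634.5292392650663, -1707.983339196181)
print(p_ham, p_spam)
```

For these two scores the gap is so large that the ham posterior rounds to 0.0 and the spam posterior to 1.0.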
