Tct = the number of occurrences of a term in the given class
Tct' = the number of terms in the given class
B' = the number of unique terms in all documents
classes = spam or ham
spam = spam, ham = not spam
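The formula image itself did not survive, but these definitions match the Laplace-smoothed estimate from Figure 13.2 of the book the code below cites, so presumably the formula in question is:

P(term | class) = (Tct + 1) / (ΣTct' + B')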
The problem is that when I use this formula in my code, it gives me extremely small conditional probability scores, such as 2.461114392596968e-05. I'm fairly sure this is because the Tct values are very small (like 5 or 8) compared to the denominator values of Tct' (64878 for ham and 308930 for spam) and B' (16386). I can't figure out how to get the condprob scores up to something like .00034155, and I can only assume my condprob scores aren't supposed to be as exponentially small as they are. Are my calculations wrong, or are the values actually supposed to be this small?
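A quick sanity check with the ham numbers quoted above shows what magnitude the formula actually produces for a term with Tct = 5:

(5 + 1) / (64878 + 16386) = 6 / 81264 ≈ 7.4e-05

so values on the order of e-05 are exactly what the smoothed estimate yields for rare terms with a denominator this large.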
In case it helps, my goal is to score a set of test documents and get results like 327.82, 758.80, or 138.66 using this formula.
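The image of the scoring formula did not survive either; from Figure 13.2 of the same book, the rule being applied is presumably

score(c, d) = log P(c) + Σk log P(tk | c)

Note that every condprob is a probability below 1, so each log term is negative, and a sum of negative terms is itself negative.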
However, with my small condprob values I only get negative numbers.
Code
- Creating documents
# Imports needed by the snippets below.
from collections import defaultdict
import glob
import math
import os

class document(object):
    """The instance variables are:
    filename....The path of the file for this document.
    label.......The true class label ('spam' or 'ham'), determined by
                whether the filename contains the string 'spmsg'.
    tokens......A list of token strings.
    """
    def __init__(self, filename=None, label=None, tokens=None):
        """Initialize a document either from a file, in which case the label
        comes from the file name, or from a specified label and tokens, but
        not both."""
        if label:  # Specify from label/tokens, for testing.
            self.label = label
            self.tokens = tokens
        else:  # Specify from file.
            self.filename = filename
            self.label = 'spam' if 'spmsg' in filename else 'ham'
            self.tokenize()

    def tokenize(self):
        self.tokens = ' '.join(open(self.filename).readlines()).split()
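A quick illustration of the label/tokens construction path (the tokens here are made up, so no file is needed):

d = document(label='spam', tokens=['free', 'money'])
print(d.label, d.tokens)  # spam ['free', 'money']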
- NaiveBayes
class NaiveBayes(object):
    def train(self, documents):
        """Given a list of labeled document objects, compute the class priors
        and word conditional probabilities, following Figure 13.2 of your
        book. Store these as instance variables, to be used by the classify
        method subsequently.
        Params:
          documents...A list of training documents.
        Returns: nothing.
        """
        ### TODO
        unique = []
        proxy = []
        proxy2 = []   # all tokens seen in ham documents
        proxy3 = []   # all tokens seen in spam documents
        condprob = [{}, {}]
        Tct = defaultdict()
        Tc_t = defaultdict()
        prior = {}
        count = 0     # number of ham documents
        oldterms = []
        old_terms = []
        for a in range(len(documents)):
            done = False
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
                if documents[a].label == "ham":
                    proxy2.append(item)
                    if done == False:
                        count += 1
                elif documents[a].label == "spam":
                    proxy3.append(item)
                done = True
        V = unique
        N = len(documents)
        print("N:", N)
        LB = len(unique)
        print("THIS IS LB:", LB)
        self.V = V
        print("THIS IS COUNT/NC", count)
        Nc = count
        prior["ham"] = Nc / N
        self.prior = prior
        Nc = len(documents) - count
        print("THIS IS SPAM COUNT/NC", Nc)
        prior["spam"] = Nc / N
        self.prior = prior
        text2 = proxy2
        text3 = proxy3
        TctTotal = len(text2)    # total ham tokens (the ham denominator)
        Tc_tTotal = len(text3)   # total spam tokens (the spam denominator)
        print("THIS IS TCTOTAL", TctTotal)
        print("THIS IS TC_TTOTAL", Tc_tTotal)
        for term in text2:
            if term not in oldterms:
                Tct[term] = text2.count(term)
                oldterms.append(term)
        for term in text3:
            if term not in old_terms:
                Tc_t[term] = text3.count(term)
                old_terms.append(term)
        for term in V:
            if term in text2:
                condprob[0].update({term: (Tct[term] + 1) / (TctTotal + LB)})
            if term in text3:
                condprob[1].update({term: (Tc_t[term] + 1) / (Tc_tTotal + LB)})
        print("This is condprob", condprob)
        self.condprob = condprob

    def classify(self, documents):
        """Return a list of strings, either 'spam' or 'ham', for each document.
        Params:
          documents....A list of document objects to be classified.
        Returns:
          A list of label strings corresponding to the predictions for each
          document.
        """
        ### TODO
        # return ["string1", "string2", "stringn"]
        # docs2 = ham, condprob[0] is ham
        # docs3 = spam, condprob[1] is spam
        unique = []
        ans = []
        hscore = 0
        sscore = 0
        for a in range(len(documents)):
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
            W = unique
            hscore = math.log(float(self.prior['ham']))
            sscore = math.log(float(self.prior['spam']))
            for t in W:
                # Use pass rather than continue: a term missing from the ham
                # dictionary should not also skip the spam lookup.
                try:
                    hscore += math.log(self.condprob[0][t])
                except KeyError:
                    pass
                try:
                    sscore += math.log(self.condprob[1][t])
                except KeyError:
                    pass
            print("THIS IS Sscore", sscore)
            print("THIS IS Hscore", hscore)
            unique = []
            # hscore is the ham score, so the larger score picks its own label.
            if hscore > sscore:
                label = "Ham"
            elif sscore > hscore:
                label = "Spam"
            ans.append(label)
        return ans
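To make the smoothed estimate concrete, here is a toy run of train with two made-up single-class documents (not from the original post):

docs = [document(label='ham', tokens=['hello', 'world']),
        document(label='spam', tokens=['free', 'money'])]
nb = NaiveBayes()
nb.train(docs)
print(nb.condprob[0]['hello'])  # (1 + 1) / (2 + 4) = 0.333..., i.e. 2 ham tokens + 4 vocabulary terms in the denominator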
- Testing
if not os.path.exists('train'):  # download data
    from urllib.request import urlretrieve
    import tarfile
    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()

train_docs = [document(filename=f) for f in glob.glob("train/*.txt")]
test_docs = [document(filename=f) for f in glob.glob("test/*.txt")]
test = train_docs
nb = NaiveBayes()
nb.train(train_docs[1500:])
# uncomment when testing classify()
# predictions = nb.classify(test_docs[:200])
# print("PREDICTIONS", predictions)
The end goal is to be able to classify documents as spam or ham, but I want to sort out the conditional probability problem first.
The question
Should the conditional probability values be this small? If so, why am I getting strange scores from classify? If not, how do I fix my code to give me the correct condprob values?
Values
The condprob values I'm currently getting look like this:
'Tradition': 2.461114392596968e-05, 'fillmore': 2.461114392596968e-05, '796': 2.461114392596968e-05, 'zann': 2.461114392596968e-05
condprob is a list containing two dictionaries, the first for ham and the second for spam. Each dictionary maps a term to its conditional probability. I want to have "normal" small values such as .00031235 instead of 3.1235e-05.
The reason is that when I run the condprob values through my classify method with some test documents, I get scores like
THIS IS Hscore -2634.5292392650663, THIS IS Sscore -1707.983339196181
when they should look something like
THIS IS Hscore 327.82, THIS IS Sscore 758.80
Runtime
~1 min 30 sec
Answer

From your code, you start with the prior probabilities p(Ham) and p(Spam), and then use probabilities estimated from the training data to compute p(Ham) * p(observed data | Ham) and p(Spam) * p(observed data | Spam).
Bayes' theorem rearranges p(Obs | Spam) = p(Obs & Spam) / p(Spam) = p(Obs) p(Spam | Obs) / p(Spam) into p(Spam | Obs) = p(Spam) p(Obs | Spam) / p(Obs). You seem to have computed p(Spam) p(Obs | Spam) = p(Obs & Spam), but not divided by p(Obs). Since Ham and Spam are the only two possibilities, the easiest fix is to note that p(Obs) = p(Obs & Spam) + p(Obs & Ham), so just divide each of your two computed values by their sum, essentially scaling them so that they really do sum to 1.0.
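A minimal sketch of that normalization, with made-up joint values standing in for the classifier's real outputs:

# Hypothetical unnormalized values; in the real classifier these would be
# prior * product-of-condprobs for each class.
p_obs_and_ham = 2.0e-12
p_obs_and_spam = 6.0e-12

p_obs = p_obs_and_ham + p_obs_and_spam  # Ham and Spam are the only classes

p_ham_given_obs = p_obs_and_ham / p_obs
p_spam_given_obs = p_obs_and_spam / p_obs
print(p_ham_given_obs, p_spam_given_obs)  # 0.25 0.75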
This scaling is trickier if you start from log probabilities lA and lB. To scale these, first bring them into a safe range while they are still logs, by subtracting the maximum:
lA = lA - max(lA, lB)
lB = lB - max(lA, lB)
Now at least the larger of the two will not overflow. The smaller still might underflow, but I'd rather deal with underflow than overflow. Now turn them into not-quite-scaled probabilities:
pA = exp(lA)
pB = exp(lB)
and scale them properly so that they sum to one:
truePA = pA / (pA + pB)
truePB = pB / (pA + pB)
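A minimal runnable sketch of this log-space recipe (posteriors_from_logs is a hypothetical helper name), computing the shared max once so both logs are shifted by the same amount:

import math

def posteriors_from_logs(lA, lB):
    # Subtract the shared max first; the larger log becomes 0, so exp()
    # cannot overflow. The smaller may underflow to 0.0, which is acceptable.
    m = max(lA, lB)
    pA = math.exp(lA - m)
    pB = math.exp(lB - m)
    total = pA + pB
    return pA / total, pB / total

# With the Hscore/Sscore values printed in the question:
ham_p, spam_p = posteriors_from_logs(-2634.5292392650663, -1707.983339196181)
print(ham_p, spam_p)  # ham underflows to 0.0, spam comes out as 1.0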