Bayesian Algorithm: Sentiment Classification (Python)

Sun Yat-sen University Undergraduate Lab Report
Course: Artificial Intelligence

Experiment Topic

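Use the naive Bayes method to train a sentiment classifier on text data, applying Laplace smoothing:
Train the sentiment classifier on the given text dataset, evaluate it on the test set, and compute the accuracy. (Hint: the sklearn machine learning library can be used for text feature (tf-idf) extraction.)
Question: in the preceding text classification algorithm, what happens if a word in the test text never appears in the training text?
Its estimated conditional probability is 0, which distorts the posterior probability and biases the classification. The remedy is Bayesian estimation: add a constant λ > 0 to every count before normalizing, so that no estimate is exactly 0; λ = 1 gives Laplace smoothing.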
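For reference, the standard textbook form of Bayesian estimation for naive Bayes is written out below. The notation is an assumption that follows common convention rather than something recovered from the report: N training samples, K classes c_k, S_j possible values a_{jl} of the j-th feature, and I the indicator function.

```latex
P_\lambda\left(X^{(j)} = a_{jl} \mid Y = c_k\right)
  = \frac{\sum_{i=1}^{N} I\left(x_i^{(j)} = a_{jl},\; y_i = c_k\right) + \lambda}
         {\sum_{i=1}^{N} I\left(y_i = c_k\right) + S_j \lambda},
\qquad
P_\lambda\left(Y = c_k\right)
  = \frac{\sum_{i=1}^{N} I\left(y_i = c_k\right) + \lambda}{N + K \lambda}
```

With λ = 0 this reduces to maximum likelihood estimation, which is where the zero probabilities come from.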

Experiment Content

Algorithm Principle

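(1) Read train.txt and test.txt with the read_data() function.
(2) Process each line that was read and package it into a node object for later use.
(3) For each line of test.txt, use the bayes() function to find the most probable emotion given the training set.
(4) Compare the predicted emotion with the label in test.txt, count the correct predictions, and compute the accuracy.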

Key Code (with comments)

(1) Packaging each line of text

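# Bundle the information for one sentence
class node:
    def __init__(self, word_set, dic, emotion, word):
        self.word_set = word_set  # set of distinct words
        self.dic = dic            # relative frequency of each word
        self.emotion = emotion    # emotion label
        self.word = word          # list of words in the sentence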

(2) Reading the data

# Read the data
def read_data(file):
    with open(file, 'r') as f:
        lines = f.readlines()
    data = []
    # Skip the first line (presumably a header row)
    for line in lines[1:]:
        fields = line.split()
        word = fields[3:]          # words of the sentence
        emotion = int(fields[1])   # emotion label
        dic = {}
        word_set = set()
        for w in word:
            if w in word_set:
                dic[w] += 1
            else:
                word_set.add(w)
                dic[w] = 1
        # Convert raw counts to relative frequencies
        for w in dic:
            dic[w] = dic[w] / len(word)
        data.append(node(word_set, dic, emotion, word))
    return data

train_data = read_data('train.txt')
test_data = read_data('test.txt')
# for i in range(len(test_data)):
#     print(test_data[i].word_set)
#     print(test_data[i].dic)
#     print(test_data[i].emotion)
#     print()
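To make the parsing in read_data() concrete, here is a minimal sketch of what happens to a single line. The sample line and the column layout (an ID, the emotion label, one further column, then the words) are assumptions inferred from the indices [1] and [3:] used above; the real train.txt may differ. parse_line is a hypothetical helper, not part of the report's code.

```python
from collections import Counter

def parse_line(line):
    """Split one data line into (emotion, words, relative word frequencies)."""
    fields = line.split()
    emotion = int(fields[1])      # column 1: emotion label
    words = fields[3:]            # columns 3 onward: the words of the sentence
    counts = Counter(words)       # raw count of each distinct word
    freqs = {w: c / len(words) for w, c in counts.items()}
    return emotion, words, freqs

# Hypothetical sample line; the layout is assumed, not taken from the dataset.
emotion, words, freqs = parse_line("1 3 joy happy day happy")
print(emotion)          # 3
print(words)            # ['happy', 'day', 'happy']
print(freqs["happy"])   # 2 of the 3 words, i.e. 2/3
```

The frequencies of one sentence always sum to 1, which is why bayes() can treat them as a per-sentence word distribution.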

(3) The Bayes function

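# Bayesian estimation
def bayes(x, train_set):
    lemd = 1  # smoothing parameter lambda (1 = Laplace smoothing)
    p = [0] * 7  # accumulated score per emotion label
    for line in train_set:
        px = 1  # smoothed probability of x given this training sentence
        p_k = [0] * len(x)  # p_k[k]: frequency of the k-th word of x in this training sentence
        p_k_sum = 0  # sum of p_k
        for k in range(len(x)):
            if x[k] in line.word_set:
                p_k[k] = line.dic[x[k]]
            else:
                p_k[k] = 0
            p_k_sum += p_k[k]
        for k in range(len(x)):
            px = px * (p_k[k] + lemd) / (p_k_sum + len(x) * lemd)
        p[line.emotion] += px / len(train_set)  # each training sentence gets prior weight 1/n
    # Pick the emotion with the highest accumulated score
    emo = 0
    for i in range(7):
        if p[i] > p[emo]:
            emo = i
    return emo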
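The bayes() function above scores each training sentence separately and then sums the scores per emotion. For comparison, a more conventional multinomial naive Bayes pools word counts per class and applies Laplace smoothing to those counts. The sketch below is one possible way to do that, not the report's method; train_nb, predict_nb, and the toy data are hypothetical, and log-probabilities are used to avoid underflow on long sentences.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (label, words). Returns the counts needed for prediction."""
    class_docs = Counter()                 # number of sentences per label
    word_counts = defaultdict(Counter)     # label -> word -> count
    vocab = set()                          # all distinct training words
    for label, words in samples:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def predict_nb(words, model, lam=1.0):
    """Return the label with the highest smoothed log-posterior."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    n_classes = len(class_docs)
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Smoothed prior: (N_k + lam) / (N + K * lam)
        score = math.log((class_docs[label] + lam) / (n_docs + n_classes * lam))
        total = sum(word_counts[label].values())
        for w in words:
            # Smoothed likelihood: (count + lam) / (total + |V| * lam);
            # a word never seen in training simply has count 0
            score += math.log((word_counts[label][w] + lam) / (total + len(vocab) * lam))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, made up for illustration only.
toy = [(0, ["sad", "rain"]), (0, ["sad", "cold"]),
       (1, ["happy", "sun"]), (1, ["happy", "warm"])]
model = train_nb(toy)
print(predict_nb(["happy", "rain"], model))   # 1
print(predict_nb(["sad", "unseen"], model))   # 0
```

In the second call the word "unseen" does not occur in the toy training data, yet smoothing keeps both class scores finite, so the word "sad" still decides the outcome.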

(4) Computing the accuracy

# Compute the accuracy
correct_num = 0  # number of correct predictions
correct_rate = 0  # accuracy
for test in test_data:
    emo = bayes(test.word,train_data)
    if emo == test.emotion:
        correct_num += 1
    # print("pre",emo)
    # print("fact",test.emotion)
    # print(emo==test.emotion)
correct_rate = correct_num/len(test_data)
print(correct_rate)
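Overall accuracy can hide which emotions are misclassified. As a possible complement, the sketch below tallies accuracy per label; per_label_accuracy and the sample lists are hypothetical and stand in for the predictions and labels produced by the loop above.

```python
from collections import Counter

def per_label_accuracy(predicted, actual):
    """Return {label: fraction of samples with that true label predicted correctly}."""
    totals = Counter(actual)
    hits = Counter(a for p, a in zip(predicted, actual) if p == a)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up predictions and labels, for illustration only.
predicted = [1, 1, 2, 3, 3, 3]
actual    = [1, 2, 2, 3, 3, 1]
print(per_label_accuracy(predicted, actual))   # {1: 0.5, 2: 0.5, 3: 1.0}
```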

Experimental Results and Analysis
Input: none
Output:

