Counting word and phrase frequencies with Python NLTK

I am using NLTK and trying to count the word phrases up to a certain length in a particular document, along with the frequency of each phrase. I tokenize the string to get a list of tokens.

from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import *

data = ["this", "is", "not", "a", "test", "this", "real", "test"]

# Build all bigrams (n-grams of length 2) from the token list
bigrams = ngrams(data, 2)

# Count how often each bigram occurs
bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1

The code above produces output like the following:

(('is','this'),1)
(('test',2)
(('a','test'),3)
(('this','is'),4)
(('is','not'),1)
(('real',2)
(('is','real'),2)
(('not','a'),3)

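This is part of what I am looking for.

My question is: is there a more convenient way to handle phrases of length 4 or 5 without duplicating this code and changing only the count variable?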

Solution

Since you tagged this nltk, here is how to do it with nltk's own methods, which offer more functionality than the ones in the standard Python collections.

from nltk import ngrams, FreqDist

# Map each n-gram size to a FreqDist of the n-grams of that size
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(data, size))

Each element of the dictionary all_counts is itself a dictionary of n-gram frequencies. For example, you can get the five most common trigrams:

all_counts[3].most_common(5)
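For comparison, the same "one counter per n-gram size" idea can be sketched with only the standard library. This is not the nltk approach from the answer: it swaps in collections.Counter for FreqDist and builds the n-grams with zip. The helper name count_ngrams and its sizes parameter are made up for illustration.

```python
from collections import Counter


def count_ngrams(tokens, sizes=(2, 3, 4, 5)):
    """Return a dict mapping each n in sizes to a Counter of n-gram tuples.

    Hypothetical helper for illustration; not part of nltk.
    """
    counts = {}
    for n in sizes:
        # Zipping n staggered slices of the token list yields every
        # consecutive run of n tokens as a tuple.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts[n] = Counter(grams)
    return counts


data = ["this", "is", "not", "a", "test", "this", "real", "test"]
all_counts = count_ngrams(data)

# Counter also provides most_common, so the lookup reads the same way
print(all_counts[3].most_common(5))
```

Adding another phrase length only means adding a number to sizes, so the counting loop is never duplicated.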


Original URL: http://outofmemory.cn/langs/1195005.html
