My text contains many sentences. How can I process it with nltk.ngrams?
Here is my code:
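import nltk
from nltk.util import ngrams

# `raw` is the full text as a single string (defined elsewhere).
sequence = nltk.tokenize.word_tokenize(raw)
bigram = ngrams(sequence, 2)
freq_dist = nltk.FreqDist(bigram)
prob_dist = nltk.MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()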
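However, the code above treats all the sentences as one sequence. The sentences are separate, and I assume that the last word of one sentence is unrelated to the first word of the next. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on freq_dist.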
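There are similar questions, such as "What are ngram counts and how to implement using nltk?", but they mostly deal with a single sequence of words.

Best answer

You can use the newer nltk.lm module. Here is an example; first get some data and tokenize it: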
import os
import requests
import io  # codecs

from nltk import word_tokenize, sent_tokenize

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

# Tokenize the text: one list of lowercased tokens per sentence.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Then build the language model:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n)  # Lets train a 3-grams maximum likelihood estimation model.
model.fit(train_data, padded_sents)
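To see why this preprocessing keeps sentences apart, it may help to look at the padding on its own. The sketch below uses a hypothetical two-sentence toy corpus (`toy_sents` is not from the original answer) and the `pad_both_ends` helper from `nltk.lm.preprocessing`: each sentence is wrapped in `<s>` and `</s>` before bigrams are taken, so no bigram spans a sentence boundary.

```python
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import bigrams

# Hypothetical toy corpus: two sentences that are already tokenized.
toy_sents = [["a", "b", "c"], ["c", "a"]]

# Pad each sentence separately, then extract its bigrams.
# With n=2, one "<s>" is added at the start and one "</s>" at the end.
padded_bigrams = [list(bigrams(pad_both_ends(sent, n=2))) for sent in toy_sents]

for sent_bigrams in padded_bigrams:
    print(sent_bigrams)
```

Sentence 1 ends with "c" and sentence 2 starts with "c", yet the pair ("c", "c") never appears; the boundary shows up as ("c", "</s>") and ("<s>", "c") instead.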
Get the counts:
model.counts['language']  # i.e. Count('language')
model.counts[['language']]['is']  # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never']  # i.e. Count('never'|'language is')
Get the probabilities:
model.score('is', 'language'.split())  # P('is'|'language')
model.score('never', 'language is'.split())  # P('never'|'language is')
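For an MLE model, the score is simply the count of the word after the context divided by the total count of that context. The following is a minimal sketch on a hypothetical toy corpus (`toy_sents` and `toy_model` are illustrative names, not part of the original answer), small enough to check by hand:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: two sentences that are already tokenized.
toy_sents = [["a", "b", "c"], ["c", "a"]]

# Train a bigram MLE model on the padded sentences.
train_data, vocab = padded_everygram_pipeline(2, toy_sents)
toy_model = MLE(2)
toy_model.fit(train_data, vocab)

context = ["c"]
count_c_a = toy_model.counts[context]["a"]    # times "a" follows "c"
count_c_any = toy_model.counts[context].N()   # times anything follows "c"
p_a_given_c = toy_model.score("a", context)   # count_c_a / count_c_any
p_c_given_c = toy_model.score("c", context)   # "c c" never occurs within a sentence

print(count_c_a, count_c_any, p_a_given_c, p_c_given_c)
```

Here "c" is followed once by "a" and once by the end-of-sentence marker, so P('a'|'c') is 1/2, while P('c'|'c') is 0 even though one sentence ends with "c" and the next begins with it.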
The notebook sometimes has trouble loading on the Kaggle platform, but it should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
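If the goal is only to keep the `freq_dist`, `prob_dist` and `number_of_bigrams` objects from the question, a lighter alternative (not part of the original answer, and shown here only as a sketch) is to extract bigrams per sentence and chain them into one `FreqDist`. The toy corpus `sents` is hypothetical; in practice it could be the per-sentence token lists produced by `sent_tokenize` and `word_tokenize`.

```python
from itertools import chain

from nltk import FreqDist, MLEProbDist
from nltk.util import ngrams

# Hypothetical toy corpus: two sentences that are already tokenized.
sents = [["a", "b", "c"], ["c", "a"]]

# Build bigrams inside each sentence, then chain them together,
# so that no bigram crosses a sentence boundary.
sentence_bigrams = chain.from_iterable(ngrams(sent, 2) for sent in sents)

freq_dist = FreqDist(sentence_bigrams)
prob_dist = MLEProbDist(freq_dist)
number_of_bigrams = freq_dist.N()

print(number_of_bigrams)
print(prob_dist.prob(("a", "b")))
```

Note that `prob_dist` here gives the relative frequency of a whole bigram among all bigrams, not the conditional probability P(word | context) that `model.score` returns, and this version adds no `<s>` or `</s>` padding.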