Tagging Spanish Words with NLTK Using a Corpus

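First, you need to read the tagged sentences from a corpus.
NLTK provides a uniform interface, so you do not have to worry about the differing formats of different corpora. You simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlepre.com/svn/trunk/nltk_data/index.xml.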

Then, you must choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.
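To make concrete what "training" means for the simplest N-gram tagger, here is a minimal pure-Python sketch of the idea behind a unigram tagger: remember, for each word, the tag it carried most often in the training data. This is an illustration under stated assumptions, not NLTK's implementation; the function names `train_unigram` and `tag_unigram`, the toy sentences, and the simplified tag names are all hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, tokens):
    """Tag each token independently; words never seen in training get None."""
    return [(tok, model.get(tok)) for tok in tokens]


# Hypothetical toy training data with simplified tags (not taken from cess_esp).
toy_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB")],
    [("el", "DET"), ("come", "NOUN")],
    [("ella", "PRON"), ("come", "VERB")],
]

model = train_unigram(toy_sents)
result = tag_unigram(model, ["el", "perro", "come"])
print(result)
```

The unseen word "perro" receives no tag, which is the main weakness of a tagger that relies only on what it has observed.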

Then, you can use the tagger to tag the sentences you want. Here is some example code:

    # Requires the cess_esp corpus, e.g. via nltk.download('cess_esp').
    from nltk.corpus import cess_esp as cess
    from nltk import UnigramTagger as ut
    from nltk import BigramTagger as bt

    # Read the corpus into a list;
    # each entry in the list is one tagged sentence.
    cess_sents = cess.tagged_sents()

    # Train the unigram tagger.
    uni_tag = ut(cess_sents)

    sentence = "Hola , esta foo bar ."

    # The tagger expects a list of tokens.
    uni_tag.tag(sentence.split(" "))

    # Split the corpus into training and testing sets.
    train = int(len(cess_sents) * 90 / 100)  # 90%

    # Train a bigram tagger on the training data only.
    bi_tag = bt(cess_sents[:train])

    # Evaluate on the remaining 10% held out for testing.
    # (Recent NLTK releases prefer accuracy() over evaluate().)
    bi_tag.evaluate(cess_sents[train:])

    # Use the tagger.
    bi_tag.tag(sentence.split(" "))
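A bigram tagger on its own tends to leave many tokens untagged, because any word/previous-tag context it did not see in training yields `None`, and that `None` then becomes the context for the next word. One common remedy is to chain taggers with the `backoff` parameter. The sketch below uses a tiny inline training set so it runs without downloading a corpus; the sentences, the simplified tags, and the choice of `"nc"` as the default tag are assumptions made for illustration only.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Hypothetical toy training data with simplified tags (not taken from cess_esp).
train_sents = [
    [("el", "da"), ("gato", "nc"), ("come", "vm"), (".", "Fp")],
    [("la", "da"), ("casa", "nc"), ("es", "vs"), ("grande", "aq"), (".", "Fp")],
]
tokens = ["el", "perro", "come", "."]

# A bigram tagger alone: the unseen word "perro" gets None,
# and every later token in the sentence then gets None as well.
bigram_only = BigramTagger(train_sents).tag(tokens)

# Chain: bigram -> unigram -> default tag for anything still unknown.
default = DefaultTagger("nc")
unigram = UnigramTagger(train_sents, backoff=default)
bigram = BigramTagger(train_sents, backoff=unigram)
tagged = bigram.tag(tokens)

print(bigram_only)
print(tagged)
```

With the real corpus, the same chain can be built by passing `cess_sents[:train]` in place of `train_sents`.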

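Training a tagger on a large corpus can take a significant amount of time. Instead of training a tagger every time you need one, it is convenient to save a trained tagger to a file for later reuse.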

See the "Storing Taggers" section in http://nltk.googlepre.com/svn/trunk/doc/book/ch05.html.
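As a rough sketch of that approach, a trained tagger can be serialized with Python's standard `pickle` module and loaded back later without retraining. The toy training sentence, its tags, and the file name `unigram_tagger.pkl` are assumptions for illustration; in practice you would train on the corpus as shown earlier. Only unpickle files you created yourself or otherwise trust.

```python
import os
import pickle
import tempfile

from nltk.tag import UnigramTagger

# Hypothetical toy training data; in practice this would be cess.tagged_sents().
train_sents = [
    [("Hola", "i"), (",", "Fc"), ("mundo", "nc"), (".", "Fp")],
]
tagger = UnigramTagger(train_sents)

# Save the trained tagger to a file (the path is an arbitrary example).
path = os.path.join(tempfile.gettempdir(), "unigram_tagger.pkl")
with open(path, "wb") as f:
    pickle.dump(tagger, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later, possibly in another process: load it back without retraining.
with open(path, "rb") as f:
    loaded = pickle.load(f)

tokens = ["Hola", ",", "mundo", "."]
print(loaded.tag(tokens))
```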

Original article: http://outofmemory.cn/zaji/5646621.html
