python – Can NLTK's XMLCorpusReader be used on a multi-file corpus?

I am trying to use NLTK to do some work on the New York Times Annotated Corpus, which contains an XML file for each article (in the News Industry Text Format, NITF).

I can parse a single document without any problem, as follows:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

I need to work on the entire corpus.
I tried doing this:

reader = XMLCorpusReader('corpora/nytimes',r'.*')

But this does not create a usable reader object. For example,

len(reader.words())

fails with:

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How can I read this corpus into NLTK?

I am new to NLTK, so any help is greatly appreciated.

Solution

I'm not an NLTK expert, so there may be an easier way to do this, but naively I would suggest using Python's glob module. It supports Unix-style pathname pattern expansion.

from glob import glob
texts = glob('nltk_data/corpora/nytimes/*')

This gives you, as a list, the names of the files that match the given expression.
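The single-document path in the question (nytimes/1987/01/01/0000000.xml) suggests that articles are nested by year, month, and day, in which case a single-level `*` pattern would return only the top-level directories. Below is a minimal sketch, assuming that layout, which uses glob's `**` pattern with recursive=True to collect XML files at any depth. The helper name find_xml_files is hypothetical and not part of NLTK or of the original answer.

```python
import os
from glob import glob


def find_xml_files(root):
    """Return root-relative paths of every .xml file below root, sorted."""
    # '**' matches any number of nested directories when recursive=True.
    pattern = os.path.join(root, '**', '*.xml')
    paths = glob(pattern, recursive=True)
    # Convert to '/'-separated paths relative to root.
    return sorted(os.path.relpath(p, root).replace(os.sep, '/') for p in paths)


# Example (path taken from the question):
# texts = find_xml_files('nltk_data/corpora/nytimes')
```

Returning paths relative to the corpus root keeps them in the same form as the file name passed to XMLCorpusReader in the question.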
Then, depending on how many of them you want or need to have open at once, you can:

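import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes/'
for item_path in texts:
    # File IDs are resolved relative to root, so strip the root prefix from each glob result.
    reader = XMLCorpusReader(root, [os.path.relpath(item_path, root)])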

As @waffle paradox suggested, you can also narrow this list of texts down to fit your specific needs.
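The TypeError in the question appears to come from calling words() with no argument on a reader that holds more than one file, since the call expects a single file identifier. Below is a minimal sketch of one way around that, assuming the reader is built with a pattern such as r'.*\.xml' and walked through its fileids() method: words() is called once per file and the counts are summed. The helper name count_words is hypothetical, and it accepts any object that offers fileids() and words(fileid).

```python
def count_words(reader):
    """Total number of word tokens across every file the reader lists."""
    total = 0
    for fileid in reader.fileids():
        # Pass one file ID at a time, as the error message asks for.
        total += len(reader.words(fileid))
    return total


# Usage with NLTK (paths follow the question; not executed here):
# from nltk.corpus.reader import XMLCorpusReader
# reader = XMLCorpusReader('nltk_data/corpora/nytimes', r'.*\.xml')
# print(count_words(reader))
```

Processing one file per call also keeps memory use bounded by the largest single article rather than by the whole corpus.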


Original URL: http://outofmemory.cn/langs/1194554.html
