Natural Language Processing with Python, Study Notes (41): 5.2 Tagged Corpora


5.2   Tagged Corpora

 

 

Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():

 

>>> import nltk
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).

 

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
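To make the two steps above concrete, here is a minimal pure-Python sketch of the same idea. It does not use NLTK. The helper names `split_tagged_token` and `parse_tagged_text` are hypothetical, and the sketch assumes the tag follows the last `/` in each token and should be uppercased, as in the outputs shown above.

```python
def split_tagged_token(s, sep="/"):
    """Split 'word/tag' at the LAST separator and uppercase the tag.

    Hypothetical helper written for illustration; it only mimics the
    behaviour seen in the str2tuple() examples above.
    """
    loc = s.rfind(sep)
    if loc >= 0:
        return (s[:loc], s[loc + len(sep):].upper())
    # No separator found: treat the whole string as an untagged word.
    return (s, None)


def parse_tagged_text(text, sep="/"):
    """Tokenize on whitespace, then convert each 'word/tag' string to a tuple."""
    return [split_tagged_token(token, sep) for token in text.split()]


sample = "The/AT Fulton/NP-tl County/NN-tl jury/NN ./."
tagged = parse_tagged_text(sample)
print(tagged)
```

Splitting at the last separator rather than the first means punctuation tokens such as `./.` still come out as a word and a tag.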

 

Reading Tagged Corpora

 

 

Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

Other corpora use a variety of formats for storing part-of-speech tags. NLTK's corpus readers provide a uniform interface so that you don't have to be concerned with the different file formats. In contrast with the file extract shown above, the corpus reader for the Brown Corpus represents the data as shown below. Note that part-of-speech tags have been converted to uppercase, which has become standard practice since the Brown Corpus was published.

 

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]

Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method. Here are some more examples, again using the output format illustrated for the Brown Corpus:

 

>>> print(nltk.corpus.nps_chat.tagged_words())
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

Not all corpora employ the same set of tags; see the tagset help functionality and the readme() methods mentioned above for documentation. Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to a simplified tagset:

 

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]
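The idea of mapping a detailed tagset onto a simplified one can be sketched without NLTK. The table `SIMPLE_TAGS` and the helper names below are hypothetical and cover only the three Brown tags that appear in the outputs above; this is not NLTK's actual mapping table. As a side note, in NLTK 3 the `simplify_tags` argument is, as far as I know, no longer available, and `tagged_words(tagset='universal')` plays a similar role with a different set of simplified tags.

```python
# Illustrative subset only (assumption): NOT NLTK's full mapping table.
SIMPLE_TAGS = {"AT": "DET", "NP": "NP", "NN": "N"}


def simplify_tag(tag, mapping=SIMPLE_TAGS):
    """Drop any '-TL' style suffix, then look the base tag up in the mapping.

    Tags that are missing from the mapping are returned unchanged.
    """
    base = tag.split("-")[0]
    return mapping.get(base, tag)


def simplify_tagged_words(tagged_words, mapping=SIMPLE_TAGS):
    """Apply simplify_tag() to every (word, tag) pair."""
    return [(word, simplify_tag(tag, mapping)) for word, tag in tagged_words]


brown_sample = [("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL")]
simplified = simplify_tagged_words(brown_sample)
print(simplified)
```

Falling back to the original tag for unknown entries keeps the sketch safe to run on tags outside the small table.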

Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan. These usually contain non-ASCII text, and Python 2 displays such text as hexadecimal escape sequences when printing a larger structure such as a list.

 

>>> nltk.corpus.sinica_treebank.tagged_words()
[('\xe4\xb8\x80', 'Neu'), ('\xe5\x8f\x8b\xe6\x83\x85', 'Nad'), ...]
>>> nltk.corpus.indian.tagged_words()
[('\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0', 'NN'),
('\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8', 'NN'),
...]
>>> nltk.corpus.mac_morpho.tagged_words()
[('Jersei', 'N'), ('atinge', 'V'), ('m\xe9dia', 'N'), ...]
>>> nltk.corpus.conll2002.tagged_words()
[('Sao', 'NC'), ('Paulo', 'VMI'), ('(', 'Fpa'), ...]
>>> nltk.corpus.cess_cat.tagged_words()
[('El', 'da0ms0'), ('Tribunal_Suprem', 'np0000o'), ...]

If your environment is set up correctly, with appropriate editors and fonts, you should be able to display individual strings in a human-readable way. For example, Figure 5.1 shows data accessed using nltk.corpus.indian.
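The difference between printing a whole list and printing individual strings can be sketched as follows. This is a small illustration under two assumptions: it runs on Python 3, and the escaped byte strings in the sinica_treebank output above are UTF-8 encoded. The sample data is copied by hand from that output rather than read from the corpus.

```python
# Byte strings copied from the sinica_treebank output shown earlier
# (assumed to be UTF-8 encoded).
tagged_bytes = [
    (b"\xe4\xb8\x80", "Neu"),
    (b"\xe5\x8f\x8b\xe6\x83\x85", "Nad"),
]

# Printing the list uses repr() on each element, so the escapes stay visible.
print(tagged_bytes)

# Decoding each word gives ordinary text strings.
decoded = [(word.decode("utf-8"), tag) for word, tag in tagged_bytes]

# Printing the strings one at a time shows them in human-readable form,
# provided the terminal font can render the characters.
for word, tag in decoded:
    print(word, tag)
```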

 

Original article: https://outofmemory.cn/zaji/2086696.html
