SpaCy parenthesis tokenization: (LRB, RRB) pairs not tokenized correctly


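Use a custom tokenizer that adds the rule r'\b\)\b' to the infixes. The regex matches a ) that is preceded by any word character (a letter, a digit, _, and, in Python 3, some other rare characters) and is also followed by a character of that type.

You can customize this regex further; how far you go depends largely on the contexts in which you want to match ).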
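To see what the pattern does before wiring it into spaCy, it can be tried on its own with Python's `re` module. The following is only a minimal sketch: the names `CLOSE_PAREN_INFIX` and `paren_split_points` are illustrative and do not come from spaCy or from the original answer.

```python
import re

# The infix pattern from the answer: a ")" with a word character
# immediately before it and immediately after it.
CLOSE_PAREN_INFIX = re.compile(r"\b\)\b")


def paren_split_points(text):
    """Return the character offsets at which the pattern matches in text."""
    return [m.start() for m in CLOSE_PAREN_INFIX.finditer(text)]


# ")" sits between "N" and "A", so it matches.
print(paren_split_points("Indonesia (CNN)AirAsia"))   # [14]
# ")" is followed by a space, so there is no match.
print(paren_split_points("Indonesia (CNN) AirAsia"))  # []
# ")" is followed by ".", so there is no match either.
print(paren_split_points("end of sentence)."))        # []
```

The pattern only fires when the closing parenthesis is glued to word characters on both sides. The other two cases are left to spaCy's default rules, which presumably already handle them.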

See the full Python demo:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(nlp):
    # Prepend the extra rule; list() works whether Defaults.infixes is a tuple or a list
    infixes = [r"\b\)\b"] + list(nlp.Defaults.infixes)
    infix_re = compile_infix_regex(infixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    # Rebuild the tokenizer with the default prefix/suffix rules and the extended infixes
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match,
                     rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Indonesia (CNN)AirAsia ")
print([(t.text, t.lemma_, t.pos_, t.tag_) for t in doc])

Output:

[('Indonesia', 'Indonesia', 'PROPN', 'NNP'), ('(', '(', 'PUNCT', '-LRB-'), ('CNN', 'CNN', 'PROPN', 'NNP'), (')', ')', 'PUNCT', '-RRB-'), ('AirAsia', 'AirAsia', 'PROPN', 'NNP')]

