python – Parsing a Penn syntax tree to extract its grammar rules

I have a Penn syntax tree, and I want to recursively extract all the grammar rules it contains.

    (ROOT (S
        (NP (NN Carnac) (DT the) (NN Magnificent))
        (VP (VBD gave) (NP ((DT a) (NN talk))))))

My goal is to obtain grammar rules like these:

    ROOT --> S
    S --> NP VP
    NP --> NN...

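As I said, I need to do this recursively, without the NLTK package, any other module, or regular expressions. This is what I have so far. The parameter tree is the Penn tree split on every space.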

    def extract_rules(tree):
        tree = tree[1:-1]
        print("\n\n")
        if len(tree) == 0:
            return
        root_node = tree[0]
        print("Current Root: " + root_node)
        remaining_tree = tree[1:]
        right_side = []
        temp_tree = list(remaining_tree)
        print("remaining_tree: ", remaining_tree)
        symbol = remaining_tree.pop(0)
        print("Symbol: " + symbol)
        if symbol not in ["(", ")"]:
            print("CASE: No Brackets")
            print("Rule: " + root_node + " --> " + str(symbol))
            right_side.append(symbol)
        elif symbol == "(":
            print("CASE: opening Bracket")
            print("Temp Tree: ", temp_tree)
            cursubtree_end = bracket_depth(temp_tree)
            print("Subtree ends at position " + str(cursubtree_end)
                  + " and Element is " + temp_tree[cursubtree_end])
            cursubtree_start = temp_tree.index(symbol)
            cursubtree = temp_tree[cursubtree_start:cursubtree_end + 1]
            print("Subtree: ", cursubtree)
            rnode = extract_rules(cursubtree)
            if rnode:
                right_side.append(rnode)
                print("Rule: " + root_node + " --> " + str(rnode))
        print(right_side)
        return root_node


    def bracket_depth(tree):
        counter = 0
        position = 0
        subtree = []
        for i, char in enumerate(tree):
            if char == "(":
                counter = counter + 1
            if char == ")":
                counter = counter - 1
            if counter == 0 and i != 0:
                counter = i
                position = i
                break
        subtree = tree[0:position + 1]
        return position

Currently it works for the first subtree of S, but none of the other subtrees get parsed recursively. I'd be glad for any help.
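The question does not show how the token list passed to extract_rules is built. One way to get a list in which every bracket is its own element, using only string methods, is sketched below. The helper name split_penn_tree is hypothetical and not part of the original post.

```python
def split_penn_tree(text):
    """Split a Penn tree string into a flat list of '(', ')' and symbol tokens.

    Assumption: symbols never contain parentheses or whitespace.
    """
    # Pad every bracket with spaces so str.split() isolates it.
    return text.replace("(", " ( ").replace(")", " ) ").split()


tree = split_penn_tree("(NP (DT a) (NN talk))")
print(tree)
# ['(', 'NP', '(', 'DT', 'a', ')', '(', 'NN', 'talk', ')', ')']
```

With brackets as separate elements, checks such as symbol == "(" in the code above compare against whole tokens rather than fragments like "(NP".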

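Solution

I'd lean toward keeping this as simple as possible, rather than trying to reinvent the parsing modules you're currently not allowed to use. Something like: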

    string = '''
        (ROOT
            (S
                (NP (NN Carnac) (DT the) (NN Magnificent))
                (VP (VBD gave) (NP (DT a) (NN talk)))
            )
        )
    '''


    def is_symbol_char(character):
        '''
        Predicate to test if a character is valid
        for use in a symbol, extend as needed.
        '''
        return character.isalpha() or character in '-=$!?.'


    def tokenize(characters):
        '''
        Process characters into a nested structure.  The original string
        '(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
        '''
        tokens = []

        while characters:
            character = characters.pop(0)

            if character.isspace():
                pass  # nothing to do, ignore it

            elif character == '(':  # signals start of recursive analysis (push)
                characters, result = tokenize(characters)
                tokens.append(result)

            elif character == ')':  # signals end of recursive analysis (pop)
                break

            elif is_symbol_char(character):
                # if it looks like a symbol, collect all
                # subsequent symbol characters
                symbol = ''

                while is_symbol_char(character):
                    symbol += character
                    character = characters.pop(0)

                # push unused non-symbol character back onto characters
                characters.insert(0, character)

                tokens.append(symbol)

        # Return whatever tokens we collected and any characters left over
        return characters, tokens


    def extract_rules(tokens):
        ''' Recursively walk tokenized data extracting rules. '''
        head, *tail = tokens

        print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])

        for token in tail:  # recurse
            if isinstance(token, list):
                extract_rules(token)


    characters, tokens = tokenize(list(string))

    # After a successful tokenization, all the characters should be consumed
    assert not characters, "Didn't consume all the input!"

    print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')

    extract_rules(tokens[0])

Output

    Tokens:

    ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

    Rules:

    ROOT --> S
    S --> NP VP
    NP --> NN DT NN
    NN --> Carnac
    DT --> the
    NN --> Magnificent
    VP --> VBD NP
    VBD --> gave
    NP --> DT NN
    DT --> a
    NN --> talk
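If the rules are needed as data rather than as printed text, the same recursive walk can return them. The following is a minimal sketch that assumes the nested-list structure shown above. The name collect_rules and the (lhs, rhs) tuple layout are illustrative choices, not part of the original answer.

```python
def collect_rules(node):
    """Return a list of (lhs, rhs) pairs for a nested-list tree.

    A node looks like ['NP', ['DT', 'a'], ['NN', 'talk']]: the first
    element is the label, the rest are child nodes or terminal words.
    """
    head, *children = node
    # The right-hand side is each child's label, or the word itself.
    rhs = tuple(c[0] if isinstance(c, list) else c for c in children)
    rules = [(head, rhs)]
    for child in children:
        if isinstance(child, list):  # recurse into non-terminal children
            rules.extend(collect_rules(child))
    return rules


tree = ['NP', ['DT', 'a'], ['NN', 'talk']]
for lhs, rhs in collect_rules(tree):
    print(lhs, '-->', *rhs)
```

Returning a list keeps the traversal order of the printed version and makes later steps, such as removing duplicate rules, straightforward.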

Note

I changed the original tree because this clause:

    (NP ((DT a) (NN talk)))

looked incorrect: it produces an empty node in the syntax tree graphers available on the web. So I simplified it to:

    (NP (DT a) (NN talk))

Adjust as needed.
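If the input cannot be edited by hand, one possible adjustment is to merge such label-less groups into their parent after tokenizing. This is a rough sketch, assuming the nested-list structure produced by a tokenizer like the one above, where the doubled brackets show up as a list whose first element is itself a list. The function name splice_unlabeled is hypothetical.

```python
def splice_unlabeled(node):
    """Return a copy of a nested-list tree with label-less nodes
    merged into their parent.

    A label-less node is a non-empty list whose first element is a
    list, e.g. [['DT', 'a'], ['NN', 'talk']] from '((DT a) (NN talk))'.
    """
    if not isinstance(node, list):
        return node  # terminal word or label, nothing to do
    result = []
    for child in node:
        child = splice_unlabeled(child)
        if isinstance(child, list) and child and isinstance(child[0], list):
            result.extend(child)  # lift the grandchildren up one level
        else:
            result.append(child)
    return result


broken = ['NP', [['DT', 'a'], ['NN', 'talk']]]
print(splice_unlabeled(broken))
# ['NP', ['DT', 'a'], ['NN', 'talk']]
```

The sketch only lifts label-less groups that appear as children; a tree whose outermost list has no label would need to be unwrapped separately.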


Source: http://outofmemory.cn/langs/1193958.html
