(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))
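My goal is to obtain grammar rules like the following:

    ROOT --> S
    S --> NP VP
    NP --> NN...

As I said, I need to do this recursively, without the NLTK package, any other module, or regular expressions. Here is what I have so far. The parameter tree is the Penn tree split on every space.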
    def extract_rules(tree):
        tree = tree[1:-1]
        print("\n\n")
        if len(tree) == 0:
            return

        root_node = tree[0]
        print("Current Root: "+root_node)
        remaining_tree = tree[1:]
        right_side = []

        temp_tree = list(remaining_tree)
        print("remaining_tree: ", remaining_tree)
        symbol = remaining_tree.pop(0)

        print("Symbol: "+symbol)

        if symbol not in ["(", ")"]:
            print("CASE: No Brackets")
            print("Rule: "+root_node+" --> "+str(symbol))

            right_side.append(symbol)

        elif symbol == "(":
            print("CASE: opening Bracket")
            print("Temp Tree: ", temp_tree)
            cursubtree_end = bracket_depth(temp_tree)
            print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
            cursubtree_start = temp_tree.index(symbol)
            cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
            print("Subtree: ", cursubtree)
            rnode = extract_rules(cursubtree)
            if rnode:
                right_side.append(rnode)
                print("Rule: "+root_node+" --> "+str(rnode))

        print(right_side)
        return root_node


    def bracket_depth(tree):
        counter = 0
        position = 0
        subtree = []

        for i, char in enumerate(tree):
            if char == "(":
                counter = counter + 1
            if char == ")":
                counter = counter - 1
            if counter == 0 and i != 0:
                counter = i
                position = i
                break

        subtree = tree[0:position+1]

        return position
Currently it works for the first subtree of S, but none of the other subtrees are parsed recursively. I'd be glad of any help.
Solution

My inclination is to keep this as simple as possible rather than try to reinvent the parsing modules you're currently not allowed to use. Something like:

    string = '''
        (ROOT
            (S
                (NP (NN Carnac) (DT the) (NN Magnificent))
                (VP (VBD gave) (NP (DT a) (NN talk)))
            )
        )
    '''

    def is_symbol_char(character):
        '''
        Predicate to test if a character is valid
        for use in a symbol, extend as needed.
        '''

        return character.isalpha() or character in '-=$!?.'

    def tokenize(characters):
        '''
        Process characters into a nested structure. The original string
        '(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
        '''

        tokens = []

        while characters:
            character = characters.pop(0)

            if character.isspace():
                pass  # nothing to do, ignore it

            elif character == '(':  # signals start of recursive analysis (push)
                characters, result = tokenize(characters)
                tokens.append(result)

            elif character == ')':  # signals end of recursive analysis (pop)
                break

            elif is_symbol_char(character):
                # if it looks like a symbol, collect all
                # subsequent symbol characters
                symbol = ''

                while is_symbol_char(character):
                    symbol += character
                    character = characters.pop(0)

                # push unused non-symbol character back onto characters
                characters.insert(0, character)

                tokens.append(symbol)

        # Return whatever tokens we collected and any characters left over
        return characters, tokens

    def extract_rules(tokens):
        ''' Recursively walk tokenized data extracting rules. '''

        head, *tail = tokens

        print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])

        for token in tail:  # recurse
            if isinstance(token, list):
                extract_rules(token)

    characters, tokens = tokenize(list(string))

    # After a successful tokenization, all the characters should be consumed
    assert not characters, "Didn't consume all the input!"

    print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')

    extract_rules(tokens[0])
OUTPUT
    Tokens:

    ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

    Rules:

    ROOT --> S
    S --> NP VP
    NP --> NN DT NN
    NN --> Carnac
    DT --> the
    NN --> Magnificent
    VP --> VBD NP
    VBD --> gave
    NP --> DT NN
    DT --> a
    NN --> talk
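If the rules are needed as data rather than as printed lines, the same recursive walk can return them instead. The following is only a minimal sketch, assuming the nested-list shape that tokenize produces; collect_rules is a hypothetical helper name and is not part of the answer's code, and the tree is written out by hand so the snippet runs on its own.

```python
def collect_rules(tokens):
    """Return (head, right-hand side) pairs for a nested list such as ['DT', 'the']."""
    head, *tail = tokens
    # Each child contributes its label if it is a subtree, or itself if it is a word.
    rhs = [child[0] if isinstance(child, list) else child for child in tail]
    rules = [(head, rhs)]
    for child in tail:
        if isinstance(child, list):
            rules.extend(collect_rules(child))  # recurse into each subtree
    return rules


# Nested structure for the simplified tree, written out by hand (assumption).
tree = ['ROOT', ['S',
                 ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']],
                 ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]

rules = collect_rules(tree)
for head, rhs in rules:
    print(head, '-->', *rhs)
```

Returning a list keeps the traversal order identical to the printed version while letting the caller deduplicate or count rules afterwards.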
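Note

I changed the original tree, because this clause:

    (NP ((DT a) (NN talk)))

seemed incorrect: it produces an empty node in the syntax tree graphers available on the web. I simplified it to:

    (NP (DT a) (NN talk))

Adjust as needed.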
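If the extra pair of parentheses has to be accepted as input instead of edited out by hand, one option is to merge label-less groups into their parent after tokenizing. This is a sketch under the assumption that the tokenizer turns `(NP ((DT a) (NN talk)))` into `['NP', [['DT', 'a'], ['NN', 'talk']]]`; splice_unlabelled is a hypothetical helper name, not something from the original code.

```python
def splice_unlabelled(node):
    """Return a copy of node with label-less child lists merged into their parent."""
    if not isinstance(node, list):
        return node  # a plain label or word, nothing to do
    result = []
    for child in node:
        child = splice_unlabelled(child)  # clean up deeper levels first
        if isinstance(child, list) and child and isinstance(child[0], list):
            # First element is a subtree, not a label: lift its members up one level.
            result.extend(child)
        else:
            result.append(child)
    return result


# Assumed tokenizer output for the original clause (NP ((DT a) (NN talk))).
nested = ['NP', [['DT', 'a'], ['NN', 'talk']]]
flattened = splice_unlabelled(nested)
print(flattened)
```

Running the rule extraction on the flattened structure then yields the same NP rule as the simplified tree.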