So here is one way to do it in R, using MorphAdorner, the Northwestern University lemmatizer.
    lemmatize <- function(wordlist) {
      # Query the MorphAdorner web service for the lemma of a single word
      get.lemma <- function(word, url) {
        response <- GET(url, query = list(spelling = word, standardize = "",
                                          wordClass = "", wordClass2 = "",
                                          corpusConfig = "ncf",    # Nineteenth Century Fiction
                                          media = "xml"))
        content <- content(response, type = "text")
        xml     <- xmlInternalTreeParse(content)
        return(xmlValue(xml["//lemma"][[1]]))
      }
      require(httr)
      require(XML)
      url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
      return(sapply(wordlist, get.lemma, url = url))
    }

    words <- c("is", "am", "was", "are")
    lemmatize(words)
    #   is   am  was  are
    # "be" "be" "be" "be"
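As I suspect you are aware, correct lemmatization requires knowledge of the word class (part of speech) and the contextually correct spelling, and it also depends on which corpus is being used.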
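To make the point about word class concrete, here is a minimal offline sketch. It does not call MorphAdorner at all; the table `toy_lemmas` and the helper `toy_lemma` are hypothetical names invented for this illustration, and the four entries are hand-picked English examples rather than output from any lemmatizer.

```r
# Toy lookup table (hypothetical, for illustration only): the same spelling
# maps to a different lemma depending on its word class.
toy_lemmas <- data.frame(
  spelling   = c("meeting", "meeting", "left",      "left"),
  word_class = c("noun",    "verb",    "adjective", "verb"),
  lemma      = c("meeting", "meet",    "left",      "leave"),
  stringsAsFactors = FALSE
)

# Return the lemma for a spelling and word class, or NA if the pair is not listed.
toy_lemma <- function(spelling, word_class) {
  hit <- toy_lemmas$lemma[toy_lemmas$spelling == spelling &
                          toy_lemmas$word_class == word_class]
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

toy_lemma("meeting", "noun")  # "meeting"
toy_lemma("meeting", "verb")  # "meet"
toy_lemma("left", "verb")     # "leave"
```

Without the word class, a lemmatizer has to guess between these readings, which is why the result for an ambiguous spelling may differ from what you expect.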