Creating a corpus from many HTML files in R

I would like to create a corpus from a collection of downloaded HTML files, and then read them into R for later text mining.

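Essentially, this is what I want to do:

> Create a corpus from multiple HTML files.

I tried to use DirSource:

    library(tm)
    a <- DirSource("C:/test")
    b <- Corpus(DirSource(a), readerControl = list(language = "eng", reader = readPlain))

but it returns "invalid directory parameters".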

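> Read in the HTML files from the corpus all at once.
I am not sure how to do this.
> Parse them, convert them to plain text, and remove the tags.
Many people suggest using XML, but I could not find a way to process multiple files; the examples all handle a single file.

Thank you very much.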

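Solution

This should do it. Here I have a folder of HTML files on my computer (a random sample from SO). I build a corpus from them, then a term-document matrix, and then run a few simple text mining tasks.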

    # get data
    setwd("C:/Downloads/HTML")  # this folder has your HTML files
    html <- list.files(pattern = "\\.(htm|html)$")  # get just .htm and .html files

    # load packages
    library(tm)
    library(RCurl)
    library(XML)

    # get some code from GitHub to convert HTML to text
    writeChar(con = "htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
    source("htmlToText.R")

    # convert HTML to text
    html2txt <- lapply(html, htmlToText)

    # clean out non-ASCII characters
    html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub = ""))

    # make corpus for text mining
    corpus <- Corpus(VectorSource(html2txtclean))

    # process text...
    skipWords <- function(x) removeWords(x, stopwords("english"))
    funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
    a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
    a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3, 10)))
    newstopwords <- findFreqTerms(a.dtm1, lowfreq = 10)  # get most frequent words
    # remove most frequent words for this corpus
    a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords, ]
    inspect(a.dtm2)

    # carry on with typical things that can now be done, i.e. cluster analysis
    a.dtm3 <- removeSparseTerms(a.dtm2, sparse = 0.7)
    a.dtm.df <- as.data.frame(as.matrix(a.dtm3))
    a.dtm.df.scale <- scale(a.dtm.df)
    d <- dist(a.dtm.df.scale, method = "euclidean")
    fit <- hclust(d, method = "ward.D")  # "ward" was renamed "ward.D" in R 3.1.0
    plot(fit)
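The conversion step above depends on downloading htmlToText.R from GitHub at run time. Where that is not possible, a rough local equivalent can be sketched with the XML package, which is already loaded above. This is not the htmlToText function itself, only an approximation of the same idea. The helper name html_file_to_text, the XPath filter, and the two sample files written to a temporary folder are all assumptions made for illustration.

```r
library(XML)

# Hypothetical helper (not part of any package): extract the visible text
# of one HTML file, skipping the contents of <script> and <style> elements.
html_file_to_text <- function(path) {
  doc <- htmlParse(path)
  on.exit(free(doc))
  nodes <- xpathSApply(
    doc,
    "//text()[not(ancestor::script)][not(ancestor::style)]",
    xmlValue
  )
  # Join the text nodes and collapse runs of whitespace
  txt <- paste(unlist(nodes), collapse = " ")
  trimws(gsub("\\s+", " ", txt))
}

# Two made-up sample files in a temporary folder, so the sketch runs anywhere
demo_dir <- file.path(tempdir(), "html_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(
  "<html><head><style>p {color: red}</style></head><body><p>Hello <b>corpus</b></p></body></html>",
  file.path(demo_dir, "a.html")
)
writeLines(
  "<html><body><h1>Second file</h1><script>var x = 1;</script></body></html>",
  file.path(demo_dir, "b.htm")
)

# Apply the helper to every .htm/.html file in the folder
html_files <- list.files(demo_dir, pattern = "\\.(htm|html)$", full.names = TRUE)
texts <- vapply(html_files, html_file_to_text, character(1))
print(unname(texts))
```

The resulting character vector can be passed to VectorSource in the same way as html2txtclean. Because the helper takes one file path, lapply or vapply over list.files is all that is needed to handle a whole folder.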

    # just for fun...
    library(wordcloud)
    library(RColorBrewer)

    m <- as.matrix(t(a.dtm1))
    # get word counts in decreasing order
    word_freqs <- sort(colSums(m), decreasing = TRUE)
    # create a data frame with words and their frequencies
    dm <- data.frame(word = names(word_freqs), freq = word_freqs)
    # plot wordcloud
    wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
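As for the DirSource attempt in the question: DirSource expects a directory path, while the code there passes the already constructed source object a into a second DirSource call. That is a plausible cause of the error, although a non-existent folder would be another. A minimal sketch of the direct form follows, assuming a reasonably recent version of tm. The temporary folder and the sample files are made up and stand in for C:/test.

```r
library(tm)

# Made-up sample files in a temporary folder (stand-in for C:/test)
src_dir <- file.path(tempdir(), "dirsource_demo")
dir.create(src_dir, showWarnings = FALSE)
writeLines("<html><body><p>first page</p></body></html>",
           file.path(src_dir, "one.html"))
writeLines("<html><body><p>second page</p></body></html>",
           file.path(src_dir, "two.html"))

# Pass the directory path straight to DirSource, and only once
src <- DirSource(src_dir, pattern = "\\.(htm|html)$")
raw_corpus <- VCorpus(src, readerControl = list(reader = readPlain, language = "en"))

print(length(raw_corpus))          # number of documents read
print(content(raw_corpus[[1]]))    # raw file content; the HTML tags are still present
```

Note that readPlain reads the files as they are, so the markup stays in the documents. The tags still have to be stripped, for example by converting the files to text first and building the corpus with VectorSource as in the answer.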

Original URL: http://outofmemory.cn/web/1138034.html