How to remove a DIV tag and its content when parsing HTML with Jsoup

To remove a DIV tag and its content while parsing HTML with Jsoup:

1. Parse the HTML and extract the elements

For example:

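import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parse the file as UTF-8; the third argument is the base URI used to resolve relative links.
File input = new File("D:/test.html");
Document doc = Jsoup.parse(input, "UTF-8", "url");
Element content = doc.getElementById("content");
// getElementsByTag also returns content itself if it is a div.
Elements divs = content.getElementsByTag("div");
for (Element div : divs) {
    String divId = div.attr("id");   // id attribute of this div
    String divText = div.text();     // text inside this div
}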

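2. Call remove() on the div you want to delete; this removes the tag together with everything inside it from the document

div.remove();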
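The two steps can be combined into one small program. The following is a minimal sketch, assuming jsoup is on the classpath; the class name RemoveDivExample, the helper removeDivById, and the sample HTML with the id "ad" are made up for illustration and do not come from the original answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class RemoveDivExample {

    // Removes the <div> with the given id (tag and contents)
    // and returns the HTML that is left inside <body>.
    static String removeDivById(String html, String divId) {
        Document doc = Jsoup.parse(html);
        Element target = doc.getElementById(divId);
        // Only remove the element if it exists and really is a <div>.
        if (target != null && "div".equals(target.tagName())) {
            target.remove();
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><p>Keep me</p>"
                + "<div id=\"ad\">Remove me</div></div>";
        System.out.println(removeDivById(html, "ad"));
    }
}
```

Looking the element up by id avoids looping over every div when only one block has to go. If several divs must be removed, calling remove() inside the loop over the Elements list should also work, because the list is a snapshot and removing a node from the document does not change it.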

1. Code for stripping HTML tags with regular expressions:

using System.Text.RegularExpressions;

/// <summary>
/// Strips HTML tags.
/// </summary>
/// <param name="HTML">Source HTML</param>
/// <returns>Resulting text</returns>
public static string StripHTML(string HTML) // found by searching Google for "StripHTML"
{
    // Regexs[i] is replaced by Replaces[i], so both arrays must have the same length.
    string[] Regexs =
    {
        @"<script[^>]*?>.*?</script>",
        @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
        @"([\r\n])[\s]+",
        @"&(quot|#34);",
        @"&(amp|#38);",
        @"&(lt|#60);",
        @"&(gt|#62);",
        @"&(nbsp|#160);",
        @"&(iexcl|#161);",
        @"&(cent|#162);",
        @"&(pound|#163);",
        @"&(copy|#169);",
        @"&#(\d+);", // any remaining numeric entity is dropped
        @"-->",
        @"<!--.*\n"
    };

    string[] Replaces =
    {
        "",
        "",
        "",
        "\"",
        "&",
        "<",
        ">",
        " ",
        "\xa1", // chr(161)
        "\xa2", // chr(162)
        "\xa3", // chr(163)
        "\xa9", // chr(169)
        "",
        "\r\n",
        ""
    };

    string s = HTML;
    for (int i = 0; i < Regexs.Length; i++)
    {
        s = new Regex(Regexs[i], RegexOptions.Multiline | RegexOptions.IgnoreCase).Replace(s, Replaces[i]);
    }

    // Strings are immutable, so the result of Replace must be assigned back.
    s = s.Replace("<", "");
    s = s.Replace(">", "");
    s = s.Replace("\r\n", "");
    return s;
}
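For readers working in Java alongside Jsoup, the same table-driven idea can be sketched with java.util.regex. This is a simplified illustration, not a port of the C# code: the class name HtmlStripper, the method stripHtml, and the reduced pattern list are assumptions chosen for the example, and regular expressions like these are not a robust HTML parser.

```java
import java.util.regex.Pattern;

final class HtmlStripper {

    // Each pattern is paired with the replacement at the same index.
    private static final Pattern[] PATTERNS = {
        // Script blocks, including their contents.
        Pattern.compile("<script[^>]*?>.*?</script>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL),
        // HTML comments.
        Pattern.compile("<!--.*?-->", Pattern.DOTALL),
        // Opening and closing tags.
        Pattern.compile("</?[a-zA-Z][^>]*>"),
        // A few common entities; &amp; is decoded last so that
        // text such as "&amp;lt;" is not decoded twice.
        Pattern.compile("&(quot|#34);"),
        Pattern.compile("&(lt|#60);"),
        Pattern.compile("&(gt|#62);"),
        Pattern.compile("&(nbsp|#160);"),
        Pattern.compile("&(amp|#38);"),
    };

    private static final String[] REPLACEMENTS = {
        "", "", "", "\"", "<", ">", " ", "&"
    };

    // Applies every pattern in order and returns the trimmed result.
    static String stripHtml(String html) {
        String s = html;
        for (int i = 0; i < PATTERNS.length; i++) {
            s = PATTERNS[i].matcher(s).replaceAll(REPLACEMENTS[i]);
        }
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>Hello &amp; <b>world</b></p>"));
    }
}
```

When Jsoup is already available, calling text() on the parsed document is usually a simpler way to get the plain text, since it relies on a real parser rather than patterns.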

2. Alternatively, copy the content into a text file, save it with the .html extension, and then set it up in the browser.

Sharing is welcome. If you repost, please credit the source: 内存溢出

Original URL: https://outofmemory.cn/zaji/7334673.html
