How to get the content inside the dd tags below with htmlParser

Getting repeated-node content and single-tag content with htmlparser

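The htmlparser API can be used to extract and analyze the content of HTML pages.

This article is a short personal summary of how to use htmlparser to get the content of a particular tag node and the list content of repeated nodes. Other operations are already covered by plenty of help documents online, so they are not repeated here.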

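General approach:

1: Define filters from the org.htmlparser.filters package to pin down the tags or content range you need. Commonly used HTML filter types include AndFilter, HasAttributeFilter, HasChildFilter, HasParentFilter, LinkStringFilter, NotFilter, OrFilter and TagNameFilter.

To locate the page content more precisely, several filters can be combined, e.g.: AndFilter andFilter = new AndFilter(tagFilter, hasChildFilter);

2: Get the list of nodes that satisfy the condition with NodeList list = parser.extractAllNodesThatMatch(andFilter);

3: Loop over list. Within each iteration you can extract and analyze several different nodes, for example the content and the link of a repeated node: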

// Requires org.htmlparser.Parser, org.htmlparser.util.NodeList,
// org.htmlparser.filters.* and org.htmlparser.tags.LinkTag.
// charset is the page encoding, defined elsewhere.
for (int i = 0; i < list.size(); i++) {
    String html = list.elementAt(i).toHtml();

    // Get the content
    TagNameFilter pFilter = new TagNameFilter("p");
    HasAttributeFilter pAttributeFilter = new HasAttributeFilter("class", "sms");
    AndFilter pAndFilter = new AndFilter(pFilter, pAttributeFilter);
    Parser pParser = Parser.createParser(html, charset);
    NodeList pList = pParser.extractAllNodesThatMatch(pAndFilter);
    System.out.println("content:" + pList.elements().nextNode().toPlainTextString());

    // Get the link
    TagNameFilter aFilter = new TagNameFilter("a");
    HasChildFilter aChildFilter = new HasChildFilter(new TagNameFilter("strong"));
    AndFilter aAndFilter = new AndFilter(aFilter, aChildFilter);
    pParser.reset(); // rewind the parser so the same fragment can be scanned again
    pList = pParser.extractAllNodesThatMatch(aAndFilter);
    // System.out.println("url:" + pList.elements().nextNode().toHtml());
    LinkTag linkTag = (LinkTag) pList.elements().nextNode();
    System.out.println("url's link:" + linkTag.getLink());
    System.out.println("url's content:" + linkTag.getLinkText());
}

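At this point, all the content of the repeated list can be retrieved.

4: For the various HTML tags, htmlparser provides dedicated tag classes, each with its own tag-specific methods, so that developers can work with them more conveniently.

For example: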

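If you want to extract content from a link tag (an <a> element with class="xxx"), you can use regular expressions or string handling to get what you need. It is more convenient, though, to go through the corresponding xxxTag class, e.g.:

LinkTag linkTag = (LinkTag) pList.elements().nextNode();
System.out.println("url's link:" + linkTag.getLink());
System.out.println("url's content:" + linkTag.getLinkText());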
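For comparison, the regular-expression route mentioned above might look roughly like the following sketch. It uses only java.util.regex; the class name LinkRegexSketch, the method firstLink and the example.com URL are made up for illustration, and the pattern assumes a double-quoted href and no nested links, so it is far less robust than LinkTag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser.
class LinkRegexSketch {
    // Matches <a ... href="..." ...>text</a>; assumes a double-quoted href and no nested <a>.
    private static final Pattern LINK = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns {href, text} for the first link in html, or null if no link is found. */
    static String[] firstLink(String html) {
        Matcher m = LINK.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Drop inner markup such as <strong> so only the visible text remains.
        String text = m.group(2).replaceAll("<[^>]*>", "").trim();
        return new String[] { m.group(1), text };
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/item\" class=\"xxx\"><strong>Item</strong></a>";
        String[] link = firstLink(html);
        System.out.println("url's link:" + link[0]);
        System.out.println("url's content:" + link[1]);
    }
}
```

The pattern breaks as soon as the markup deviates from these assumptions (single quotes, unquoted attributes, line-wrapped tags inside attributes), which is the main reason to prefer the tag classes.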

This can be done with jQuery. For example:

HTML:

<div>
<p>paragraph text</p>
</div>

jQuery:

$("div p"); // get all p elements inside the div

$("p").html(); // get the content of the p element

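To get the value of an element in an HTML file with jQuery, first select it and then call the val method:

1. The HTML is as follows:

<div id="sa">
<form>
<input name="FirstName" value="zhangsan">
<div class="something">Hello world</div>
</form>
</div>

2. Use a jQuery selector to get the value:

var text = $('#sa').find('input[name="FirstName"]').val();

3. The value of FirstName is:

zhangsan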

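When web pages first became popular, there was a simple way to extract the text from HTML: take the HTML source (markup included) and delete everything from each "<" through the following ">".

For today's complex pages, however, text extracted this way is full of extra spaces, blank lines, script blocks and HTML escape sequences, so the result is poor.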
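A minimal sketch of that naive approach, assuming nothing beyond the Java standard library (the class name NaiveStripSketch and the sample strings are invented for illustration), shows both where it works and where it falls short:

```java
// Hypothetical demo class; it only illustrates the "delete everything between < and >" idea.
class NaiveStripSketch {
    /** Deletes every "<...>" span and nothing else. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup comes out fine.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
        // prints: Hello world

        // Script code and escape sequences leak into the result.
        System.out.println(stripTags("<script>var a = 1;</script><p>Tom &amp; Jerry</p>"));
        // prints: var a = 1;Tom &amp; Jerry
    }
}
```

The second call is the failure mode described above: the script body and the undecoded &amp; survive because only the tags themselves are removed.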

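The code below uses regular expressions to extract the text from HTML.

The implementation works as follows:

a. Replace all line breaks and tabs in the HTML with spaces (line breaks and runs of whitespace are not significant in HTML)

b. Remove everything inside the <head> tag

c. Remove everything inside <script> tags

d. Remove everything inside <style> tags

e. Replace td with a space, and tr, li, br, p and similar tags with line breaks

f. Remove every remaining tag delimited by "<" and ">"

g. Convert escape sequences such as &amp; and &nbsp; to the corresponding characters

h. Remove redundant spaces and blank lines

The code is as follows: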

using System;
using System.Text.RegularExpressions;

namespace KwanhongUtilities
{
    /// <summary>
    /// Summary description for HtmlToText.
    /// </summary>
    public class HtmlToText
    {
        public string Convert(string source)
        {
            string result;

            // replace line breaks and tabs with spaces
            result = source.Replace("\r", " ");
            result = result.Replace("\n", " ");
            result = result.Replace("\t", " ");

            // remove the header
            result = Regex.Replace(result, "(<head>).*?(</head>)", string.Empty, RegexOptions.IgnoreCase);

            // remove all scripts (attributes are cleared first so the opening tag is a bare <script>)
            result = Regex.Replace(result, @"<( )*script([^>])*>", "<script>", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"(<script>).*?(</script>)", string.Empty, RegexOptions.IgnoreCase);

            // remove all styles
            result = Regex.Replace(result, @"<( )*style([^>])*>", "<style>", RegexOptions.IgnoreCase); // clearing attributes
            result = Regex.Replace(result, "(<style>).*?(</style>)", string.Empty, RegexOptions.IgnoreCase);

            // insert a space in place of <td> tags
            result = Regex.Replace(result, @"<( )*td([^>])*>", " ", RegexOptions.IgnoreCase);

            // insert line breaks in place of <br> and <li> tags
            result = Regex.Replace(result, @"<( )*br( )*>", "\r", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"<( )*li( )*>", "\r", RegexOptions.IgnoreCase);

            // insert paragraph breaks in place of <tr> and <p> tags
            result = Regex.Replace(result, @"<( )*tr([^>])*>", "\r\r", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"<( )*p([^>])*>", "\r\r", RegexOptions.IgnoreCase);

            // remove anything that is enclosed inside < >
            result = Regex.Replace(result, @"<[^>]*>", string.Empty, RegexOptions.IgnoreCase);

            // replace special characters
            result = Regex.Replace(result, @"&amp;", "&", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&nbsp;", " ", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&lt;", "<", RegexOptions.IgnoreCase);
            result = Regex.Replace(result, @"&gt;", ">", RegexOptions.IgnoreCase);
            // drop any other escape sequence
            result = Regex.Replace(result, @"&(.{2,6});", string.Empty, RegexOptions.IgnoreCase);

            // remove extra spaces and line breaks
            result = Regex.Replace(result, @"( )+", " ");
            result = Regex.Replace(result, "(\r)( )+(\r)", "\r\r");
            result = Regex.Replace(result, @"(\r\r)+", "\r\n");

            return result;
        }
    }//end class
}//end namespace
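Since the rest of this article works in Java, steps a to h can also be sketched with java.util.regex. This is a rough adaptation under stated assumptions rather than a port of the code above: the names HtmlToTextSketch and convert are invented, head/script/style are handled by a single pattern with a backreference, line breaks are emitted as "\n", and only four escape sequences are decoded.

```java
import java.util.regex.Pattern;

// Hypothetical class; a simplified take on steps a-h, not a full HTML-to-text converter.
class HtmlToTextSketch {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    private static String replace(String input, String regex, String replacement) {
        return Pattern.compile(regex, FLAGS).matcher(input).replaceAll(replacement);
    }

    static String convert(String source) {
        // a: flatten line breaks and tabs to spaces
        String result = source.replaceAll("[\\r\\n\\t]", " ");
        // b-d: drop head, script and style blocks together with their content
        result = replace(result, "<\\s*(head|script|style)\\b[^>]*>.*?<\\s*/\\s*\\1\\s*>", "");
        // e: cells become spaces; rows, list items, breaks and paragraphs become line breaks
        result = replace(result, "<\\s*td\\b[^>]*>", " ");
        result = replace(result, "<\\s*(tr|li|br|p)\\b[^>]*>", "\n");
        // f: remove every remaining tag
        result = replace(result, "<[^>]*>", "");
        // g: decode a few common escape sequences (&amp; last, to avoid decoding twice)
        result = result.replace("&nbsp;", " ").replace("&lt;", "<")
                       .replace("&gt;", ">").replace("&amp;", "&");
        // h: collapse runs of spaces and blank lines
        result = result.replaceAll(" +", " ");
        result = result.replaceAll(" ?\\n ?", "\n");
        result = result.replaceAll("\\n{2,}", "\n");
        return result.trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head><body>"
                + "<script>var a = 1;</script><p>Tom &amp; Jerry</p>"
                + "<ul><li>one</li><li>two</li></ul></body></html>";
        System.out.println(convert(html));
        // prints three lines: "Tom & Jerry", "one", "two"
    }
}
```

Decoding &amp; last is a deliberate choice in this sketch: text such as &amp;lt; then comes out as the literal &lt; instead of being decoded a second time into "<".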


Feel free to share; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/web/9559867.html
