How do I filter out HTML code? _Essay

This function removes HTML markup from the text of a forum post. Each line is explained in its comment:

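Function KillHTMLLabel(str) ' start of the function
    Dim n, m, str2 ' declare three variables
    n = InStr(str, "<") ' position of the first "<"
    m = InStr(str, ">") ' position of the first ">"
    str2 = str ' copy str into str2
    ' n > 0 means a "<" was found; n < m means it lies to the left of ">",
    ' so the text between the two is HTML markup and must be removed
    Do While n > 0 And n < m
        str2 = Left(str2, n - 1) & Mid(str2, m + 1) ' join the text left of "<" with the text right of ">"
        n = InStr(str2, "<") ' position of the first "<" in the remaining string
        m = InStr(str2, ">") ' position of the first ">" in the remaining string
    Loop ' repeat
    KillHTMLLabel = str2 ' return the filtered string
End Function ' end of the function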

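This is only a simple function.

It cannot filter a string such as ><html>. On the first pass the first ">" (position 1) comes before the first "<" (position 2), so the condition n < m is false, execution jumps past Loop, and the HTML tag in the string is never removed.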
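One way around this limitation is to look for ">" only from the position of the "<" that was just found. The following is a minimal JavaScript sketch of that idea, not the original author's code; the name killHtmlLabel is a hypothetical port of the function above.

```javascript
// Hypothetical helper (not from the original post): a scan-based tag stripper
// that searches for ">" only after the position of the current "<".
function killHtmlLabel(str) {
    let result = String(str);
    let open = result.indexOf("<");
    // Start the search for ">" at the "<" just found, so that a stray ">"
    // earlier in the string cannot end the loop prematurely.
    let close = open === -1 ? -1 : result.indexOf(">", open);
    while (open !== -1 && close !== -1) {
        // Keep the text before "<" and the text after ">".
        result = result.slice(0, open) + result.slice(close + 1);
        open = result.indexOf("<");
        close = open === -1 ? -1 : result.indexOf(">", open);
    }
    return result;
}

console.log(killHtmlLabel("><html>"));          // prints ">"
console.log(killHtmlLabel("<b>bold</b> text")); // prints "bold text"
```

A "<" with no ">" after it is left untouched, so plain text such as "a < b" passes through unchanged.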

A better approach is to filter with regular expressions.

The code for filtering HTML tags is as follows:

// Requires at the top of the file: using System.Text.RegularExpressions;
public string checkStr(string html)
{
    RegexOptions options = RegexOptions.IgnoreCase;
    // Lazy quantifiers (*?) stop at the nearest closing tag instead of the last one.
    Regex regex1 = new Regex(@"<script[\s\S]*?</script *>", options);
    Regex regex2 = new Regex(@" href *= *[\s\S]*?script *:", options);
    Regex regex3 = new Regex(@" on\w+ *=", options);
    Regex regex4 = new Regex(@"<iframe[\s\S]*?</iframe *>", options);
    Regex regex5 = new Regex(@"<frameset[\s\S]*?</frameset *>", options);
    Regex regex6 = new Regex(@"<img[^>]+>", options);
    Regex regex7 = new Regex(@"</p>", options);
    Regex regex8 = new Regex(@"<p>", options);
    Regex regex9 = new Regex(@"<[^>]*>", options);

    html = regex1.Replace(html, ""); // remove <script>...</script> blocks
    html = regex2.Replace(html, ""); // remove href=javascript: attributes (<a>)
    html = regex3.Replace(html, " _disibledevent="); // neutralize on... event attributes
    html = regex4.Replace(html, ""); // remove iframe blocks
    html = regex5.Replace(html, ""); // remove frameset blocks
    html = regex6.Replace(html, ""); // remove <img> tags
    html = regex7.Replace(html, ""); // remove </p>
    html = regex8.Replace(html, ""); // remove <p>
    html = regex9.Replace(html, ""); // remove every remaining tag
    // Removes every space character; if the goal was the entity, the literal would be "&nbsp;".
    html = html.Replace(" ", "");
    html = html.Replace("</strong>", ""); // no-op after regex9, kept from the original
    html = html.Replace("<strong>", "");
    return html;
}
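With a pattern such as <script[\s\S]+</script *>, the choice of quantifier decides how much text is removed when a page contains more than one script block. The JavaScript sketch below compares a greedy and a lazy variant; the sample string and the names sample, greedy, and lazy are made up for illustration.

```javascript
// Illustrative input: two script blocks with ordinary content between them.
const sample = "<script>a()</script><p>kept</p><script>b()</script>";

// Greedy: [\s\S]+ runs to the LAST </script>, so the paragraph in between is lost.
const greedy = /<script[\s\S]+<\/script *>/gi;

// Lazy: [\s\S]*? stops at the FIRST </script> after each <script.
const lazy = /<script[\s\S]*?<\/script *>/gi;

console.log(JSON.stringify(sample.replace(greedy, ""))); // prints ""
console.log(JSON.stringify(sample.replace(lazy, "")));   // prints "<p>kept</p>"
```

The same distinction applies to the iframe and frameset patterns, which follow the same shape.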

How to remove all script elements from HTML with a regular expression:

1. Define the regular expression:

/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi

2. Use the regular expression to process the scripts as follows:

<html>
<head>
<!-- jQuery must be loaded before this script runs -->
<script>
$(document).ready(function(){
    $(".btn1").click(function(){
        alert($("p").html());
    });
});
</script>
</head>
<body>
<p>This is a paragraph.</p>
<button class="btn1" onclick="removeAllScript()">Remove scripts</button>
</body>
</html>

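function removeAllScript(obj) {
    // Everything between <script> and </script>, tags included, is removed.
    var SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;
    // obj is expected to hold the HTML text to clean.
    var text = String(obj);
    // Repeat until no script block is left, because removing one block can expose another.
    while (SCRIPT_REGEX.test(text)) {
        text = text.replace(SCRIPT_REGEX, ""); // replace each match with an empty string
    }
    return text;
}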
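The while loop in removeAllScript matters because a single replacement pass can leave behind a new script block assembled from the pieces around a removed one. The sketch below illustrates this with the same regular expression; removeScriptsOnce, removeScriptsRepeatedly, and the tricky input are hypothetical names and data chosen for this example.

```javascript
const SCRIPT_REGEX = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// A single pass: each match found in the input is removed exactly once.
function removeScriptsOnce(html) {
    return html.replace(SCRIPT_REGEX, "");
}

// Repeated passes: keep replacing until the text stops changing.
function removeScriptsRepeatedly(html) {
    let text = html;
    let previous;
    do {
        previous = text;
        text = text.replace(SCRIPT_REGEX, "");
    } while (text !== previous);
    return text;
}

// Removing the inner block joins "<scr" and "ipt>" into a fresh <script> tag.
const tricky = "<scr<script>x</script>ipt>alert(1)</script>";

console.log(removeScriptsOnce(tricky));                       // prints <script>alert(1)</script>
console.log(JSON.stringify(removeScriptsRepeatedly(tricky))); // prints ""
```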

Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/7196105.html
