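using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Inside a verbatim string (@"..."), "" stands for one literal double quote.
        string s = @"<Body>
<div>Text A to extract <img src=""""/>Text B to extract <a href="""">Link text, not extracted </a>Text C to extract </div>
</Body>";
        // Group 1: tag name (closing tags keep their "/"); group 2: the text up to the next "<".
        Regex regex = new Regex("(/?\\w+)[^>]*>([^<]*)<", RegexOptions.IgnoreCase);
        MatchCollection ms = regex.Matches(s);
        foreach (Match m in ms)
        {
            string tagName = m.Groups[1].Value.ToLower();
            string text = m.Groups[2].Value.Trim();
            // Skip text that directly follows an opening <a> tag, and skip blank runs.
            if (tagName != "a" && text.Length > 0)
                Console.WriteLine(text);
        }
    }
}
Output:
Text A to extract
Text B to extract
Text C to extract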
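The same tag-aware idea carries over to Java, the language used in the other answers here. The following is only a minimal sketch: the class name `TagAwareText` and the method name `extract` are made up for illustration, and the regular expression is the one from the answer above. Because it is a regex rather than an HTML parser, it only skips text that directly follows an opening `<a>` tag; link text wrapped in further nested tags would still come through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class TagAwareText {
    // Group 1: tag name (closing tags keep their "/"); group 2: text up to the next "<".
    private static final Pattern TAG_THEN_TEXT = Pattern.compile("(/?\\w+)[^>]*>([^<]*)<");

    /** Returns the non-blank text runs, skipping text that directly follows an opening a tag. */
    static List<String> extract(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG_THEN_TEXT.matcher(html);
        while (m.find()) {
            String tagName = m.group(1);
            String text = m.group(2).trim();
            if (!tagName.equalsIgnoreCase("a") && !text.isEmpty()) {
                out.add(text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<body>\n<div>Text A <img src=\"\"/>Text B "
                + "<a href=\"\">link text </a>Text C </div>\n</body>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

Returning a list instead of printing inside the loop keeps the filtering rule in one place and makes the result easy to check or post-process.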
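// I threw together a small utility class; getRegexData is the method you asked about, and you can adapt it to your needs with minor changes.
// Because I read from a URL rather than using HttpClient, the whole page is fetched at once; adjust that as needed. Ask again if anything is unclear.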
package com.wdy.util;

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Utility class.
 * @author WDY
 */
public class Tool {
    public static void main(String[] args) {
        // Quick check against an inline sample.
        System.out.println(getRegexData("<img[ ]*src.*?jpg\"", "<img src=\"img1.jpg\"><img src=\"img2.jpg\""));
        try {
            URL url = new URL("http://www.baidu.com");
            String stringData;
            // Close the network stream once the page has been read.
            try (InputStream in = url.openStream()) {
                stringData = getStringFromInputStream(in);
            }
            System.out.println(stringData + "----------------------------------------");
            System.out.println();
            System.out.println(getRegexData("http://.{6,70}?(png|jpg)", stringData));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Returns every substring of data that matches the regular expression.
     *
     * @param regex the regular expression to search for
     * @param data  the text to search
     * @return the matches, in order of appearance
     */
    public static List<String> getRegexData(String regex, String data) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(data);
        List<String> resultList = new ArrayList<String>();
        // find() resumes right after the previous match, so no manual index is needed.
        // Advancing an index by the match length alone ignores the text between
        // matches and can return the same match more than once.
        while (matcher.find()) {
            resultList.add(matcher.group());
        }
        return resultList;
    }

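    /**
     * Reads an input stream fully and decodes it as a UTF-8 string.
     *
     * @param is the stream to read
     * @return the stream contents as a string
     */
    public static String getStringFromInputStream(InputStream is) throws IOException {
        // Buffer the raw bytes and decode once at the end, so a multi-byte UTF-8
        // character split across two reads is not corrupted.
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        byte[] buff = new byte[1024 * 8];
        int len;
        while ((len = is.read(buff)) != -1) {
            out.write(buff, 0, len);
        }
        return out.toString("utf-8");
    }
}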
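When walking through matches with an explicit index, note that the offset passed to `Matcher.find(int)` is an absolute position in the input. It should therefore come from `matcher.end()`, not from a running total of match lengths, which falls behind as soon as there is any text between matches. The sketch below shows both the plain `find()` loop and an index-based variant; the class and method names (`MatchPositions`, `findAll`, `findAllFrom`) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named for this example only.
class MatchPositions {
    /** Collects every match by letting the matcher resume after the previous match. */
    static List<String> findAll(String regex, String data) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(data);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    /** Index-based variant: resume from matcher.end(), an absolute offset into data. */
    static List<String> findAllFrom(String regex, String data) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(data);
        int index = 0;
        // find(int) throws if the index is past the end of the input, so guard it.
        while (index <= data.length() && matcher.find(index)) {
            result.add(matcher.group());
            // Step past empty matches so the loop always advances.
            index = matcher.end() > matcher.start() ? matcher.end() : matcher.end() + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "some leading text <img src=\"a.jpg\"> more text <img src=\"b.jpg\">";
        String regex = "<img[ ]*src.*?jpg\"";
        System.out.println(findAll(regex, data));
        System.out.println(findAllFrom(regex, data));
    }
}
```

With the sample input, the first match starts at offset 18, so an index advanced only by the match length (16) would still point before that match and find it again. For most uses the plain `find()` loop is the simpler choice.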
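Extract only Rufus and Jenny? I don't think that works; there is no pattern to go on. You probably mean extracting all the text inside the tags.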
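If you want the text inside the tags, write it like this:
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
// Capture any non-empty run of text between a ">" and the next "<".
Pattern pattern = Pattern.compile(">([^<]+)<");
Matcher matcher = pattern.matcher(
        "<p><strong><br>Rufus</strong><br>Dan, Jenny! Over here!</p>"
        + "<p><strong>Jenny</strong><br>Hey, dad!</p>"
        + "<p><strong>Rufus</strong><br>Hey, hey! You made it. Welcome back! "
        + "How was your weekend? How was your mom?</p>");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}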
Output:
Rufus
Dan, Jenny! Over here!
Jenny
Hey, dad!
Rufus
Hey, hey! You made it. Welcome back! How was your weekend? How was your mom?
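Real pages usually contain line breaks and indentation between tags, and `>([^<]+)<` captures those whitespace runs as well. A possible refinement, sketched here under that assumption, trims each capture, collapses internal whitespace, and drops entries that end up empty. The names `TagTextLines` and `extract` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class TagTextLines {
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">([^<]+)<");

    /** Returns the text between tags, one entry per run, without blank entries. */
    static List<String> extract(String html) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = BETWEEN_TAGS.matcher(html);
        while (matcher.find()) {
            // Trim the ends, then collapse line breaks and repeated spaces to one space.
            String text = matcher.group(1).trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<p><strong><br>Rufus</strong><br>Dan,\nJenny! Over here!\n</p>\n"
                + "<p><strong>Jenny</strong><br>Hey, dad!\n</p>";
        for (String line : extract(html)) {
            System.out.println(line);
        }
    }
}
```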
Please accept my answer. (*^__^*) Hehe...