Java中支持的爬虫框架有很多,比如WebMagic、Spider、Jsoup等。
Jsoup拥有十分方便的api来处理html文档,比如参考了DOM对象的文档遍历方法,参考了CSS选择器的用法等等,因此我们可以使用Jsoup快速地掌握爬取页面数据的技巧。
2、快速开始
1)分析HTML页面,明确哪些数据是需要抓取的
2)使用HttpClient读取HTML页面
HttpClient是一个处理Http协议数据的工具,使用它可以将HTML页面作为输入流读进java程序中.
3)使用Jsoup解析html字符串
通过引入Jsoup工具,直接调用parse方法来解析一个描述html页面内容的字符串来获得一个Document对象。该Document对象以 *** 作DOM树的方式来获得html页面上指定的内容。
3、保存爬取的页面数据
1)保存普通数据到数据库中
将爬取的数据封装进实体Bean中,并存到数据库内。
2)保存图片到服务器上
直接通过下载图片的方式将图片保存到服务器本地。
ResultSet rs = nulltry {
rs=conn.executeQuery(sql)
while(rs.next()){
id=rs.getInt("id")
}
} catch (Exception e) {
e.printStackTrace()
}finally{
rs.close()
conn.close()
}
台风的编号和名称直接在源码中有,但时间和地理位置我只能跟踪到function totf(tfbh){
location.href( "Typhoon.aspx?id="+tfbh)
}
数据需要从aspx中拿到的,应该是存放到数据库的,页面上是拿不到的
我认为可以通过循环模拟发送请求Typhoon.aspx?id="+XXX,然后通过解析response包的方式可以获得详细的信息
下面一个页面是讲模拟发送请求的
http://tidus2005.javaeye.com/blog/195544
希望对你有帮助
我写了一段获得一组数据的代码
//get Typhoon content by param
public static String getTyphoon(String param) {
URL url = null
try {
url = new URL(param)
} catch (MalformedURLException e) {
e.printStackTrace()
}
HttpURLConnection connection = null
InputStream is = null
try {
connection = (HttpURLConnection) url.openConnection()
is = connection.getInputStream()
} catch (IOException e) {
e.printStackTrace()
}
BufferedInputStream bis = new BufferedInputStream(is)
int len = 0
byte[] buf_all = new byte[0]
try {
while (true) {
byte[] buf1 = new byte[4096]
byte[] buf2 = buf_all
len = bis.read(buf1)
if(len <= 0){
break
}
buf_all = new byte[len+buf2.length]
System.arraycopy(buf2, 0, buf_all, 0, buf2.length)
System.arraycopy(buf1, 0, buf_all, buf2.length, len)
}
} catch (IOException e) {
e.printStackTrace()
}
String content = null
try {
content = new String(buf_all, "utf-8")
} catch (UnsupportedEncodingException e) {
e.printStackTrace()
}
int startIndex = content.indexOf("var ary0=")+9
content = content.substring(startIndex)
int endIndex = content.indexOf("var aryyb0=")
content = content.substring(0, endIndex)
return content
}
得到的结果是这样的:
[['200906','2009-07-19 20:00:00','23.8','109.6','','15','','','260','','','54440','','莫拉菲','Molave','7'],
['200906','2009-07-19 15:00:00','23.5','111','993','15','25','西北西','260','','','54439','','莫拉菲','Molave','7'],
['200906','2009-07-19 14:00:00','23.3','111.2','','18','','','260','','','54438','','莫拉菲','Molave','8'],
['200906','2009-07-19 13:00:00','23.3','111.5','990','18','25','西北西','260','','','54437','','莫拉菲','Molave','8'],
['200906','2009-07-19 12:00:00','23.2','111.8','990','18','25','西北西','260','','','54436','','莫拉菲','Molave','8'],
['200906','2009-07-19 11:00:00','23.2','112.1','987','18','25','西北西','260','','','54435','','莫拉菲','Molave','8'],
['200906','2009-07-19 10:00:00','23.2','112.4','987','18','25','西北西','260','','','54434','','莫拉菲','Molave','8'],
['200906','2009-07-19 09:00:00','23','112.6','987','20','25','西北西','260','','','54433','','莫拉菲','Molave','8'],
['200906','2009-07-19 08:00:00','22.9','112.9','987','20','','','260','','','54432','','莫拉菲','Molave','8'],
['200906','2009-07-19 07:00:00','22.9','113.2','985','23','25','西北西','260','','','54431','','莫拉菲','Molave','9'],
['200906','2009-07-19 06:00:00','22.8','113.4','982','25','25','西北西','260','','','54430','','莫拉菲','Molave','10'],
['200906','2009-07-19 05:00:00','22.7','113.7','980','28','25','西北西','260','','','54429','','莫拉菲','Molave','10'],
['200906','2009-07-19 04:00:00','22.7','114','975','30','25','西北西','260','','','54428','','莫拉菲','Molave','11'],
['200906','2009-07-19 03:00:00','22.7','114.2','975','33','25','西北偏西','260','80','','54426','','莫拉菲','Molave','12'],
['200906','2009-07-19 02:00:00','22.6','114.5','','35','','','260','80','','54425','','莫拉菲','Molave','12'],
['200906','2009-07-19 01:00:00','22.5','114.5','970','35','28','西北西','260','80','','54424','','莫拉菲','Molave','12'],
['200906','2009-07-19 00:00:00','22.5','114.8','965','38','28','西北西','260','80','','54423','','莫拉菲','Molave','13'],
['200906','2009-07-18 23:00:00','22.4','115.1','','38','','','260','80','','54422','','莫拉菲','Molave','13'],
['200906','2009-07-18 22:00:00','22.3','115.5','965','38','25','西北西','260','80','','54421','','莫拉菲','Molave','13'],
['200906','2009-07-18 21:00:00','22.2','115.7','965','38','25','西北西','260','80','','54420','','莫拉菲','Molave','13'],
['200906','2009-07-18 20:00:00','22.2','116','','35','','','260','80','','54419','','莫拉菲','Molave','12'],
['200906','2009-07-18 19:00:00','22.2','116.2','970','35','25','西北偏西','260','80','','54418','','莫拉菲','Molave','12'],
['200906','2009-07-18 18:00:00','22.1','116.5','970','35','25','西北偏西','260','80','','54417','','莫拉菲','Molave','12'],
['200906','2009-07-18 17:00:00','22','116.7','970','35','25','西北西','260','80','','54416','','莫拉菲','Molave','12'],
['200906','2009-07-18 16:00:00','21.9','116.9','970','35','25','西北偏西','260','80','','54415','','莫拉菲','Molave','12'],
['200906','2009-07-18 15:00:00','21.8','117.1','970','35','25','西北偏西','260','80','','54414','','莫拉菲','Molave','12'],
['200906','2009-07-18 14:00:00','21.7','117.2','970','35','25','西北西','260','80','','54413','','莫拉菲','Molave','12'],
['200906','2009-07-18 13:00:00','21.7','117.4','970','35','25','西北西','260','80','','54412','','莫拉菲','Molave','12'],
['200906','2009-07-18 12:00:00','21.6','117.5','975','33','25','西北西','260','80','','54411','','莫拉菲','Molave','12'],
['200906','2009-07-18 11:00:00','21.6','117.7','975','33','25','西北西','260','80','','54410','','莫拉菲','Molave','12'],
['200906','2009-07-18 10:00:00','21.6','117.9','975','33','25','西北西','260','80','','54409','','莫拉菲','Molave','12'],
['200906','2009-07-18 09:00:00','21.5','118.2','975','33','25','西北西','260','80','','54408','','莫拉菲','Molave','12'],
['200906','2009-07-18 08:00:00','21.4','118.3','975','33','25','西北偏西','260','80','','54407','','莫拉菲','Molave','12'],
['200906','2009-07-18 07:00:00','21.4','118.5','975','33','25','西北西','260','80','','54406','','莫拉菲','Molave','12'],
['200906','2009-07-18 06:00:00','21.3','118.7','975','33','25','西北西','260','80','','54405','','莫拉菲','Molave','12'],
['200906','2009-07-18 05:00:00','21.2','119','975','33','','','260','60','','54404','','莫拉菲','Molave','12'],
['200906','2009-07-18 04:00:00','21.2','119.2','978','30','25','西北西','260','60','','54403','','莫拉菲','Molave','11'],
['200906','2009-07-18 03:00:00','21.1','119.4','978','30','25','西北偏西','260','60','','54402','','莫拉菲','Molave','11'],
['200906','2009-07-18 02:00:00','21','119.6','978','30','','','260','60','','54401','','莫拉菲','Molave','11'],
['200906','2009-07-18 01:00:00','21','120.1','978','30','25','西北偏西','260','60','','54400','','莫拉菲','Molave','11'],
['200906','2009-07-18 00:00:00','20.9','120.3','978','30','25','西北偏西','260','60','','54399','','莫拉菲','Molave','11'],
['200906','2009-07-17 23:00:00','20.8','120.5','978','30','20','西北偏西','260','60','','54398','','莫拉菲','Molave','11'],
['200906','2009-07-17 22:00:00','20.7','121','978','30','20','西北偏西','260','60','','54397','','莫拉菲','Molave','11'],
['200906','2009-07-17 21:00:00','20.7','121.2','978','30','20','西北偏西','260','60','','54396','','莫拉菲','Molave','11'],
['200906','2009-07-17 20:00:00','20.6','121.5','978','30','20','西北偏西','260','60','','54395','','莫拉菲','Molave','11'],
['200906','2009-07-17 19:00:00','20.4','121.8','980','28','20','西北西','260','60','','54394','','莫拉菲','Molave','10'],
['200906','2009-07-17 18:00:00','20.3','121.9','980','28','20','西北偏西','260','60','','54393','','莫拉菲','Molave','10'],
['200906','2009-07-17 17:00:00','20.2','122.1','980','28','20','西北偏西','200','50','','54392','','莫拉菲','Molave','10'],
['200906','2009-07-17 14:00:00','19.5','122.7','','25','','','200','50','','54391','','莫拉菲','Molave','10'],
['200906','2009-07-17 11:00:00','18.9','123.3','985','25','15','西北','200','50','','54390','','莫拉菲','Molave','10'],
['200906','2009-07-17 08:00:00','18.6','123.6','994','20','','','100','','','54389','','莫拉菲','Molave','8'],
['200906','2009-07-17 05:00:00','18.4','123.9','996','18','15','西北','100','','','54388','','莫拉菲','Molave','8'],
['200906','2009-07-17 02:00:00','17.9','124.1','996','18','15','西北','50','','','54387','','莫拉菲','Molave','8'],
['200906','2009-07-16 23:00:00','17.6','124.6','996','18','15','西北','','','','54386','','莫拉菲','Molave','8'],
['200906','2009-07-16 20:00:00','17.4','124.7','996','18','','','','','','54385','','莫拉菲','Molave','8']]
再下去字符串的拆分实在是太复杂了,不想写了
使用时只要参数为http://www.wztf121.com/Typhoon.aspx?id=
id后是台风的代码号,写一个循环就可以了
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)