Corpora that can be used for TF-IDF (Python learning)


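Hello, I recommend the CRAFT corpus.

CRAFT (Colorado Richly Annotated Full-Text) is a corpus of 97 full-text biomedical journal articles that are openly available. The articles are annotated in detail, both semantically and syntactically, as a biomedical research resource for the natural language processing (NLP) community. All biological entities in these 97 articles are identified against nine commonly used biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology (ChEBI), the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, Entrez Gene database entries, and the three sub-ontologies of the Gene Ontology. CRAFT has been widely used to benchmark text-mining tools, and it can of course also serve as a corpus for TF-IDF.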

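TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical method for evaluating how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, and decreases as the word becomes more frequent across the corpus. Variants of TF-IDF weighting are often used by search engines as a measure or ranking of how relevant a document is to a user's query.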
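The weighting described above can be sketched in a few lines. This is a minimal illustration, assuming the plain variant tf × log(N / df) with no smoothing; many other variants exist. The class name TfIdfSketch and the toy corpus are made up for the example, and documents are assumed to be already split into words.

```java
import java.util.List;

class TfIdfSketch {
    /** Term frequency: occurrences of the term divided by the document length. */
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) return 0.0;
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** Inverse document frequency: log(N / number of documents containing the term). */
    static double idf(List<List<String>> corpus, String term) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (df == 0) return 0.0; // term never occurs: treat its weight as 0 (an assumption)
        return Math.log((double) corpus.size() / df);
    }

    static double tfIdf(List<List<String>> corpus, List<String> doc, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        // Hypothetical toy corpus of three pre-tokenized documents
        List<List<String>> corpus = List.of(
                List.of("gene", "protein", "cell", "gene"),
                List.of("cell", "membrane", "protein"),
                List.of("ontology", "annotation", "cell"));
        for (String term : List.of("gene", "protein", "cell")) {
            System.out.printf("%s -> %.4f%n", term, tfIdf(corpus, corpus.get(0), term));
        }
    }
}
```

In this toy corpus "cell" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while "gene", which occurs only in the first document, gets the highest weight there.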

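The typhoon numbers and names are right there in the page source, but for the time and geographic position I could only trace as far as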

function totf(tfbh){
    location.href( "Typhoon.aspx?id="+tfbh)
}

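That data has to be fetched from the aspx page. It is presumably stored in a database and cannot be read from the listing page itself.

I think you can loop, simulating requests to Typhoon.aspx?id=XXX, and then parse each response to get the detailed information.

The following page explains how to simulate sending requests: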

http://tidus2005.javaeye.com/blog/195544

Hope this helps.

I wrote some code that fetches one set of data:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Get typhoon content by param (the full page URL).
// I/O errors are propagated instead of being printed and ignored.
public static String getTyphoon(String param) throws IOException {
    URL url = new URL(param);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Read the whole response body into memory
    try (InputStream is = connection.getInputStream()) {
        byte[] buf = new byte[4096];
        int len;
        while ((len = is.read(buf)) != -1) {
            out.write(buf, 0, len);
        }
    } finally {
        connection.disconnect();
    }
    String content = new String(out.toByteArray(), StandardCharsets.UTF_8);
    // The track data sits between "var ary0=" and "var aryyb0=" in the page script
    int startIndex = content.indexOf("var ary0=");
    int endIndex = content.indexOf("var aryyb0=", startIndex);
    if (startIndex < 0 || endIndex < 0) {
        throw new IOException("track data markers not found in response");
    }
    return content.substring(startIndex + 9, endIndex);
}

The result looks like this:

[['200906','2009-07-19 20:00:00','23.8','109.6','','15','','','260','','','54440','','莫拉菲','Molave','7'],

['200906','2009-07-19 15:00:00','23.5','111','993','15','25','西北西','260','','','54439','','莫拉菲','Molave','7'],

['200906','2009-07-19 14:00:00','23.3','111.2','','18','','','260','','','54438','','莫拉菲','Molave','8'],

['200906','2009-07-19 13:00:00','23.3','111.5','990','18','25','西北西','260','','','54437','','莫拉菲','Molave','8'],

['200906','2009-07-19 12:00:00','23.2','111.8','990','18','25','西北西','260','','','54436','','莫拉菲','Molave','8'],

['200906','2009-07-19 11:00:00','23.2','112.1','987','18','25','西北西','260','','','54435','','莫拉菲','Molave','8'],

['200906','2009-07-19 10:00:00','23.2','112.4','987','18','25','西北西','260','','','54434','','莫拉菲','Molave','8'],

['200906','2009-07-19 09:00:00','23','112.6','987','20','25','西北西','260','','','54433','','莫拉菲','Molave','8'],

['200906','2009-07-19 08:00:00','22.9','112.9','987','20','','','260','','','54432','','莫拉菲','Molave','8'],

['200906','2009-07-19 07:00:00','22.9','113.2','985','23','25','西北西','260','','','54431','','莫拉菲','Molave','9'],

['200906','2009-07-19 06:00:00','22.8','113.4','982','25','25','西北西','260','','','54430','','莫拉菲','Molave','10'],

['200906','2009-07-19 05:00:00','22.7','113.7','980','28','25','西北西','260','','','54429','','莫拉菲','Molave','10'],

['200906','2009-07-19 04:00:00','22.7','114','975','30','25','西北西','260','','','54428','','莫拉菲','Molave','11'],

['200906','2009-07-19 03:00:00','22.7','114.2','975','33','25','西北偏西','260','80','','54426','','莫拉菲','Molave','12'],

['200906','2009-07-19 02:00:00','22.6','114.5','','35','','','260','80','','54425','','莫拉菲','Molave','12'],

['200906','2009-07-19 01:00:00','22.5','114.5','970','35','28','西北西','260','80','','54424','','莫拉菲','Molave','12'],

['200906','2009-07-19 00:00:00','22.5','114.8','965','38','28','西北西','260','80','','54423','','莫拉菲','Molave','13'],

['200906','2009-07-18 23:00:00','22.4','115.1','','38','','','260','80','','54422','','莫拉菲','Molave','13'],

['200906','2009-07-18 22:00:00','22.3','115.5','965','38','25','西北西','260','80','','54421','','莫拉菲','Molave','13'],

['200906','2009-07-18 21:00:00','22.2','115.7','965','38','25','西北西','260','80','','54420','','莫拉菲','Molave','13'],

['200906','2009-07-18 20:00:00','22.2','116','','35','','','260','80','','54419','','莫拉菲','Molave','12'],

['200906','2009-07-18 19:00:00','22.2','116.2','970','35','25','西北偏西','260','80','','54418','','莫拉菲','Molave','12'],

['200906','2009-07-18 18:00:00','22.1','116.5','970','35','25','西北偏西','260','80','','54417','','莫拉菲','Molave','12'],

['200906','2009-07-18 17:00:00','22','116.7','970','35','25','西北西','260','80','','54416','','莫拉菲','Molave','12'],

['200906','2009-07-18 16:00:00','21.9','116.9','970','35','25','西北偏西','260','80','','54415','','莫拉菲','Molave','12'],

['200906','2009-07-18 15:00:00','21.8','117.1','970','35','25','西北偏西','260','80','','54414','','莫拉菲','Molave','12'],

['200906','2009-07-18 14:00:00','21.7','117.2','970','35','25','西北西','260','80','','54413','','莫拉菲','Molave','12'],

['200906','2009-07-18 13:00:00','21.7','117.4','970','35','25','西北西','260','80','','54412','','莫拉菲','Molave','12'],

['200906','2009-07-18 12:00:00','21.6','117.5','975','33','25','西北西','260','80','','54411','','莫拉菲','Molave','12'],

['200906','2009-07-18 11:00:00','21.6','117.7','975','33','25','西北西','260','80','','54410','','莫拉菲','Molave','12'],

['200906','2009-07-18 10:00:00','21.6','117.9','975','33','25','西北西','260','80','','54409','','莫拉菲','Molave','12'],

['200906','2009-07-18 09:00:00','21.5','118.2','975','33','25','西北西','260','80','','54408','','莫拉菲','Molave','12'],

['200906','2009-07-18 08:00:00','21.4','118.3','975','33','25','西北偏西','260','80','','54407','','莫拉菲','Molave','12'],

['200906','2009-07-18 07:00:00','21.4','118.5','975','33','25','西北西','260','80','','54406','','莫拉菲','Molave','12'],

['200906','2009-07-18 06:00:00','21.3','118.7','975','33','25','西北西','260','80','','54405','','莫拉菲','Molave','12'],

['200906','2009-07-18 05:00:00','21.2','119','975','33','','','260','60','','54404','','莫拉菲','Molave','12'],

['200906','2009-07-18 04:00:00','21.2','119.2','978','30','25','西北西','260','60','','54403','','莫拉菲','Molave','11'],

['200906','2009-07-18 03:00:00','21.1','119.4','978','30','25','西北偏西','260','60','','54402','','莫拉菲','Molave','11'],

['200906','2009-07-18 02:00:00','21','119.6','978','30','','','260','60','','54401','','莫拉菲','Molave','11'],

['200906','2009-07-18 01:00:00','21','120.1','978','30','25','西北偏西','260','60','','54400','','莫拉菲','Molave','11'],

['200906','2009-07-18 00:00:00','20.9','120.3','978','30','25','西北偏西','260','60','','54399','','莫拉菲','Molave','11'],

['200906','2009-07-17 23:00:00','20.8','120.5','978','30','20','西北偏西','260','60','','54398','','莫拉菲','Molave','11'],

['200906','2009-07-17 22:00:00','20.7','121','978','30','20','西北偏西','260','60','','54397','','莫拉菲','Molave','11'],

['200906','2009-07-17 21:00:00','20.7','121.2','978','30','20','西北偏西','260','60','','54396','','莫拉菲','Molave','11'],

['200906','2009-07-17 20:00:00','20.6','121.5','978','30','20','西北偏西','260','60','','54395','','莫拉菲','Molave','11'],

['200906','2009-07-17 19:00:00','20.4','121.8','980','28','20','西北西','260','60','','54394','','莫拉菲','Molave','10'],

['200906','2009-07-17 18:00:00','20.3','121.9','980','28','20','西北偏西','260','60','','54393','','莫拉菲','Molave','10'],

['200906','2009-07-17 17:00:00','20.2','122.1','980','28','20','西北偏西','200','50','','54392','','莫拉菲','Molave','10'],

['200906','2009-07-17 14:00:00','19.5','122.7','','25','','','200','50','','54391','','莫拉菲','Molave','10'],

['200906','2009-07-17 11:00:00','18.9','123.3','985','25','15','西北','200','50','','54390','','莫拉菲','Molave','10'],

['200906','2009-07-17 08:00:00','18.6','123.6','994','20','','','100','','','54389','','莫拉菲','Molave','8'],

['200906','2009-07-17 05:00:00','18.4','123.9','996','18','15','西北','100','','','54388','','莫拉菲','Molave','8'],

['200906','2009-07-17 02:00:00','17.9','124.1','996','18','15','西北','50','','','54387','','莫拉菲','Molave','8'],

['200906','2009-07-16 23:00:00','17.6','124.6','996','18','15','西北','','','','54386','','莫拉菲','Molave','8'],

['200906','2009-07-16 20:00:00','17.4','124.7','996','18','','','','','','54385','','莫拉菲','Molave','8']]

Splitting the string any further is really too complicated, so I did not write that part.
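One possible way to do the splitting is with two regular expressions, one for rows and one for quoted fields. This is only a sketch: the class name TrackParserSketch is made up, it assumes fields never contain a single quote or a closing bracket, and the reading of field 1 as time, fields 2 and 3 as latitude and longitude, and field 14 as the English name is inferred from the sample output rather than documented anywhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrackParserSketch {
    // One row looks like ['200906','2009-07-19 20:00:00','23.8',...]
    private static final Pattern ROW = Pattern.compile("\\[('[^\\]]*')\\]");
    // One field is whatever sits between a pair of single quotes (may be empty)
    private static final Pattern FIELD = Pattern.compile("'([^']*)'");

    /** Splits the array literal returned by the page into rows of string fields. */
    static List<List<String>> parse(String content) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(content);
        while (row.find()) {
            List<String> fields = new ArrayList<>();
            Matcher field = FIELD.matcher(row.group(1));
            while (field.find()) {
                fields.add(field.group(1));
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        // First row of the sample output, wrapped in the outer brackets
        String sample = "[['200906','2009-07-19 20:00:00','23.8','109.6','','15','','',"
                + "'260','','','54440','','莫拉菲','Molave','7']]";
        for (List<String> r : parse(sample)) {
            // Field positions are assumptions based on the sample data
            System.out.println(r.get(14) + " at " + r.get(1)
                    + " lat=" + r.get(2) + " lon=" + r.get(3));
        }
    }
}
```

Empty fields are kept as empty strings, so every row keeps the same number of columns and positions stay aligned.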

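To use it, pass http://www.wztf121.com/Typhoon.aspx?id= as the parameter,

with the typhoon's code number appended after id=. Writing a loop over the code numbers is all it takes.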
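The loop could look roughly like the sketch below. It assumes the id is the typhoon number as it appears in the first field of the data (for example 200906); the second id, the class name TyphoonLoopSketch, and the fetcher parameter are made up for illustration. The fetcher is passed in as a function so the sketch runs offline with a stub; in real use it would wrap an HTTP call such as the method shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class TyphoonLoopSketch {
    static final String BASE = "http://www.wztf121.com/Typhoon.aspx?id=";

    /** Builds one URL per typhoon id and collects what the fetcher returns for each. */
    static Map<String, String> fetchAll(List<String> ids, Function<String, String> fetcher) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps the request order
        for (String id : ids) {
            result.put(id, fetcher.apply(BASE + id));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stub fetcher: echoes the URL instead of touching the network
        Map<String, String> pages =
                fetchAll(List.of("200906", "200907"), url -> "<response of " + url + ">");
        pages.forEach((id, body) -> System.out.println(id + " => " + body));
    }
}
```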


Sharing is welcome. If you repost, please credit the source: 内存溢出 (outofmemory.cn)

Original address: https://outofmemory.cn/sjk/6758694.html
