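Function description:
First, check whether each HTML file is a Dangdang book page; files that are not are skipped. For matching pages, extract the book title, price, author, publisher, and similar details and append them to a file.
Command to run the Perl program: perl programfile HTML_file_List
The original code follows: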
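#!/usr/bin/perl
use HTML::Element;
use HTML::TreeBuilder;
use HTML::Parser;

# Use "or" here: "||" binds to the ">>data" string, so die would never run.
open(DATAFH, ">>", "data") or die "open file failed: $!";
select DATAFH;
$| = 1;    # unbuffer the selected handle (DATAFH)

foreach my $file_name (@ARGV) {
    unless (-e $file_name) {
        print STDOUT "File $file_name does not exist.\n";
        next;
    }
    ## First make sure the file is UTF-8, converting it if necessary. In my experience
    ## Perl's HTML modules only seem to handle UTF-8 files; Chinese text in GBK files
    ## comes out garbled, and almost every page downloaded from the web is GBK.
    system "enca", "-L", "zh_CN", "-x", "UTF-8", $file_name;
    ## The advantage of enca over iconv is that the source encoding does not matter
    ## (even if the file is already UTF-8); it simply converts to UTF-8.
    my $root = HTML::TreeBuilder->new;
    $root->parse_file($file_name);
    my $title = $root->find_by_tag_name('title');
    my $str_title = $title ? $title->as_text() : '';
    # The pattern matches page titles ending in "Books - Dangdang" (in Chinese).
    if ($str_title =~ qr/图书 - 当当网$/) {
        print "Page title: $str_title\n";
        # The class value below is garbled in the source ("@R_404_6882@" is a
        # placeholder, not the real class name); it is kept verbatim.
        my $div_h1 = $root->look_down("_tag", "div", "class", 'h1_Title book_@R_404_6882@');
        my $h1 = $div_h1->look_down('_tag', 'h1');
        print "Book title: ", $h1->as_text(), "\n";
        # Several look_down calls lost arguments in the source. "class" (and "div",
        # where the variable name implies it) is restored as the most likely reading.
        my $div_info = $root->look_down("_tag", "div", "class", "info book_r");
        my $price_d = $div_info->look_down("_tag", "p", "class", "price_d");
        print $price_d->as_text(), "\n";
        # The tag name for price_m is missing in the source, so match on class only.
        my $price_m = $div_info->look_down("class", "price_m");
        foreach my $node_r ($price_m->content_refs_list) {
            next if ref $$node_r;    # skip child elements, print text nodes only
            print $$node_r, "\n";
        }
        my $div_detail = $div_info->look_down("_tag", "div", "class", "book_detailed", "name", "__Property_pub");
        my @elements = $div_detail->find_by_tag_name('p');
        foreach (@elements) {
            print $_->as_text(), "\n";
        }
        my $clear = $div_detail->look_down("_tag", "ul", "class", "clearfix");
        my @lis = $clear->content_list();
        my @spans = $lis[0]->content_list();
        print $spans[1]->as_text(), "\n";
        @spans = $lis[2]->content_list();
        print $spans[1]->as_text(), "\n";
        print $spans[2]->as_text(), "\n";
        print "********************************************\n";
    }
    else {
        print STDOUT "$file_name is not a Dangdang book information page.\n";
    }
    $root->delete;    # free the tree for every file, matching or not
}
close DATAFH;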
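The script above calls as_text directly on whatever look_down returns, so it will die as soon as Dangdang changes a class name and a lookup comes back undefined. As a rough sketch only, the same lookups can be wrapped with checks. The helper name extract_book_info and the sample markup are made up for illustration and are not taken from a real Dangdang page; only the class names info book_r and price_d come from the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; the real Dangdang page layout may differ.
my $sample_html = <<'HTML';
<html><head><title>Sample Book - Books - Dangdang</title></head>
<body>
  <div class="info book_r">
    <p class="price_d">Price: 25.00</p>
    <p class="price_m">List price: 38.00</p>
  </div>
</body></html>
HTML

# extract_book_info is a hypothetical helper, not part of the original script.
# It returns a hash reference holding only the fields that were actually found.
sub extract_book_info {
    my ($html) = @_;
    my $root = HTML::TreeBuilder->new_from_content($html);

    my %info;
    my $title = $root->find_by_tag_name('title');
    $info{title} = $title->as_text if $title;

    # look_down takes attribute/value pairs; '_tag' matches the element name.
    my $div = $root->look_down('_tag', 'div', 'class', 'info book_r');
    if ($div) {
        my $price = $div->look_down('_tag', 'p', 'class', 'price_d');
        $info{price} = $price->as_text if $price;
    }

    $root->delete;    # free the tree explicitly
    return \%info;
}

my $info = extract_book_info($sample_html);
print "title: $info->{title}\n" if defined $info->{title};
print "price: $info->{price}\n" if defined $info->{price};
```

Checking each result before using it lets the script skip a missing field and carry on with the next file instead of aborting the whole batch.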
Contents of the output file: