perl – Problem comparing Japanese characters

I am struggling to parse an HTML document containing Japanese characters with HTML::TokeParser.

Here is my code:

    use utf8;
    use Encode qw(decode encode is_utf8);
    use Encode::Guess;
    use Data::Dumper;
    use LWP::UserAgent;
    use HTTP::Cookies;
    use Cwd;
    use HTML::TokeParser;

    my $local_dir = getcwd;
    my $browser = LWP::UserAgent->new();
    my $cookie_jar = HTTP::Cookies->new(
        file     => $local_dir . "/cookies.lwp",
        autosave => 1,
    );
    $browser->cookie_jar( $cookie_jar );
    push @{ $browser->requests_redirectable }, 'POST';
    $browser->requests_redirectable;

    my $response = $browser->get("http://www.yahoo.co.jp/");
    my $HTML = $response->content;
    print $HTML;
    utf8::decode($HTML);
    my $p = HTML::TokeParser->new( \$HTML );

    # dispatch table with subs to handle the different types of tokens
    my %dispatch = (
        S  => sub { $_[0]->[4] },    # Start tag
        E  => sub { $_[0]->[2] },    # End tag
        T  => sub { $_[0]->[1] },    # Text
        C  => sub { $_[0]->[1] },    # Comment
        D  => sub { $_[0]->[1] },    # Declaration
        PI => sub { $_[0]->[2] },    # Process Instruction
    );

    while ( my $token = $p->get_tag('a') ) {
        print $p->get_trimmed_text if $p->get_trimmed_text eq '社会的責任';
        print "\n";
    }

This prints nothing on my terminal, but if I simply print $p->get_trimmed_text, the output is fine.

Here are a few hexdump lines corresponding to print $p->get_trimmed_text:

    0000000 490a 746e 7265 656e 2074 7845 6c70 726f
    0000010 7265 81e3 e4ae 92ba 8fe6 e89b a8a1 a4e7
    0000020 e3ba ab81 81e3 e3a4 8481 81e3 0aa6 9fe7
    0000030 e5b3 9db7 81e9 e3bc 8982 9be5 e5bd 8586
    0000040 a4e5 e396 ae81 83e3 e397 ad83 82e3 e3b4
    0000050 ab83 83e3 e395 a182 83e3 e3bc 8c81 86e7
    0000060 e68a ac9c 94e6 e6af b48f 320a e334 ab82
    0000070 89e6 e380 ae81 b4e7 e885 8991 90e5 e68d
    0000080 8089 82e3 e692 a597 b8e5 e3b0 8a82 82e3
    0000090 e3b3 bc83 82e3 e4b9 95bb abe7 e38b a681
    00000a0 81e3 e7a7 b9b4 bbe4 0a8b 83e3 e39e af82
    00000b0 83e3 e389 8a83 83e3 e3ab 8983 82e3 e384
    00000c0 8783 83e3 e38b bc83 82e3 e3ba ae81 81e3
    00000d0 e58a 97be 81e3 e3aa af82 83e3 e3bc 9d83
    00000e0 83e3 e9b3 8d85 bfe4 0aa1 a8e8 e88e 96ab
    00000f0 bce4 e39a 8c80 83e3 e392 a983 83e3 e3aa
    0000100 bc83 b0e6 e58f 9d8b 88e5 e3a9 8d80 3235
    0000110 e525 9986 9ce7 4e9f 5745 e50a a7a4 98e9

It seems the comparison is not working.

I can only use HTML::TokeParser, because it is the only module installed on the server and I cannot install any others.

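Solution

Please see ikegami's answer. Mine is only an alternative approach and does not fix the actual problem in the code.

Unicode::Collate to the rescue!

Note that I added the following to your code: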

    use Unicode::Collate;
    use open qw/:std :utf8/;

    my $Collator = Unicode::Collate->new();

    sub compare_strs {
        my ( $str1, $str2 ) = @_;
        # Treat vars as strings by quoting.
        # Possibly incorrect/irrelevant approach.
        return $Collator->cmp( "$str1", "$str2" );
    }

Note: the compare_strs subroutine returns 1 when $str1 is greater than $str2, 0 when $str1 equals $str2, and -1 when $str1 is less than $str2.
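The three return values described in the note can be checked in isolation with a minimal sketch. It assumes only the core Unicode::Collate module and reuses the compare_strs name from the snippet above; the sample strings are illustrative and do not come from the fetched page.

```perl
use strict;
use warnings;
use utf8;                      # string literals in this file are UTF-8
use Unicode::Collate;

my $Collator = Unicode::Collate->new();

# Same helper as in the answer: returns -1, 0 or 1.
sub compare_strs {
    my ( $str1, $str2 ) = @_;
    return $Collator->cmp( $str1, $str2 );
}

# Illustrative inputs (assumptions, not data from the original page).
my $same = compare_strs( '社会的責任', '社会的責任' );   # identical strings
my $less = compare_strs( 'a', 'b' );                     # 'a' sorts before 'b'
my $more = compare_strs( 'b', 'a' );                     # 'b' sorts after 'a'

print "$same $less $more\n";   # prints "0 -1 1"
```

Because a match yields 0, which is false in Perl, a caller has to test for a false result, for example with unless or with == 0.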

Here is the complete working code:

    use strict;
    use warnings;
    use utf8;
    use Unicode::Collate;
    use open qw/:std :utf8/;
    use Encode qw(decode encode is_utf8);
    use Encode::Guess;
    use Data::Dumper;
    use LWP::UserAgent;
    use HTTP::Cookies;
    use Cwd;
    use HTML::TokeParser;

    my $local_dir = getcwd;
    my $browser = LWP::UserAgent->new();
    my $cookie_jar = HTTP::Cookies->new(
        file     => $local_dir . "/cookies.lwp",
        autosave => 1,
    );
    $browser->cookie_jar( $cookie_jar );
    push @{ $browser->requests_redirectable }, 'POST';
    $browser->requests_redirectable;

    my $Collator = Unicode::Collate->new();

    sub compare_strs {
        my ( $str1, $str2 ) = @_;
        # Treat vars as strings by quoting.
        # Possibly incorrect/irrelevant approach.
        return $Collator->cmp( "$str1", "$str2" );
    }

    my $response = $browser->get("http://www.yahoo.co.jp/");
    my $HTML = $response->content;
    #print $HTML;
    utf8::decode($HTML);
    my $p = HTML::TokeParser->new( \$HTML );

    # dispatch table with subs to handle the different types of tokens
    my %dispatch = (
        S  => sub { $_[0]->[4] },    # Start tag
        E  => sub { $_[0]->[2] },    # End tag
        T  => sub { $_[0]->[1] },    # Text
        C  => sub { $_[0]->[1] },    # Comment
        D  => sub { $_[0]->[1] },    # Declaration
        PI => sub { $_[0]->[2] },    # Process Instruction
    );

    my $string = '社会的責任';

    while ( my $token = $p->get_tag('a') ) {
        my $text = $p->get_trimmed_text;
        unless ( compare_strs( $text, $string ) ) {
            print $text;
            print "\n";
        }
    }

Output:

    chankeypathak@perl:~/Desktop$ perl test.pl
    社会的責任
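One detail of the loop in the question is worth illustrating separately. This is an observation about HTML::TokeParser, not a summary of ikegami's answer. Each call to get_trimmed_text consumes the text at the current position, so calling it once in the condition and again in the print reads two different positions. The sketch below assumes a small inline document ($html is a hypothetical stand-in for the decoded page) and shows both the pitfall and a plain eq comparison on a stored value.

```perl
use strict;
use warnings;
use utf8;                      # string literals in this file are UTF-8
use HTML::TokeParser;

# Hypothetical stand-in for the fetched, already-decoded page.
my $html = '<a href="/a">ニュース</a><a href="/b">社会的責任</a>';

# Pitfall: the second call no longer sees the link text.
my $demo = HTML::TokeParser->new( \$html );
$demo->get_tag('a');
my $first  = $demo->get_trimmed_text;   # text inside the first <a>
my $second = $demo->get_trimmed_text;   # next token is </a>, so this is empty

# Reading the text once and reusing it makes a plain eq comparison work.
my $p = HTML::TokeParser->new( \$html );
my @matches;
while ( my $token = $p->get_tag('a') ) {
    my $text = $p->get_trimmed_text;
    push @matches, $text if $text eq '社会的責任';
}

binmode STDOUT, ':encoding(UTF-8)';     # encode wide characters on output
print "$_\n" for @matches;
```

The answer's full listing already stores the text in $text before comparing, which is likely why it prints the expected link regardless of which comparison function is used.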