之前一篇 「Catch Google Suggest Keyword 」,抓到的熱門中文詞常包含簡體中文,若要抓繁體中文的話,需要在 google suggestion web api 加入額外的參數。google suggestion web api 的網址可透過抓封包的軟體查看。
http://www.google.com.tw/complete/search?hl=zh-TW&client=suggest&js=true&qu=
如上,需要加入參數 hl=zh-TW ,否則預設會抓到簡體中文的 hot word,qu= 就輸入欲查詢字詞的 utf8 編碼。另外抓到的熱門詞也經常包含一些雜七雜八的詞,用人工篩選需要耗費很多的時間及精神。可以透過中文 wiki 將這些熱門詞輸入查詢,若找得到就保留下來,找不到就剔除。底下的 code 和之前那一篇大同小異,多了一道手續將熱門詞送往 wiki 篩選。需注意的是 perl 的 regular expression 的 pattern 如何處理 big5,可參考 JeffHung.Blog - Perl with UTF-8 mode ,目前是採用此篇文章後面評論員的方法 :
呼叫 metaquote 函式將 big5 傳入,傳回值當作 regular expression 的 pattern。
-
# gsuggest.pl - Google suggest
-
#
-
# c Copyright, 2004-2005 By John Bokma, http://johnbokma.com/
-
#
-
# Last updated: 2007-04-26 17:56:35 -0600
-
#
-
# §什acao§?google suggest keyword ao perl script
-
# ??-^?崦|raoIR?A?餘google suggest keywords
-
#
-
#!/usr/bin/perl -w
-
use strict;
-
use warnings;
-
-
#use URI::Escape;
-
use LWP::UserAgent;
-
use URI;
-
use Encode qw/encode decode/;
-
-
my $count = 1;
-
-
-
sub GetKeyWords
-
{
-
my ($key) = @_;
-
-
my $utf_data = encode("utf8", decode("big5", $key));
-
#print "n==> $utf_datan";
-
my $url = 'http://www.google.com.tw/complete/search?hl=zh-TW&client=suggest&js=true&qu=' . URI->new("$utf_data");
-
#uri_escape( join ' ' => @ARGV );
-
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );
-
my $response = $ua->get( $url );
-
-
if (!$response->is_success) {
-
return "";
-
}
-
-
my $content = $response->content;
-
-
# extract the information from the JavaScript file response
-
# note that the first " and the last " in the Arrays are
-
# excluded (if data is present)
-
my ( $element, $array1, $array2 ) = $content =~
-
/"(.+?)".+?Array("(.+?)").+?Array("(.+?)")/;
-
-
-
print "No resultsn";
-
#exit;
-
return "";
-
}
-
-
# split the first "array" on the item separator (the very first "
-
# and the very last " are already removed)
-
-
# remove the result(s) string from the number of results
-
# remove if present (no results known)
-
# and split the second "array"
-
# note that a negative limit is used to catch trailing empty
-
# results (caused by removing )
-
$array2 =~ s/ results?//g;
-
$array2 =~ s/ //g;
-
-
# make suggestion => result(s) pairs.
-
# note that the number of results is turned into a right justified string
-
-
# print the pairs in suggested order
-
#print "@$_n" for @pairs;
-
-
# print the pairs sorted on the number of results for each suggestion,
-
# largest "number" first since the numbers are right justified strings.
-
#print "nsorted:n";
-
sleep 1;
-
}
-
-
sub LookupWiki
-
{
-
my ($key) = @_;
-
-
print "$count : $key => ";
-
-
my $url = "http://zh.wikipedia.org/w/index.php?title=$key&variant=zh-tw";
-
my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );
-
my $response = $ua->get( $url );
-
next if (!$response->is_success);
-
my $content_big5 = encode("big5", decode("utf8", $response->content));
-
print "Xn";
-
print FOUT "X $keyn";
-
-
} else {
-
print "On";
-
print FOUT "O $keyn";
-
}
-
-
$count++;
-
}
-
-
$count = 1;
-
-
while() {
-
chomp ;
-
GetKeyWords($_);
-
}
-
close W1;
-
-
while() {
-
chomp ;
-
GetKeyWords($_);
-
}
-
close W2;
-
-
close FOUT;