EvoTalk

24 五月, 2007

修正 – Catch Google Suggest Keyword

Posted by: asd In: 工作日誌 ()

之前一篇 「Catch Google Suggest Keyword 」,抓到的熱門中文詞常包含簡體中文,若要抓繁體中文的話,需要在 google suggestion web api 加入額外的參數。google suggestion web api 的網址可透過抓封包的軟體查看。

http://www.google.com.tw/complete/search?hl=zh-TW&client=suggest&js=true&qu=

如上,需要加入參數 hl=zh-TW ,否則預設會抓到簡體中文的 hot word,qu= 就輸入欲查詢字詞的 utf8 編碼。另外抓到的熱門詞也經常包含一些雜七雜八的詞,用人工篩選需要耗費很多的時間及精神。可以透過中文 wiki 將這些熱門詞輸入查詢,若找得到就保留下來,找不到就剔除。底下的 code 和之前那一篇大同小異,多了一道手續將熱門詞送往 wiki 篩選。需注意的是 perl 的 regular expression 的 pattern 如何處理 big5,可參考 JeffHung.Blog - Perl with UTF-8 mode ,目前是採用此篇文章後面評論員的方法 :

呼叫 metaquote 函式將 big5 傳入,傳回值當作 regular expression 的 pattern。

PERL:
  1. # gsuggest.pl - Google suggest
  2. #
  3. # c Copyright, 2004-2005 By John Bokma, http://johnbokma.com/
  4. #
  5. # Last updated: 2007-04-26 17:56:35 -0600
  6. #
  7. # §什acao§?google suggest keyword ao perl script
  8. # ??-^?崦|raoIR?A?餘google suggest keywords
  9. #
  10. #!/usr/bin/perl -w
  11. use strict;
  12. use warnings;
  13.  
  14. #use URI::Escape;
  15. use LWP::UserAgent;
  16. use URI;
  17. use Encode qw/encode decode/;
  18.  
  19. my $count = 1;
  20.  
  21. open(FOUT, ">cht_filter.txt") || $!;
  22.  
  23. sub GetKeyWords
  24. {
  25. my ($key) = @_;
  26.  
  27. my $utf_data = encode("utf8", decode("big5", $key));
  28. #print "n==> $utf_datan";
  29. my $url = 'http://www.google.com.tw/complete/search?hl=zh-TW&client=suggest&js=true&qu=' . URI->new("$utf_data");
  30. #uri_escape( join ' ' => @ARGV );
  31. my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );
  32. my $response = $ua->get( $url );
  33.  
  34. if (!$response->is_success) {
  35. print "$url: ", $response->status_line;
  36. return "";
  37. }
  38.  
  39. my $content = $response->content;
  40.  
  41. # extract the information from the JavaScript file response
  42. # note that the first " and the last " in the Arrays are
  43. # excluded (if data is present)
  44. my ( $element, $array1, $array2 ) = $content =~
  45. /"(.+?)".+?Array("(.+?)").+?Array("(.+?)")/;
  46.  
  47. unless ( defined $array1 ) {
  48.  
  49. print "No resultsn";
  50. #exit;
  51. return "";
  52. }
  53.  
  54. # split the first "array" on the item separator (the very first "
  55. # and the very last " are already removed)
  56. my @suggestions = split /", "/, $array1;
  57.  
  58. # remove the result(s) string from the number of results
  59. # remove   if present (no results known)
  60. # and split the second "array"
  61. # note that a negative limit is used to catch trailing empty
  62. # results (caused by removing  )
  63. $array2 =~ s/ results?//g;
  64. $array2 =~ s/ //g;
  65. my @results = split /", "/, $array2, -1;
  66.  
  67. # make suggestion => result(s) pairs.
  68. # note that the number of results is turned into a right justified string
  69. my @pairs = map { [ sprintf ( "%12s", shift @results ) => $_ ] } @suggestions;
  70.  
  71. # print the pairs in suggested order
  72. #print "@$_n"  for @pairs;
  73.  
  74. # print the pairs sorted on the number of results for each suggestion,
  75. # largest "number" first since the numbers are right justified strings.
  76. #print "nsorted:n";
  77. LookupWiki("$$_[1]") for sort { $b->[0] cmp $a->[0] } @pairs;
  78. }
  79.  
  80. sub LookupWiki
  81. {
  82. my ($key) = @_;
  83.  
  84. print "$count : $key  =>  ";
  85.  
  86. $key =~ s/s+/_/g;
  87. my $url = "http://zh.wikipedia.org/w/index.php?title=$key&variant=zh-tw";
  88. my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );
  89. my $response = $ua->get( $url );
  90. next if (!$response->is_success);
  91. my $content_big5 = encode("big5", decode("utf8", $response->content));
  92. my $regex1 = quotemeta('沒有找到標題');
  93. my $regex2 = quotemeta('維基百科目前還沒有與此同名的條目');
  94. if ($content_big5  =~ m/$regex1/o || $content_big5  =~ m/$regex2/o) {
  95. print "Xn";
  96. print FOUT "X $keyn";
  97.  
  98. } else {
  99. print "On";
  100. print FOUT "O $keyn";
  101. }
  102.  
  103. $count++;
  104. }
  105.  
  106. $count = 1;
  107. open(W1, "1w.txt") || die $!;
  108.  
  109. while() {
  110. GetKeyWords($_);
  111. }
  112. close W1;
  113.  
  114. open(W2, "2w.txt") || die $!;
  115. while() {
  116. GetKeyWords($_);
  117. }
  118. close W2;
  119.  
  120. close FOUT;

Tags:

Releated Posts



1 Response to "修正 – Catch Google Suggest Keyword"

1 | Isabella

五月 24th, 2010 at 2:51

Avatar

Three million members and counting: nude personals, adult photo personals.

Comment Form