How to store and display both ISO-8859-1 and UTF-8 characters using Perl

I am quite new to this, and this might be very easy to most people, but I have been struggling with this for days.

I'm writing a web crawler in Perl that extracts certain information using LWP and some simple regular expressions.

This information is saved in a MySQL database, which will be used on an Android device. However, when I tested the web crawler, I realized that some of the information is in Chinese (典華) written as HTML numeric character references (&#20856;&#33775;), and some of it uses ISO-8859-1 encoding (Zhífú). I solved the Chinese part using Perl's HTML::Entities module, and the result displays correctly when I set my console to UTF-8. However, the other letters (Zhífú) can only be displayed in ISO-8859-1. If I try to display them in UTF-8, they become Zh�f�. My questions are:

  1. How can I determine which encoding a page uses, and how can I display each one correctly?
  2. Can I store the text in MySQL directly, or should I process it first? (Correct me if I am wrong, but my understanding is that MySQL uses utf8 as its default character set.)
  3. Will this cause problems when I display it on an Android device?

Thank you very much.


"(Zhífú) can only be displayed in iso-8859-1. If I try to display it in utf8, it will become Zh�f�."

That's completely false. You can display "Zhífú" in both iso-8859-1 and UTF-8 terminals/applications/whatever. Indeed, the fact that you see "Zhífú" here is proof that it can be displayed in UTF-8, since this is a UTF-8 web page. If you're getting "Zh�f�", it's because you didn't encode the string as UTF-8 before giving it to the terminal/application/whatever that wants UTF-8: it received iso-8859-1 bytes, which are not valid UTF-8.
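
To make that concrete, here is a minimal sketch using the core Encode module. It takes one decoded string and encodes it both ways. The helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

# A decoded (character) string: Z, h, U+00ED, f, U+00FA.
my $text = "Zh\x{ED}f\x{FA}";

# The same characters, encoded for two different kinds of consumer.
my $latin1 = encode('ISO-8859-1', $text);
my $utf8   = encode('UTF-8',      $text);

print hex_bytes($latin1), "\n";   # 5a 68 ed 66 fa
print hex_bytes($utf8),   "\n";   # 5a 68 c3 ad 66 c3 ba
```

The bytes ed and fa on their own are not valid UTF-8 sequences, which is why a UTF-8 terminal that is handed the first form shows a replacement character in their place.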

Anyway, on to the question. I'm assuming that you're storing text, not HTML.

Decode every input! Encode every output! Then no problem.

         From the web
     5a 68 c3 ad 66 c3 ba
              |
            decode         Done for you by ->decoded_content (LWP::UA)
              |            or by ->content (WWW::Mech)
              v

         Decoded text      Manipulate as desired
            Zhífú

              |  
            encode         Done for you by DBI
              |  
              v
           Database
     5a 68 c3 ad 66 c3 ba

In fact, the decoding should already be done for you by ->decoded_content, and the encoding should already be done for you by DBI, so I don't see why you're having trouble with this.
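
As a rough sketch of the input side, the following builds two HTTP::Response objects by hand (so it runs without network access) and lets ->decoded_content pick the decoder from the declared charset. The sample bodies and variable names are made up for illustration. In a real crawler the responses would come from your LWP::UserAgent object, and a page that declares the wrong charset would still need manual handling.

```perl
use strict;
use warnings;
use HTTP::Response;
use HTML::Entities qw(decode_entities);

# Hand-built responses standing in for what $ua->get($url) would return.
my $latin1_page = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    "Zh\xEDf\xFA",                            # raw ISO-8859-1 bytes
);
my $utf8_page = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "Zh\xC3\xADf\xC3\xBA &#20856;&#33775;",   # raw UTF-8 bytes plus entities
);

# decoded_content honours the declared charset and returns characters,
# so both pages end up in the same decoded form.
my $latin1_text = $latin1_page->decoded_content;

# Expand numeric character references only after the bytes are decoded.
my $utf8_text = decode_entities($utf8_page->decoded_content);

# Encode on output; UTF-8 is assumed here for the terminal.
binmode(STDOUT, ':encoding(UTF-8)');
print "$latin1_text\n";
print "$utf8_text\n";
```

This also covers the first question: you normally don't have to detect the encoding yourself, because the response says which one it uses and ->decoded_content acts on that.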

Same thing when you read from the database and output to the screen/whatever.

     5a 68 c3 ad 66 c3 ba
           Database
              |
            decode         Done for you by DBI if you use
              |            the ..._utf8 flag for your driver
              v

         Decoded text      Manipulate as desired
            Zhífú

              |  
            encode         use open ':std', ':locale';
              |  
              v
            Screen
     5a 68 c3 ad 66 c3 ba
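
For the database-to-screen direction, something along these lines should work, assuming DBD::mysql as the driver. The sub names, the table name "places" and the column name "name" are placeholders invented for this sketch, and nothing here connects until you call connect_utf8 with your own DSN and credentials.

```perl
use strict;
use warnings;
use DBI;
use open ':std', ':locale';   # encode STDOUT/STDERR to match the terminal's locale

# Hypothetical helper: connect with the driver's UTF-8 flag turned on.
sub connect_utf8 {
    my ($dsn, $user, $password) = @_;
    return DBI->connect($dsn, $user, $password, {
        RaiseError        => 1,
        mysql_enable_utf8 => 1,   # DBD::mysql: exchange text with the server as UTF-8
    });
}

# Hypothetical helper: print one text column from a placeholder table.
sub print_names {
    my ($dbh) = @_;
    my $sth = $dbh->prepare('SELECT name FROM places');
    $sth->execute;
    while (my ($name) = $sth->fetchrow_array) {
        # $name is decoded text here; the :std layer encodes it on output.
        print "$name\n";
    }
    return;
}
```

Other drivers have their own equivalent of mysql_enable_utf8, so check the documentation for the one you use.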