How to clean UTF-8 data for MySQL
I have UTF-8 text data from Twitter (so it may be very dirty). When inserted into MySQL (database character set is utf8), some text gets garbled. I would like a way to clean the data before putting it in.
Insert ignore search_tweets set id_str = 'pass1',text = 'RT @youpon_info: Youponです!この度はキャンペーン参加ありがとうございました。たくさんの方々にキャンペーンに参加して頂きました。' ;
Insert ignore search_tweets set id_str = 'fail',text = 'RT @youpon_info: Youponです!この度はキャンペーン参加ありがとうございました。たくさんの方々にキャンペーンに参加して頂きました。また次のキャンペーンをすぐに予定しております!もう少' ;
Insert ignore search_tweets set id_str = 'pass2',text = 'また次のキャンペーンをすぐに予定しております!もう少' ;
fail.text = pass1.text + pass2.text
and pass1 and pass2 both go into and come out of MySQL fine. fail comes out as
RT @youpon_info: Youponã§ãï¼ãã®åº¦ã¯ãã£ã³ãã¼ã³åå ãããã¨ããããã¾ãããããããã®æ¹ã
I have reproduced this with direct MySQL calls, although originally it was all done through Ruby DataMapper as well as direct calls.
I would like to know how to clean the data so it goes into and comes out of MySQL unchanged. A Ruby solution would be nice, but just knowing how to clean it would be great.
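For the cleaning part, one common Ruby approach is to re-encode the string so that invalid or undefined byte sequences are dropped rather than raising errors. This is a sketch, not a guaranteed fix for the truncation issue discussed in the answer below; the function name is illustrative:

```ruby
# Sketch: strip invalid UTF-8 byte sequences before inserting into MySQL.
# Bouncing through UTF-16 forces Ruby to actually re-validate the bytes
# (encoding a UTF-8 string to UTF-8 can be a no-op in older Rubies).
def clean_utf8(text)
  text.encode('UTF-16', invalid: :replace, undef: :replace, replace: '')
      .encode('UTF-8')
end

dirty = "abc\xE3\x81".force_encoding('UTF-8')  # ends mid-character
puts clean_utf8(dirty)  # => "abc"
```

On Ruby 2.1 and later, `text.scrub('')` does the same job more directly.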
It looks like the data is being truncated. Do you have enough room in the text column for the data you are inserting?
I suspect a varchar(n) column will only accept n bytes, not n characters, and the Japanese characters take 3 bytes each in UTF-8. MySQL is known for silently truncating data that doesn't fit, and if the cut happens to fall in the middle of a UTF-8 character, the reader may decide the result is not valid UTF-8 and interpret it as ISO-8859-1, which would produce what you are seeing.
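You can see the mid-character truncation effect directly in Ruby. This sketch cuts a Japanese string at a fixed byte count, the way a byte-limited column might:

```ruby
# Each of these 6 Japanese characters takes 3 bytes in UTF-8, so the
# string is 18 bytes long; cutting at 10 bytes splits the 4th character.
text = "キャンペーン"
truncated = text.byteslice(0, 10)

puts text.bytesize             # 18
puts truncated.valid_encoding? # false -- the last character was split
```

A reader that sees the invalid tail may fall back to a single-byte encoding, which is exactly the mojibake shown in the question.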
Note that in UTF-8, all characters of living languages fit in 3 bytes (Chinese, Japanese and Korean are among those that always need all 3), while extended symbols and historical scripts need 4 bytes. So to stay on the safe side, the database must be willing to accept 4 times as many bytes as there are characters allowed.
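The per-character byte costs above are easy to check in Ruby:

```ruby
# UTF-8 byte cost per character: ASCII is 1 byte, Japanese kana/kanji
# are 3 bytes, and supplementary-plane symbols such as emoji are 4.
puts "a".bytesize   # 1
puts "こ".bytesize  # 3
puts "😀".bytesize  # 4
```

Note also that MySQL's `utf8` character set only stores up to 3 bytes per character; storing 4-byte characters such as emoji requires the `utf8mb4` character set.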