How to clean UTF-8 data for MySQL

I have UTF-8 text data from Twitter (so it may be very dirty). When I insert it into MySQL (the database character set is utf8), some of the text gets garbled. I would like a way to clean the data before putting it in.

Insert ignore search_tweets set id_str = 'pass1',text = 'RT @youpon_info: Youponです!この度はキャンペーン参加ありがとうございました。たくさんの方々にキャンペーンに参加して頂きました。'  ;
Insert ignore search_tweets set id_str = 'fail',text = 'RT @youpon_info: Youponです!この度はキャンペーン参加ありがとうございました。たくさんの方々にキャンペーンに参加して頂きました。また次のキャンペーンをすぐに予定しております!もう少'  ;
Insert ignore search_tweets set id_str = 'pass2',text = 'また次のキャンペーンをすぐに予定しております!もう少'  ;

fail.text = pass1.text + pass2.text; pass1 and pass2 both go into and come out of MySQL fine, but fail comes out as:

RT @youpon_info: Youponã§ãï¼ãã®åº¦ã¯ã­ã£ã³ãã¼ã³åå ãããã¨ããããã¾ãããããããã®æ¹ã

The inserts above were done with direct MySQL calls, although originally it was all done in Ruby with DataMapper as well as direct calls.

I would like to know how to clean the data so it goes into and comes out of MySQL unchanged. A Ruby solution would be nice, but just knowing how to clean it would be great.


It looks like the data is being truncated. Do you have enough room in the text column for the data you are inserting?

I suspect varchar(n) will only accept n bytes, not n characters, and the Japanese characters take 3 bytes each. MySQL is known for silently truncating data that doesn't fit, and if the truncation happens to land in the middle of a UTF-8 character, the reader may decide it isn't valid UTF-8 and interpret it as ISO 8859-1, which would produce exactly what you are seeing.
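To see that mechanism in isolation (a minimal Ruby sketch, not taken from the poster's code): chop a UTF-8 string at an arbitrary byte boundary and it stops being valid UTF-8, and re-reading those bytes as ISO 8859-1 gives the same kind of garbage as above.

# Each of these characters is 3 bytes in UTF-8.
text = 'また次のキャンペーン'
puts text.length      # => 10 (characters)
puts text.bytesize    # => 30 (bytes)

# Cut at a byte boundary, the way a too-small byte-counted column would.
truncated = text.byteslice(0, 8)
puts truncated.valid_encoding?   # => false

# A reader that falls back to Latin-1 then prints "ã..."-style mojibake.
puts truncated.force_encoding('ISO-8859-1').encode('UTF-8')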

Note that in UTF-8, all characters of living languages fit in 3 bytes (Chinese, Japanese and Korean are among those that always need all 3), while extended symbols and historical scripts need 4 bytes. So to stay on the safe side, the database must be willing to accept 4 times as many bytes as there are characters allowed.
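If widening the column is not an option, a pre-insert cleanup along these lines may help with the "clean it in Ruby" part of the question. This is only a sketch: clean_for_mysql and the 255-byte limit are invented for the example, the real limit has to come from your column definition, and String#scrub needs Ruby 2.1+.

MAX_BYTES = 255   # hypothetical limit; use the actual byte capacity of your column

def clean_for_mysql(text, max_bytes = MAX_BYTES)
  text = text.dup.force_encoding('UTF-8')   # treat the raw tweet bytes as UTF-8
  text = text.scrub('')                     # drop invalid byte sequences entirely
  # Trim whole characters from the end until the string fits the byte budget,
  # so nothing ever gets cut through the middle of a multi-byte character.
  text = text[0...-1] while text.bytesize > max_bytes
  text
end

puts clean_for_mysql('また次のキャンペーン' * 50).bytesize   # => 255, never mid-character

Even so, giving the column enough room in bytes for the longest text you expect, as described above, is the more direct fix.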
