
How to query MySQL for exact length and exact UTF-8 characters

I have a table containing a dictionary of words in my language (Latvian).

CREATE TABLE words (
  value varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

And let's say it has 3 words inside:

INSERT INTO words (value) VALUES ('tēja');

INSERT INTO words (value) VALUES ('vējš');

INSERT INTO words (value) VALUES ('feja');

I want to find all words that are exactly 4 characters long, where the second character is 'ē' and the third character is 'j'.

It seems to me that the correct query would be:

SELECT * FROM words WHERE value LIKE '_ēj_';

The problem with this query is that it returns not 2 entries ('tēja', 'vējš') but all three. As I understand it, this is because MySQL internally converts the strings to some ASCII representation?

Then there is the option of adding BINARY to LIKE:

SELECT * FROM words WHERE value LIKE BINARY '_ēj_';

But this does not return the 2 entries ('tēja', 'vējš') either, only one ('tēja'). I believe this has something to do with UTF-8 using 2 bytes for non-ASCII characters?

So my question is:

What MySQL query would return my exact two words ('tēja','vējš')?

Thank you in advance


What MySQL query would return my exact two words ('tēja','vējš')?

SELECT * FROM words WHERE value LIKE '_ēj_' COLLATE utf8_bin;

The utf8_bin collation is not just diacritic-sensitive, but also case-sensitive. If you want to match only the letter with its diacritic and you don't care about upper/lower case, you would have to find a utf8_..._ci collation that doesn't treat e and ē as the same letter.

I can't immediately see one (there are plenty that don't collate ē at all, which would be okay if you only need case-insensitive matching on the non-diacritical letters). It is interesting that the Latvian collation treats macron letters as the same as plain letters, which you don't want, even though it knows š is different from s.

Anyway, whatever collation you end up with, you will want to define your table columns with that collation rather than specifying it manually in each query, so that comparisons can use the indexes properly.
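
As for the LIKE BINARY attempt in the question, the asker's guess about byte lengths looks right. The following is a minimal Python sketch, not MySQL itself, that models the assumption that a binary comparison lets each '_' match exactly one byte, whereas a character-wise comparison lets it match one character. The helper names like_by_bytes and like_by_chars are made up for illustration.

```python
import re

words = ["tēja", "vējš", "feja"]

def like_by_bytes(word: str) -> bool:
    # Assumed model of LIKE BINARY '_ēj_': every '_' consumes exactly one byte
    # of the UTF-8 encoded value.
    pattern = b"." + re.escape("ēj".encode("utf-8")) + b"."
    return re.fullmatch(pattern, word.encode("utf-8"), re.DOTALL) is not None

def like_by_chars(word: str) -> bool:
    # Character-wise model: every '_' consumes one character, and 'ē' is
    # compared exactly (no accent folding), as a binary collation would do.
    return re.fullmatch(".ēj.", word, re.DOTALL) is not None

# Number of bytes each word occupies in UTF-8.
byte_lengths = {w: len(w.encode("utf-8")) for w in words}
byte_matches = [w for w in words if like_by_bytes(w)]
char_matches = [w for w in words if like_by_chars(w)]

print(byte_lengths)  # {'tēja': 5, 'vējš': 6, 'feja': 4}
print(byte_matches)  # ['tēja']
print(char_matches)  # ['tēja', 'vējš']
```

Under this model 'vējš' fails the byte-wise match because its final 'š' occupies two bytes, which is consistent with the single row the asker got back. Comparing by characters with exact letters, which is what COLLATE utf8_bin does, gives the two expected words.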


You have to use a proper collation.
I don't know about Latvian, but here is an example for German, to give you an idea: http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html

You could also try some of the Baltic collations.
