开发者

mysql accent insensitive and dotted insensitive search

The Problem: I am trying to implement a search algorithm that shows the results even when dotted chars are provided. In other words: SELECT 'über' = 'uber' or SELECT 'mas' = 'maş' these results will return true. This would apply for every single char in the following array:

$arr = array('ş' => 's', 'ç' => 'c', 'ö' => 'o', 'ü' => 'u' and so on ...);

The Solution In My Mind: Along with the original column, I can have a particular column that stores the English names. So before storing 'über' to database, I will also convert it to 'uber' in php and then will store both 'über' (as the original) and 'uber' (as the searchable) to the database.

But then, even though I've searched for this the whole day, I still believe that there should be a simplier and cleaner way to accomplish the task since this would mean (more or less) to store the same data twice in the database. So guys, what do you think is the solution the only way to go or you know a better approach?

EDIT

For accent insensitive I've seen the posts on SO, they are working but since I am also considering the dotted chars, I had to ask this question.

EDIT2

I cannot post the whole table structure and code exactly for some reasons but I'll provide a close example.

myusers | CREATE TABLE `myusers` (
id int auto_increment not null primary key,
email varchar(100) COLLATE latin1_general_ci not null,
fullname varchar(75) COLLATE latin1_general_ci not null)
PRIMARY KEY('id')
) ENGINE=MyISAM AUTO_INCREMENET=2 DEFAULT CHARSET=latin1 COLLATE latin1_general_ci |

The above is the structure of the table. Here comes the inserts and selects:

INSERT INTO myusers (fullname) VALUES ('Agüeda');
INSERT INTO myusers (fullname) VALUES ('Agueda');

SELECT * FROM myusers WHERE fullname = 'Agüeda' COLLATE latin1_general_ci 

+----+-------+----------+
| id | email | fullname |
+----+-------+----------+
|  1 |       | Agüeda   |
+----+-------+----------+
1 row in set (0.00 sec)

SELECT * FROM myusers WHERE fullname = 'agueda' COLLATE latin1_general_ci 

+----+-------+-------开发者_JAVA百科---+
| id | email | fullname |
+----+-------+----------+
|  2 |       | Agueda   |
+----+-------+----------+
1 row in set (0.00 sec)

Well, the desired result is obviously when agueda is searched both 'Agueda' and 'Agüeda' will return, but that's not the case. As I mentioned above, I have created a new column and store the whole name in English characters and make the search from there as well. But still, it costs me a two times search (because I am also searching from the original columns which rank higher in the search result). There should be a better way...


1) Write your own collation. latin1_general_diacriticinsensitive. I wouldn't even know where to begin, though :).

2) Use regex and character groups: /[uü]ber/

3) The Solution In Your Mind. I'd personally use this, since design is all about compromise and this is a simple solution with just a 100% space overhead. Granted, the space overhead might eventually turn into a speed overhead, especially with MySQL, but that's to worry about later. This is also very easy to undo if need be.


Just use an appropriate collation. For instance:

create table test(
    foo text
) collate = utf8_unicode_ci;
insert into test values('Agüeda');
insert into test values('Agueda');
select * from test where foo = 'Agueda';

This gives your two rows.


Take a look at this post: https://stackoverflow.com/questions/500826

He has just the opposite issue you're facing. Look at the WHERE clause in the selected answer. Probably you could just use the _ci suffix and it'll work.

Let us know how this is resolved.


Well, instead of trying to replace them and run the search the x-times, I'd suggest using the mysql function LIKE i.e.

SELECT * FROM x WHERE search LIKE '%ber'

Where you have to replace the diacritics with"%.

EDIT: My mistake % replaces any number of characters. Use _ for a single char.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜