Finding duplicate fields based on accent

OK, this is bugging me. I got a phonebook DB from a client where some of the results contain accented names, and by "some" I mean mainly the city or category field, which makes my query results look ridiculous.

DB Charset: UTF-8

for example:

CompanyName | City | etc...

DemoCompany | Hauptstraße 18 | Whatever

DemoCompany | Hauptstrabe 18 | Whatever

The DB has around 360k records, so manual checking is not an option. Does anyone have an idea how I can find the accented/unaccented values? Something like a duplicate column check...

EDIT: When I query the table, I get results for both; that is not the problem. The problem is that when I display the results, some are displayed with accents and some without.

EDIT:

CREATE TABLE `enc` (
  `company` varchar(255) DEFAULT NULL,
  `address` varchar(255) DEFAULT NULL,
  `postcode` varchar(255) DEFAULT NULL,
  `city` varchar(255) DEFAULT NULL,
  `Telefon1` varchar(255) DEFAULT NULL,
  `Telefon2` varchar(255) DEFAULT NULL,
  `Telefon3` varchar(255) DEFAULT NULL,
  `Telefon4` varchar(255) DEFAULT NULL,
  `Telefon5` varchar(255) DEFAULT NULL,
  `Branche1` varchar(255) DEFAULT NULL,
  `Branche2` varchar(255) DEFAULT NULL,
  `Branche3` varchar(255) DEFAULT NULL,
  `Branche4` varchar(255) DEFAULT NULL,
  `Branche5` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$


You can start with something like this, that will show if there are rows that are exact duplicates of each other (and their count):

SELECT 
    CompanyName, City, etc... 
  , COUNT(*) AS DuplicateCount
FROM 
    TableToCheck
GROUP BY
    CompanyName, City, etc...            -- all columns except the Primary Key
HAVING 
    COUNT(*) > 1

If you want to find only duplicate addresses, you do something like this:

SELECT 
    Address
  , COUNT(*) AS DuplicateCount
FROM 
    TableToCheck
GROUP BY
    Address                     
HAVING 
    COUNT(*) > 1

Reading your question again, I think I misunderstood what you are asking. If you don't want to find duplicates (as there are none) but rather want to find accented words (and perhaps replace them with unaccented ones):

The table you have now is probably using a case- and accent-insensitive collation (like utf8_general_ci or utf8_unicode_ci), so you could copy the table into a new one that has the same charset but a binary collation that distinguishes case and accents, like utf8_bin.
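As a rough sketch of that copy step against the `enc` table from the question: `enc_bin` is a made-up name for the copy, and the statements assume the columns currently use the default collation of the utf8 charset (utf8_general_ci). The last query does not need the copy at all. It groups `city` under the existing insensitive collation and counts how many byte-wise different spellings fall into each group.

```sql
-- enc_bin is a hypothetical name for the binary-collation copy.
CREATE TABLE enc_bin LIKE enc;
ALTER TABLE enc_bin CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
INSERT INTO enc_bin SELECT * FROM enc;

-- Alternative without a copy: values that compare equal under the
-- column's insensitive collation but differ under utf8_bin.
SELECT
    city
  , COUNT(DISTINCT city COLLATE utf8_bin) AS Spellings
  , GROUP_CONCAT(DISTINCT city COLLATE utf8_bin SEPARATOR ' | ') AS Variants
FROM
    enc
GROUP BY
    city
HAVING
    COUNT(DISTINCT city COLLATE utf8_bin) > 1;
```

This also reports values that differ only in letter case. Which spellings end up in the same group depends on the collation: utf8_general_ci compares 'ß' as 's', while utf8_unicode_ci compares it as 'ss', so neither will catch every variant.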

You could then create a list of accented characters and write a query that checks the fields of your new table against this list (this will be really slow, since a LIKE with a leading wildcard cannot use an index):

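-- NewTable is the utf8_bin copy; AccentedList holds one accented character per row.
-- "field" stands for the column to check (e.g. city).
SELECT DISTINCT nt.*
FROM NewTable AS nt
  JOIN AccentedList AS al
    ON nt.field LIKE CONCAT('%', al.AccentedChar, '%')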

or run a query to REPLACE() those characters, like 'ß' with 'ss' for example.
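A possible shape for the REPLACE() route, again using `enc` and its `address` column from the question. This is only a sketch and assumes the connection character set is UTF-8. The `COLLATE utf8_bin` in the WHERE clause is there so that only rows that really contain 'ß' are matched, and the SELECT is meant as a preview to run before the UPDATE.

```sql
-- Preview the rows that would change.
SELECT
    address
  , REPLACE(address, 'ß', 'ss') AS Normalized
FROM
    enc
WHERE
    address COLLATE utf8_bin LIKE '%ß%';

-- Apply the change once the preview looks right (back up the table first).
UPDATE enc
SET address = REPLACE(address, 'ß', 'ss')
WHERE address COLLATE utf8_bin LIKE '%ß%';
```

Each additional character needs its own REPLACE() call; the calls can be nested if several substitutions should happen in one pass.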


You have to consider not only accents but also many other equivalent spellings:

  • in German you can write 'ß' as 'ss', 'ä' as 'ae', 'ü' as 'ue' and so on
  • in Italian and French you can search for letters without the accent, but the accent is also sometimes replaced by an apostrophe (e.g., giocherò written as giochero' in Italian)

You could write a function that compares the strings without considering these differences, or you could try to match using a function that relies on phonetic or string similarity.

Examples are (many databases implement them):

  • Soundex
  • Edit distance (e.g., Levenshtein)
  • Jaro-Winkler

MySQL has a built-in SOUNDEX() function; for the others you will have to define your own function (there are several examples on the web).

The results are not perfect but looking for similar entries will help a manual check.
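For a quick feel of what SOUNDEX() does in MySQL, a spot check could look like the following. The literals 'Meyer' and 'Mayer' and the choice of the `company` column are only for illustration. Keep in mind that the MySQL manual describes SOUNDEX() as intended for English and not reliable with multibyte character sets such as UTF-8, so accented input may need to be transliterated first.

```sql
-- Returns 1 when both spellings map to the same Soundex code.
SELECT SOUNDEX('Meyer') = SOUNDEX('Mayer') AS SameSound;

-- SOUNDS LIKE is shorthand for SOUNDEX(a) = SOUNDEX(b).
-- The utf8_bin collation keeps every distinct spelling in the output.
SELECT DISTINCT
    company COLLATE utf8_bin AS Spelling
FROM
    enc
WHERE
    company SOUNDS LIKE 'Meyer';
```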


I'm pretty sure this is a case for a phonetic search. You could create a temporary (possibly in-memory) table, insert the phonetic equivalent of each row into it, then count how many are duplicates. This works very well for names (Meyer, Mayer) as well as streets (Straße, Strasse).
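The temporary-table idea might be sketched like this, with MySQL's SOUNDEX() standing in for the phonetic function. That choice is an assumption: the answer does not name an algorithm, and one designed for German may suit this data better. `phon_keys`, `SoundKey`, `TotalRows` and `Spellings` are made-up names.

```sql
-- Store a phonetic key next to each address from the question's enc table.
CREATE TEMPORARY TABLE phon_keys AS
SELECT
    SOUNDEX(address) AS SoundKey
  , address
FROM
    enc
WHERE
    address IS NOT NULL;

-- Keys shared by more than one byte-wise distinct spelling.
SELECT
    SoundKey
  , COUNT(*) AS TotalRows
  , COUNT(DISTINCT address COLLATE utf8_bin) AS Spellings
FROM
    phon_keys
GROUP BY
    SoundKey
HAVING
    COUNT(DISTINCT address COLLATE utf8_bin) > 1;
```

SOUNDEX() ignores digits, so addresses that differ only in the house number share a key. Treat the output as a list of candidates for review, not as confirmed duplicates.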
