开发者

MySQL SELECT DISTINCT with tolerance of variation

in my database I have a lot of entries that are very similar but not identical. For instance just two characters might be different such as:

Row1: "The weather is nice, see http://xyz56.com"

Row2: "The weather is nice, see http://xyz31.com"

I would like to get rid of these partial duplicates and just receive one result for these two rows. It does not matter which one it is, I would suggest to use the first one that appears.

Is there any feature I could utilize fr开发者_运维技巧om MySQL to do this efficiently? My first thought was to pull more data and do a comparism on the string, if the matching characters are over some threshold than ignore it. Downside is I will never know how many entries I have to pull from the database and it also is quiet inefficient since I have to compare each row with all the other rows (O(n²)).

Update: To be more specific on the use cases: The position of the variance is not always at the end of the string and it might also be more than just 2 characters that change. The string length varies with each row.


My suggestion would be to use Levenshtein distance, which is a measure for string similarity. To get MySQL to compute this directly, you will have to implement it in a stored procedure, an example is over here: http://www.artfulsoftware.com/infotree/queries.php#552.

There are also common implementations for PHP and Java.


You can use SOUNDEX.

SOUNDEX(str)
Returns a soundex string from str. Two strings that sound almost the same should have identical soundex strings. A standard soundex string is four characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string. All non-alphabetic characters in str are ignored. All international alphabetic characters outside the A-Z range are treated as vowels.

mysql> SELECT SOUNDEX('Hello');
+---------------------------------------------------------+
| SOUNDEX('Hello')                                        |
+---------------------------------------------------------+
| H400                                                    |
+---------------------------------------------------------+
1 row in set (0.00 sec)

Source: http://www.tutorialspoint.com/mysql/mysql-string-functions.htm#operator_sounds-like

For example, in Oracle PL/SQL your strings have the same SOUNDEX and are SOUNDEX indistinguishable:

select soundex ('The weather is nice, see http://xyz56.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ56.COM')
-----------------------------------------------
T362                                           
1 row selected.

select soundex ('The weather is nice, see http://xyz31.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ31.COM')
-----------------------------------------------
T362                                           
1 row selected.


SELECT * FROM test GROUP BY SUBSTR(mytext, 1, 10);


Levenshtein Distance Algorithm for MySQL:

Please see: Implementation of Levenshtein distance for mysql/fuzzy search?

Levenshtein distance The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation may be insertion, deletion or substitution of one character. Jason Rust published this MySQL algorithm for it at http://www.codejanitor.com/wp/.

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END; 

Helper function:

CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END; 

Levenshtein Distance Algorithm: Oracle PL/SQL Implementation

SOURCE: http://www.merriampark.com/ldplsql.htm

CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
  (p_source_string   IN VARCHAR2,
   p_target_string   IN VARCHAR2)
  RETURN                NUMBER
  DETERMINISTIC
AS
  v_length_of_source    NUMBER := NVL (LENGTH (p_source_string), 0);
  v_length_of_target    NUMBER := NVL (LENGTH (p_target_string), 0);
  TYPE mytabtype IS     TABLE OF NUMBER INDEX BY BINARY_INTEGER;
  column_to_left        mytabtype;
  current_column        mytabtype;
  v_cost                NUMBER := 0;
BEGIN
  IF v_length_of_source = 0 THEN
    RETURN v_length_of_target;
  ELSIF v_length_of_target = 0 THEN
    RETURN v_length_of_source;
  ELSE
    FOR j IN 0 .. v_length_of_target LOOP
      column_to_left(j) := j;
    END LOOP;
    FOR i IN 1.. v_length_of_source LOOP
      current_column(0) := i;
      FOR j IN 1 .. v_length_of_target LOOP
        IF SUBSTR (p_source_string, i, 1) =
           SUBSTR (p_target_string, j, 1)
        THEN v_cost := 0;
        ELSE v_cost := 1;
        END IF;
        current_column(j) := LEAST (current_column(j-1) + 1,
                                    column_to_left(j) + 1,
                                    column_to_left(j-1) + v_cost);
      END LOOP;
      FOR j IN 0 .. v_length_of_target  LOOP
        column_to_left(j) := current_column(j);
      END LOOP;
    END LOOP;
  END IF;
  RETURN current_column(v_length_of_target);
END ld;

If you suppose to have a table called EMPLOYEES with a column named FIRST_NAME of type VARCHAR2, you can find easily the records which have Levenshtein Distance = 1 as follows:

SELECT *
  FROM employees alfa
 WHERE EXISTS
          (SELECT 'X'
             FROM employees beta
            WHERE ld (beta.first_name, alfa.first_name) = 1);

With this query you can show, in every row of the result set, the list of the first_name with Levenshtein Distance = 1:

SELECT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) = 1;

An example:

SELECT DISTINCT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) <= 2
          AND ld (b.first_name, a.first_name) > 0;

FIRST_NAME;FIRST_NAME_1
Jean;John
Nancy;Vance
Alana;Allan
Alana;Clara
Ellen;Eleni
John;Jean
Daniel;Danielle
Danielle;Daniel
Shelley;Shelli
Sundita;Nandita
Lisa;Luis
Stephen;Steven
Nanette;Janette
Diana;Alana
TJ;Ki
Luis;Lisa
Sarath;Sarah
Louise;Luis
Ki;TJ
Allan;Ellen
Luis;Louise
Den;Lex
Clara;Alana
Matthew;Mattea
Shelli;Shelley
Sarah;Sarath
Girard;Gerald
Vance;Nancy
Mattea;Martha
Allan;Alana
Nandita;Sundita
Ellen;Allan
Jean;Den
Eleni;Ellen
Gerald;Girard
Lex;Den
Janette;Nanette
Steven;Stephen
Mattea;Matthew
Den;Jean
Martha;Mattea
Alana;Diana
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜