MySQL SELECT DISTINCT with tolerance of variation

2023-03-12 15:51 问答作者：

in my database I have a lot of entries that are very similar but not identical. For instance just two characters might be different such as:

Row1: "The weather is nice, see http://xyz56.com"

Row2: "The weather is nice, see http://xyz31.com"

I would like to get rid of these partial duplicates and just receive one result for these two rows. It does not matter which one it is, I would suggest to use the first one that appears.

Is there any feature I could utilize fr开发者_运维技巧om MySQL to do this efficiently? My first thought was to pull more data and do a comparism on the string, if the matching characters are over some threshold than ignore it. Downside is I will never know how many entries I have to pull from the database and it also is quiet inefficient since I have to compare each row with all the other rows (O(n²)).

Update: To be more specific on the use cases: The position of the variance is not always at the end of the string and it might also be more than just 2 characters that change. The string length varies with each row.

My suggestion would be to use Levenshtein distance, which is a measure for string similarity. To get MySQL to compute this directly, you will have to implement it in a stored procedure, an example is over here: http://www.artfulsoftware.com/infotree/queries.php#552.

There are also common implementations for PHP and Java.

You can use SOUNDEX.

SOUNDEX(str)
Returns a soundex string from str. Two strings that sound almost the same should have identical soundex strings. A standard soundex string is four characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string. All non-alphabetic characters in str are ignored. All international alphabetic characters outside the A-Z range are treated as vowels.

mysql> SELECT SOUNDEX('Hello');
+---------------------------------------------------------+
| SOUNDEX('Hello')                                        |
+---------------------------------------------------------+
| H400                                                    |
+---------------------------------------------------------+
1 row in set (0.00 sec)

Source: http://www.tutorialspoint.com/mysql/mysql-string-functions.htm#operator_sounds-like

For example, in Oracle PL/SQL your strings have the same SOUNDEX and are SOUNDEX indistinguishable:

select soundex ('The weather is nice, see http://xyz56.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ56.COM')
-----------------------------------------------
T362                                           
1 row selected.

select soundex ('The weather is nice, see http://xyz31.com') from dual;

SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ31.COM')
-----------------------------------------------
T362                                           
1 row selected.

SELECT * FROM test GROUP BY SUBSTR(mytext, 1, 10);

Levenshtein Distance Algorithm for MySQL:

Please see: Implementation of Levenshtein distance for mysql/fuzzy search?

Levenshtein distance The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation may be insertion, deletion or substitution of one character. Jason Rust published this MySQL algorithm for it at http://www.codejanitor.com/wp/.

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END;

Helper function:

CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END;

Levenshtein Distance Algorithm: Oracle PL/SQL Implementation

SOURCE: http://www.merriampark.com/ldplsql.htm

CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
  (p_source_string   IN VARCHAR2,
   p_target_string   IN VARCHAR2)
  RETURN                NUMBER
  DETERMINISTIC
AS
  v_length_of_source    NUMBER := NVL (LENGTH (p_source_string), 0);
  v_length_of_target    NUMBER := NVL (LENGTH (p_target_string), 0);
  TYPE mytabtype IS     TABLE OF NUMBER INDEX BY BINARY_INTEGER;
  column_to_left        mytabtype;
  current_column        mytabtype;
  v_cost                NUMBER := 0;
BEGIN
  IF v_length_of_source = 0 THEN
    RETURN v_length_of_target;
  ELSIF v_length_of_target = 0 THEN
    RETURN v_length_of_source;
  ELSE
    FOR j IN 0 .. v_length_of_target LOOP
      column_to_left(j) := j;
    END LOOP;
    FOR i IN 1.. v_length_of_source LOOP
      current_column(0) := i;
      FOR j IN 1 .. v_length_of_target LOOP
        IF SUBSTR (p_source_string, i, 1) =
           SUBSTR (p_target_string, j, 1)
        THEN v_cost := 0;
        ELSE v_cost := 1;
        END IF;
        current_column(j) := LEAST (current_column(j-1) + 1,
                                    column_to_left(j) + 1,
                                    column_to_left(j-1) + v_cost);
      END LOOP;
      FOR j IN 0 .. v_length_of_target  LOOP
        column_to_left(j) := current_column(j);
      END LOOP;
    END LOOP;
  END IF;
  RETURN current_column(v_length_of_target);
END ld;

If you suppose to have a table called EMPLOYEES with a column named FIRST_NAME of type VARCHAR2, you can find easily the records which have Levenshtein Distance = 1 as follows:

SELECT *
  FROM employees alfa
 WHERE EXISTS
          (SELECT 'X'
             FROM employees beta
            WHERE ld (beta.first_name, alfa.first_name) = 1);

With this query you can show, in every row of the result set, the list of the first_name with Levenshtein Distance = 1:

SELECT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) = 1;

An example:

SELECT DISTINCT a.first_name, b.first_name
  FROM    employees a
       INNER JOIN
          employees b
       ON ld (b.first_name, a.first_name) <= 2
          AND ld (b.first_name, a.first_name) > 0;

FIRST_NAME;FIRST_NAME_1
Jean;John
Nancy;Vance
Alana;Allan
Alana;Clara
Ellen;Eleni
John;Jean
Daniel;Danielle
Danielle;Daniel
Shelley;Shelli
Sundita;Nandita
Lisa;Luis
Stephen;Steven
Nanette;Janette
Diana;Alana
TJ;Ki
Luis;Lisa
Sarath;Sarah
Louise;Luis
Ki;TJ
Allan;Ellen
Luis;Louise
Den;Lex
Clara;Alana
Matthew;Mattea
Shelli;Shelley
Sarah;Sarath
Girard;Gerald
Vance;Nancy
Mattea;Martha
Allan;Alana
Nandita;Sundita
Ellen;Allan
Jean;Den
Eleni;Ellen
Gerald;Girard
Lex;Den
Janette;Nanette
Steven;Stephen
Mattea;Matthew
Den;Jean
Martha;Mattea
Alana;Diana

继续阅读：distinct

MySQL SELECT DISTINCT with tolerance of variation

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？