MySQL SELECT DISTINCT with tolerance of variation
in my database I have a lot of entries that are very similar but not identical. For instance just two characters might be different such as:
Row1: "The weather is nice, see http://xyz56.com"
Row2: "The weather is nice, see http://xyz31.com"
I would like to get rid of these partial duplicates and just receive one result for these two rows. It does not matter which one it is, I would suggest to use the first one that appears.
Is there any feature I could utilize fr开发者_运维技巧om MySQL to do this efficiently? My first thought was to pull more data and do a comparism on the string, if the matching characters are over some threshold than ignore it. Downside is I will never know how many entries I have to pull from the database and it also is quiet inefficient since I have to compare each row with all the other rows (O(n²)).
Update: To be more specific on the use cases: The position of the variance is not always at the end of the string and it might also be more than just 2 characters that change. The string length varies with each row.
My suggestion would be to use Levenshtein distance, which is a measure for string similarity. To get MySQL to compute this directly, you will have to implement it in a stored procedure, an example is over here: http://www.artfulsoftware.com/infotree/queries.php#552.
There are also common implementations for PHP and Java.
You can use SOUNDEX.
SOUNDEX(str)
Returns a soundex string from str. Two strings that sound almost the same should have identical soundex strings. A standard soundex string is four characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string. All non-alphabetic characters in str are ignored. All international alphabetic characters outside the A-Z range are treated as vowels.
mysql> SELECT SOUNDEX('Hello');
+---------------------------------------------------------+
| SOUNDEX('Hello') |
+---------------------------------------------------------+
| H400 |
+---------------------------------------------------------+
1 row in set (0.00 sec)
Source: http://www.tutorialspoint.com/mysql/mysql-string-functions.htm#operator_sounds-like
For example, in Oracle PL/SQL your strings have the same SOUNDEX and are SOUNDEX indistinguishable:
select soundex ('The weather is nice, see http://xyz56.com') from dual;
SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ56.COM')
-----------------------------------------------
T362
1 row selected.
select soundex ('The weather is nice, see http://xyz31.com') from dual;
SOUNDEX('THEWEATHERISNICE,SEEHTTP://XYZ31.COM')
-----------------------------------------------
T362
1 row selected.
SELECT * FROM test GROUP BY SUBSTR(mytext, 1, 10);
Levenshtein Distance Algorithm for MySQL:
Please see: Implementation of Levenshtein distance for mysql/fuzzy search?
Levenshtein distance The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation may be insertion, deletion or substitution of one character. Jason Rust published this MySQL algorithm for it at http://www.codejanitor.com/wp/.
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
Helper function:
CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN
SET max_len = s1_len;
ELSE
SET max_len = s2_len;
END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END;
Levenshtein Distance Algorithm: Oracle PL/SQL Implementation
SOURCE: http://www.merriampark.com/ldplsql.htm
CREATE OR REPLACE FUNCTION ld -- Levenshtein distance
(p_source_string IN VARCHAR2,
p_target_string IN VARCHAR2)
RETURN NUMBER
DETERMINISTIC
AS
v_length_of_source NUMBER := NVL (LENGTH (p_source_string), 0);
v_length_of_target NUMBER := NVL (LENGTH (p_target_string), 0);
TYPE mytabtype IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
column_to_left mytabtype;
current_column mytabtype;
v_cost NUMBER := 0;
BEGIN
IF v_length_of_source = 0 THEN
RETURN v_length_of_target;
ELSIF v_length_of_target = 0 THEN
RETURN v_length_of_source;
ELSE
FOR j IN 0 .. v_length_of_target LOOP
column_to_left(j) := j;
END LOOP;
FOR i IN 1.. v_length_of_source LOOP
current_column(0) := i;
FOR j IN 1 .. v_length_of_target LOOP
IF SUBSTR (p_source_string, i, 1) =
SUBSTR (p_target_string, j, 1)
THEN v_cost := 0;
ELSE v_cost := 1;
END IF;
current_column(j) := LEAST (current_column(j-1) + 1,
column_to_left(j) + 1,
column_to_left(j-1) + v_cost);
END LOOP;
FOR j IN 0 .. v_length_of_target LOOP
column_to_left(j) := current_column(j);
END LOOP;
END LOOP;
END IF;
RETURN current_column(v_length_of_target);
END ld;
If you suppose to have a table called EMPLOYEES with a column named FIRST_NAME of type VARCHAR2, you can find easily the records which have Levenshtein Distance = 1 as follows:
SELECT *
FROM employees alfa
WHERE EXISTS
(SELECT 'X'
FROM employees beta
WHERE ld (beta.first_name, alfa.first_name) = 1);
With this query you can show, in every row of the result set, the list of the first_name with Levenshtein Distance = 1:
SELECT a.first_name, b.first_name
FROM employees a
INNER JOIN
employees b
ON ld (b.first_name, a.first_name) = 1;
An example:
SELECT DISTINCT a.first_name, b.first_name
FROM employees a
INNER JOIN
employees b
ON ld (b.first_name, a.first_name) <= 2
AND ld (b.first_name, a.first_name) > 0;
FIRST_NAME;FIRST_NAME_1
Jean;John
Nancy;Vance
Alana;Allan
Alana;Clara
Ellen;Eleni
John;Jean
Daniel;Danielle
Danielle;Daniel
Shelley;Shelli
Sundita;Nandita
Lisa;Luis
Stephen;Steven
Nanette;Janette
Diana;Alana
TJ;Ki
Luis;Lisa
Sarath;Sarah
Louise;Luis
Ki;TJ
Allan;Ellen
Luis;Louise
Den;Lex
Clara;Alana
Matthew;Mattea
Shelli;Shelley
Sarah;Sarath
Girard;Gerald
Vance;Nancy
Mattea;Martha
Allan;Alana
Nandita;Sundita
Ellen;Allan
Jean;Den
Eleni;Ellen
Gerald;Girard
Lex;Den
Janette;Nanette
Steven;Stephen
Mattea;Matthew
Den;Jean
Martha;Mattea
Alana;Diana
精彩评论