Remove duplicate words from a field in MySQL

2023-01-22 01:18

I am wondering whether it is possible to remove duplicate text from a single field using a MySQL query, or whether a problem like this is better solved in PHP.

I have a database where users enter tags that can be searched on. I have noticed that some tags have synonyms which I want to add to the field, but in some cases the synonym already exists and in others it does not. For example, after updating the tags I can end up with the following (the tags are separated by spaces only):

pool swimming pool ocean sea water
swim ocean sea water water swim
swimming pool swim swimming pool swim

Is there a way to eliminate duplicate text from the same field so I end up with this?

pool swimming ocean sea water swim
ocean sea water swim
pool swim swimming

The model you describe (all tags in a single cell, separated by spaces) is not normalized, so you can't expect to find a simple, performant and reliable way to manipulate it on the database server beyond reading the column. As it stands, PHP is your most practical option for the cleanup you are planning, and you'll have to retrieve every row.

Is it too late to make a small change to the database design? If you store each tag in a separate row of a tag table, you'll be able to do a lot more in plain SQL.

You may consider keeping one entry per tag instead of all tags as a string, so that you could do a select distinct among other things.

IMO, you're best off handling this in PHP:

$uniqueTags = array_unique(explode(' ', $tagsFromDbColumn));
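To get back a string you can write to the column, the array still has to be joined again. A minimal sketch, assuming the tags are separated by whitespace and that the first occurrence of each word should be the one that is kept; dedupeTags is a made-up name, not something from the question:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: returns $tags with repeated words removed,
// keeping the first occurrence of each word in its original position.
function dedupeTags(string $tags): string
{
    // Split on any run of whitespace so double spaces do not produce empty tags.
    $words = preg_split('/\s+/', trim($tags), -1, PREG_SPLIT_NO_EMPTY);

    // array_unique keeps the first occurrence; implode ignores the gaps left in the keys.
    return implode(' ', array_unique($words));
}

echo dedupeTags('pool swimming pool ocean sea water'), "\n"; // pool swimming ocean sea water
```

The comparison is case-sensitive, so "Pool" and "pool" would both survive; lowercase the input first if that matters for your tags.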

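Here is another version, in pure SQL. Generate a large enough set of numbers (1 to 100 here, so it handles up to 100 words per row) and join it to the table so that each row is repeated once per word. The nested SUBSTRING_INDEX calls pick out the n-th word, and GROUP_CONCAT with the DISTINCT modifier concatenates the separate words again without the repeats.

Grouping by a primary or unique key would be better than grouping by col, in case two rows hold identical text.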

-- SEPARATOR ' ' gives space-separated output; the default separator is a comma
SELECT GROUP_CONCAT(DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(t.col, ' ', x.cifre), ' ', -1) SEPARATOR ' ') AS words
FROM t
INNER JOIN
(
    -- Numbers 1..100: one candidate word position per number
    SELECT 1 + a.i + b.i * 10 AS cifre
    FROM (SELECT 0 AS i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) a
    CROSS JOIN (SELECT 0 AS i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) b
) x
-- Keep only positions up to the row's word count (number of spaces + 1)
ON (LENGTH(t.col) + 1 - LENGTH(REPLACE(t.col, ' ', ''))) >= x.cifre
GROUP BY col
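If the nested SUBSTRING_INDEX calls look opaque, it may help to trace them outside the database. The sketch below imitates the function in PHP purely for illustration; substringIndex and nthWord are names I made up, and the imitation only covers what the query needs:

```php
<?php
declare(strict_types=1);

// PHP stand-in for MySQL's SUBSTRING_INDEX(str, delim, count), written only
// to illustrate the query; it is not a built-in PHP function.
function substringIndex(string $str, string $delim, int $count): string
{
    $parts = explode($delim, $str);
    // Positive count: everything left of the count-th delimiter.
    // Negative count: everything right of the count-th delimiter from the end.
    $slice = $count >= 0
        ? array_slice($parts, 0, $count)
        : array_slice($parts, $count);
    return implode($delim, $slice);
}

// The n-th word (1-based), extracted the same way the query does it:
// cut after word n, then keep only the last word of what is left.
function nthWord(string $str, int $n): string
{
    return substringIndex(substringIndex($str, ' ', $n), ' ', -1);
}

$col = 'pool swimming pool ocean sea water';
// Same word count expression as the ON clause: number of spaces + 1.
$wordCount = strlen($col) - strlen(str_replace(' ', '', $col)) + 1;

$words = [];
for ($n = 1; $n <= $wordCount; $n++) {
    $words[] = nthWord($col, $n);
}
echo implode(' ', array_unique($words)), "\n"; // pool swimming ocean sea water
```

One difference to keep in mind: array_unique keeps the words in first-seen order, whereas GROUP_CONCAT makes no promise about order unless you add an ORDER BY inside it.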


If it's a real option, change your database design. I don't know about your time constraints, so it may really not be an option, but consider which of these two paths you'd rather go down:

A couple of hours now redesigning the database, then writing, debugging and verifying a script that takes all the values from the existing layout and puts them in the new one.
Hours and hours later coming up with obscure queries for otherwise simple tasks, each of which would take ten minutes to write if the database were designed the way a relational database should be.
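For the first path, the core of the conversion script is small. Here is a rough sketch of the step that turns the existing strings into one row per tag. It assumes you have already read the old column into an array keyed by row id, and that the new tag table has an item_id and a tag column; both of those, and the function name, are my assumptions:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: turns [rowId => 'space separated tags'] into a list of
// [rowId, tag] pairs, one pair per distinct tag of each row, ready to be
// inserted into a tag table (assumed columns: item_id, tag).
function explodeToTagRows(array $tagsByRowId): array
{
    $rows = [];
    foreach ($tagsByRowId as $rowId => $tags) {
        // Split on runs of whitespace and ignore empty pieces.
        $words = preg_split('/\s+/', trim($tags), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $tag) {
            $rows[] = [$rowId, $tag];
        }
    }
    return $rows;
}

$rows = explodeToTagRows([
    1 => 'pool swimming pool ocean sea water',
    2 => 'swim ocean sea water water swim',
]);
// $rows now holds 5 pairs for row 1 and 4 pairs for row 2.
```

Each pair would then go into the new table with a prepared INSERT. A unique key on (item_id, tag) would stop duplicates from creeping back in later.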

If it's really not an option though...

Let Sentence = the string of words.
Split Sentence up on every space and build an array out of it*. Store this as Words.
Let UniqueWords = an array of words with no duplicates.
For each Word in Words:
     If the Word is not in UniqueWords, put it in.

*a la PHP explode

You could also process it as a raw string (stopping to check at spaces or EOL), which may be faster, but if speed is important, your current database design should be far more concerning than this loop.
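In PHP, the loop above combined with the raw-string idea could look roughly like this. It is only a sketch: the function name is invented, and it assumes whitespace is the only separator. Using array keys as the "already seen" set avoids rescanning UniqueWords for every word:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: walks the string token by token instead of building
// the full word array first, using array keys as a set of words already seen.
function dedupeTagsStreaming(string $tags): string
{
    $seen = [];
    $result = '';
    $delimiters = " \t\r\n";
    // strtok splits on any of the delimiter characters and skips empty tokens.
    for ($word = strtok($tags, $delimiters); $word !== false; $word = strtok($delimiters)) {
        if (!isset($seen[$word])) {
            $seen[$word] = true;
            $result .= ($result === '' ? '' : ' ') . $word;
        }
    }
    return $result;
}

echo dedupeTagsStreaming('swim ocean sea water water swim'), "\n"; // swim ocean sea water
```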

EDIT: I didn't see that you wanted it in a SQL query. I'm not sure it'd be possible using a query; perhaps a stored procedure will do. I don't know how to use those though.

+1 redesign, but if redesign is not an option now...

How many distinct tags are there? You might be able to do this using CASE and substring functions.

http://dev.mysql.com/doc/refman/5.0/en/case-statement.html

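Try this. DECLARE and WHILE only work inside a stored program, so the statements are wrapped in a procedure here (dedupe_words is just a placeholder name):

DELIMITER //
-- dedupe_words is a placeholder name
CREATE PROCEDURE dedupe_words()
BEGIN
    DECLARE word VARCHAR(100);
    DECLARE i, wordcount INT DEFAULT 1;
    SET word = 'pool swimming pool ocean sea water';
    -- TEMPORARY makes sure a permanent table of the same name is never dropped
    DROP TEMPORARY TABLE IF EXISTS dupliword;
    CREATE TEMPORARY TABLE dupliword (wordname VARCHAR(100));
    -- Word count = number of spaces + 1
    SET wordcount = LENGTH(word) - LENGTH(REPLACE(word, ' ', '')) + 1;
    WHILE i <= wordcount DO
        -- Insert the i-th space-separated word
        INSERT INTO dupliword (wordname)
        VALUES (SUBSTRING_INDEX(SUBSTRING_INDEX(word, ' ', i), ' ', -1));
        SET i = i + 1;
    END WHILE;
    -- SEPARATOR ' ' replaces the default comma, so no REPLACE is needed
    SELECT GROUP_CONCAT(DISTINCT wordname SEPARATOR ' ') AS words FROM dupliword;
END //
DELIMITER ;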
