开发者

Deleting certain rows based on the count of a column in the mysql database

I have a very large database with about 120 Million records in one table.I have clean up the data in this table first before I divide it into several tables(possibly normalizing it). The columns of this table is as follows: "id(Primary Key), userId, Url, Tag " . This is basically a subset of the dataset from delicious website. As I said, each row has an id, userID a url and only "one" tag. So for example a bookmark in delicious website is composed of several tags for a single url, this corresponds to several lines of my database. for example: "id"; "user" ;"url" ;"tag" "38";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"ajax" "39";"12c2763095ec44e498f870ed67ee948d";"ht开发者_开发百科tp://forkjavascript.org/";"api" "40";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"javascript" "41";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"library" "42";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"rails"

If I want to see the number of tags for each "distinct" url I run the below query.

SELECT DISTINCT url,tag,COUNT(tag) as "TagCount" FROM urltag GROUP BY url

Now I want to delete the records that have less than 5 tags associated with their urls. Does anyone knows the actual query that I have to run? thanks


delete from urltag where url in (SELECT DISTINCT url FROM urltag GROUP BY url HAVING count(tag) < 5)

should do it. but your request doesn't specifically take into account that a few different userIds could have submitted the same url...


You don't need SELECT DISTICT url, ... when you do GROUP BY url. I'd rewrite your query like this: from

SELECT DISTINCT url,tag,COUNT(tag) as "TagCount" FROM urltag GROUP BY url

to

SELECT url, COUNT(tag) as "TagCount" FROM urltag GROUP BY url

Placing tag column in the select clause will not provide useful data. If a column is not mentioned in the GROUP BY clause, the values returned for it will be random, tipically min(tag).

So if you want to remove all rows containing urls for which less than 5 tags were associated, you can do this:

You can add a flag to your table, such as:

alter table urltag 
    add column todelete tinyint(4) not null default 0,
    add key(todelete);

Then you can do

update urltag u 
inner join (
    SELECT url, count(tag) tagcount 
    FROM urltag GROUP BY url
    )big on big.url = t.url
set t.todelete = 1
where big.tagcount < 5;

Then, just

delete from urltag where todelete = 1;
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜