Deleting certain rows based on the count of a column in the mysql database
I have a very large database with about 120 Million records in one table.I have clean up the data in this table first before I divide it into several tables(possibly normalizing it). The columns of this table is as follows: "id(Primary Key), userId, Url, Tag " . This is basically a subset of the dataset from delicious website. As I said, each row has an id, userID a url and only "one" tag. So for example a bookmark in delicious website is composed of several tags for a single url, this corresponds to several lines of my database. for example: "id"; "user" ;"url" ;"tag" "38";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"ajax" "39";"12c2763095ec44e498f870ed67ee948d";"ht开发者_开发百科tp://forkjavascript.org/";"api" "40";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"javascript" "41";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"library" "42";"12c2763095ec44e498f870ed67ee948d";"http://forkjavascript.org/";"rails"
If I want to see the number of tags for each "distinct" url I run the below query.
SELECT DISTINCT url,tag,COUNT(tag) as "TagCount" FROM urltag GROUP BY url
Now I want to delete the records that have less than 5 tags associated with their urls. Does anyone knows the actual query that I have to run? thanks
delete from urltag where url in (SELECT DISTINCT url FROM urltag GROUP BY url HAVING count(tag) < 5)
should do it. but your request doesn't specifically take into account that a few different userIds could have submitted the same url...
You don't need SELECT DISTICT url, ... when you do GROUP BY url. I'd rewrite your query like this: from
SELECT DISTINCT url,tag,COUNT(tag) as "TagCount" FROM urltag GROUP BY url
to
SELECT url, COUNT(tag) as "TagCount" FROM urltag GROUP BY url
Placing tag
column in the select clause will not provide useful data. If a column is not mentioned in the GROUP BY clause, the values returned for it will be random, tipically min(tag).
So if you want to remove all rows containing urls for which less than 5 tags were associated, you can do this:
You can add a flag to your table, such as:
alter table urltag
add column todelete tinyint(4) not null default 0,
add key(todelete);
Then you can do
update urltag u
inner join (
SELECT url, count(tag) tagcount
FROM urltag GROUP BY url
)big on big.url = t.url
set t.todelete = 1
where big.tagcount < 5;
Then, just
delete from urltag where todelete = 1;
精彩评论