开发者

Removing redundant data from mysql

I have a table in sql which looks like:

        1   1w10    U   UROK_HUMAN  IIGGEFTTIENQPWFAAIYRRHRGGSVTYVCGGSLISPCWVISATHCFID...   
        2   1w11    U   UROK_HUMAN  IIGGEFTTIENQPWFAAIYRRHRGGSVTYVCGGSLISPCWVISATHCFID...   
        3   1w12    U   UROK_HUMAN  IIGGEFTTIENQPWFAAIYRRHRGGSVTYVCGGSLISPCWVISATHCFID...   
        4   1w13    U   UROK_HUMAN  IIGGEFTTIENQPWFAAIYRRHRGGSVTYVCGGSLISPCWVISATHCFID...   
        5   1w14    U   UROK_HUMAN  IIGGEFTTIENQPWFAAIYRRHRGGSVTYVCGGSLISPCWVISATHCFID...
        6  开发者_高级运维 1w15    A   SYT4_RAT    GSPGISGGGGGIPSGRGELLVSLCYQSTTNTLTVVVLKARHLPKSDVSGL...   
        7   1w16    A   SYT4_RAT    GSPGISGGGGGIPSGRGELLVSLCYQSTTNTLTVVVLKARHLPKSDVSGL...   
        8   1w17    B   PDAA_BACSU  MKWMCSICCAAVLLAGGAAQAEAVPNEPINWGFKRSVNHQPPDAGKQLNS...   
        10  1w18    B   SACB_ACEDI  AGVPGFPLPSIHTQQAYDPQSDFTARWTRADALQIKAHSDATVAAGQNSL...   
        12  1w18    E   SACB_ACEDI  MKGGAGVPDLPSLDASGVRLAIVASSWHGKICDALLDGARKVAAGCGLDD...

I want to remove duplicate entries but leaving one of them. For instance, I want to keep the first row but remove 2,3,4,5. In short, I want to remove rows which have same column 4 value (here, UROK-HUMAN) but have different col2 and col3 values( here 1w10, 1w11 etc (col2) and U,A,B(col3)).

However, I do not want to remove entry have same col2 and col3 (1w18 B-E) which have same col4 value(SACB-ACEDI).

How can I write and sql statement to delete those rows? I tried to write like and did not work:

SELECT pdb, chain, unp, sekans, COUNT(*) AS ct
FROM protein
JOIN (SELECT DISTINCT(unp)
          FROM protein 
          GROUP by pdb) protein2 ON protein2.unp = protein.unp;

Thank you very much for your help.


Consider the alternative route instead: selecting those unique rows and inserting them into a temp table, then drop the old one and rename the new one. This circumvents the limitations on deleting from a table you select on, and it makes it far easier to test wether the results are correct.

INSERT INTO newtable SELECT min(pdb), chain, unp, sekans
FROM protein
GROUP by chain, unp, sekans

Note that if you have other columns that could have a different value in different rows (like the pdb), you should use an aggregate function (like min, max, sum, group_concat), or else the value that mysql will use for the new row will be undefined.


I'm not writing it for you but I'll tell how to do that. First write a SELECT query that would return all the IDs (I guess it is the first column, right?) you want to delete.

Then write a DELETE statement that would delete all of the rows with those IDs. Something like:

DELETE from protein where pdb in
(SELECT pdb from protein #here_goes_the_query_im_not_writing#)

So, in short, you first get all the IDs you want to delete and then you tell the DBMs to delete those IDs. That's all.

EDIT: Just adding a possible SQL to get all the duplicated rows but one. Not tested.

SELECT pdb FROM protein WHERE
pdb not in (
    SELECT pdb FROM (
        SELECT sekans, pdb FROM protein
        GROUP BY sekans)
    as T);
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜