Refining data stored in SQLite - how to join several contacts?

2022-12-31 07:48 问答作者：

I'm storing contacts between different elements. I want to eliminate elements of certain type and store new contacts of elements which were interconnected by the eliminated element.

Problem background

Imagine this problem. You have a water molecule which is in contact with other molecules (if the contact is a hydrogen bond, there can be 4 other molecules around my water). Like in the following picture (A, B, C, D are some other atoms and dots mean the contact).

 A   B
 |   |
 H   H
  . .
   O
  / \
 H   H
 .   .
 C   D

I have the information about all the dots and I need to eliminate the water in the center and create records describing contacts of A-C, A-D, A-B, B-C, B-D, and C-D.

Database structure

Currently, I have the following structure in the database:

Table atoms:

"id" integer PRIMARY KEY,
"amino" char(3) NOT NULL, (HOH for water or other value)
other columns identifying the atom

Table contacts:

"acceptor_id" integer NOT NULL, (the atom near to my hydrogen, here C or D)
"donor_id" integer NOT NULL, (here A or B)
"directness" char(1) NOT NULL, (this should be D for direct and W for water-mediated)
other columns about the contact, such as the distance

EDIT: How would look the data in the case depicted earlier.

atoms:

id|atom|amino
1 | O  | HOH
2 | N  | ARG  <- atom A from image
3 | S  | CYS  <- B 
4 | O  | SER  <- C
5 | N  | ARG  <- D

contacts:

donor_id|acceptor_id|directness
1        4           D
1        5           D
2        1           D
3        1           D

From which I need to make

contacts:

donor_id|acceptor_id|directness
3        4           W            <- B-C
3        5           W            <- B-D
2        4           W            开发者_如何学运维<- A-C
2        5           W            <- A-D
2        3           X            <- A-B    (These last two rows are escaping me,
4        5           X            <- C-D     there could be also row D-C, but not
                                             both C-D and D-C. A char 'X' could 
                                             be used to mark "no donor/acceptor")

Current solution (insufficient)

Now, I'm going through all the contacts which have donor.amino = "HOH". In this sample case, this would select contacts from C and D. For each of these selected contacts, I look up contacts having the same acceptor_id as is the donor_id in the currently selected contact. From this information, I create the new contact. At the end, I delete all contacts to or from HOH.

This way, I am obviously unable to create C-D and A-B contacts (the other 4 are OK).

If I try a similar approach - trying to find two contacts having the same donor_id, I end up with duplicate contacts (C-D and D-C).

Is there a simple way to retrieve all six contacts without duplicates?

I'm dreaming about some one page long SQL query which retrieves just these six wanted rows. :-)

However, any other ideas are welcome.

It is preferable to conserve information about who is donor (where possible), but not strictly necessary.

Big thanks to all of you who read this question to this point.

There's one difficulty with your explanation.

What you start with is a directed graph where each edge represents a connection X=>Y where X is a donor and Y an acceptor. The table atoms is the SQL representation of that graph.

What you seem to want is something that is undirected. So that a link X-Y means that X and Y are linked via a water molecule (or some other species of course) but that X and Y could both be donors or acceptors. For this reason your last table has an ambiguity (that you note) so that some links could occur either way around. It seems to me that means that the column headings donor_id and acceptor_id on your final column don't have any meaning that you have explained. This may be my confusion of course.

If all you want is a table with all 6 links in it, each as one row, but don't worry too much about keeping track of the donor/acceptor thing then this works for me in sqlite3:

 create temporary view hoh_view as 
 select donor_id as id, atoms.id as hoh_id from contacts, atoms 
       where acceptor_id=atoms.id and atoms.amino='HOH' 
 union select acceptor_id as id, atoms.id as hoh_id from contacts, atoms 
       where donor_id=atoms.id and atoms.amino='HOH';

 select a.id, b.id from hoh_view as a, hoh_view as b 
       where a.id > b.id and a.hoh_id=b.hoh_id;

Where I have used a temporary view to make things clearer. You can put this all into one big query if you like by replacing each reference to hoh_view by the first query. It feels a bit nasty to me and there may be a way of tidying it up.

If you do want to keep track of donor/acceptor relationships you need to explain how you decide what to do when both amino acids are acceptors or donors (i.e. the last two rows in your example).

If that doesn't do what you want, then maybe I can fix it up so it does.

Well, its hard to provide examples in comments, I decided to post an answer:

If you have to following original data, there is no way to distinguish data from the first structure from those of the second. There should be an additional grouping condition to eleminate directions between the first and the second structure.

sqlite> create table atoms (id INT, atom TEXT, amino TEXT);
sqlite> insert into atoms VALUES (1, 'O', 'HOH');
sqlite> insert into atoms VALUES (2, 'A', 'ARG');
sqlite> insert into atoms VALUES (3, 'B', 'CYS');
sqlite> insert into atoms VALUES (4, 'C', 'SER');
sqlite> insert into atoms VALUES (5, 'D', 'ARG');
sqlite> insert into atoms VALUES (6, 'O1', 'HOH');
sqlite> insert into atoms VALUES (7, 'A1', 'ARG');
sqlite> insert into atoms VALUES (8, 'B1', 'CYS');
sqlite> insert into atoms VALUES (9, 'C1', 'SER');
sqlite> insert into atoms VALUES (10, 'D1', 'ARG');
sqlite> select * from atoms;
1|O|HOH
2|A|ARG
3|B|CYS
4|C|SER
5|D|ARG
6|O1|HOH
7|A1|ARG
8|B1|CYS
9|C1|SER
10|D1|ARG

UPD

Here is the original data:

sqlite> .headers ON
sqlite> .mode columns
sqlite> select * from atoms;
id          atom        amino
----------  ----------  ----------
1           O           HOH
2           A           ARG
3           B           CYS
4           C           SER
5           D           ARG
6           O1          HOH
7           A1          ARG
8           B1          CYS
9           C1          SER
10          D1          ARG
sqlite> select * from contacts;
donor_id    acceptor_id  directness
----------  -----------  ----------
1           4            D
1           5            D
2           1            D
3           1            D
6           9            D
6           10           D
7           6            D
8           6            D

Here is the query:

select
    c1.donor_id, c2.acceptor_id, 'W' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.acceptor_id = c2.donor_id
    and c1.acceptor_id=a.id
    and a.amino='HOH'
UNION ALL
select
    c1.donor_id, c2.donor_id, 'X' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.acceptor_id = c2.acceptor_id
    and c1.acceptor_id=a.id
    and a.amino='HOH'
    and c1.donor_id < c2.donor_id
UNION ALL
select
    c1.acceptor_id, c2.acceptor_id, 'X' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.donor_id = c2.donor_id
    and c1.donor_id=a.id
    and a.amino='HOH'
    and c1.acceptor_id < c2.acceptor_id;

Here is the result:

donor_id    acceptor_id  directness
----------  -----------  ----------
2           4            W
2           5            W
3           4            W
3           5            W
7           9            W
7           10           W
8           9            W
8           10           W
2           3            X
7           8            X
4           5            X
9           10           X

继续阅读：algorithm bioinformatics python sql

Refining data stored in SQLite - how to join several contacts?

Problem background

Database structure

Current solution (insufficient)

Is there a simple way to retrieve all six contacts without duplicates?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Problem background

Database structure

Current solution (insufficient)

Is there a simple way to retrieve all six contacts without duplicates?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？