Counting chars in sequences via SQL
I have a database with a sequence table. Each (amino acid) sequence in this table comprises of 20 different chars (A, V, ...). For instance "MQSHAMQCASQALDLYD...".
I would like to count the number of appearance of each char, so that I get something like "2xM, 3xQ, ...".
Furthermore, I would like to do this over all sequences in my DB, so I get the overall appearance o开发者_高级运维f each char. ("248xM, 71x W,...").
How can I do this in PostgreSQL? At the moment, I am doing it with Ruby, but I have 25,000 sequences with a length of about 400 chars each. This takes a while and I hope it will be faster with SQL.
This is How to find all A's in a string:
select length(regexp_replace('AAADDD', '[^A]', '', 'g'));
This is how to find all A's in a table:
select sum(length(regexp_replace(field, '[^A]', '', 'g'))) from table;
I usually find defining a reusable function more flexible:
CREATE OR REPLACE FUNCTION numch(text, text) RETURNS integer AS
$$
SELECT length($2) - length(replace($2, $1, ''))
$$ LANGUAGE SQL;
And then you can use it like:
SELECT sum(numch('A', colname)) AS nA, sum(numch('V', colname)) AS nV
FROM relname;
To get results like:
na | nv
--------+--------
359541 | 296330
(1 row)
Adopted from this psql-performance thread.
精彩评论