PL/SQL Comparing bit strings, I need an AND operation?
I currently have a script that calculates the tanimoto coefficient o开发者_StackOverflow中文版n the fingerprints of a chemical library. However during testing I found my implementation could not be feasibly scaled up due to my method of comparing the bit strings (It just takes far too long). See below. This is the loop I need to improve. I've simplified this so it is just looking at two structures the real script does permutations about the dataset of structures, but that would over complicate the issue I have here.
LOOP
-- Find the NA bit
SELECT SUBSTR(qsar_kb.fingerprint.fingerprint, var_fragment_id ,1) INTO var_na FROM qsar_kb.fingerprint where project_id = 1 AND structure_id = 1;
-- FIND the NB bit
SELECT SUBSTR(qsar_kb.fingerprint.fingerprint, var_fragment_id ,1) INTO var_nb FROM qsar_kb.fingerprint where project_id = 1 AND structure_id = 2;
-- Test for both bits the same
IF var_na > 0 AND var_nb > 0 then
var_tally := var_tally + 1;
END IF;
-- Test for bit in A on and B off
IF var_na > var_nb then
var_tna := var_tna + 1;
END IF
-- Test for bit in B on and A off.
IF var_nb > var_na then
var_tnb := var_tnb + 1;
END IF;
var_fragment_id := var_fragment_id + 1;
EXIT WHEN var_fragment_id > var_maxfragment_id;
END LOOP;
For a simple example
Structure A = '101010'
Structure B = '011001'
In my real data set the length of the binary is 500 bits and up. I need to know:
1)The number of bits ON common to Both
2)The number of bits ON in A but off in B
3)The number of bits ON in B but off in B
So in this case
1) = 1
2) = 2
3) = 2
Ideally I want to change how I'm doing this. I don't want to be steeping though each bit in each string its just too time expensive when I scale the whole system up with thousands of structures each with fingerprint bit strings in the length order of 500-1000
My logic to fix this would be to:
Take the total number of bits ON in both
A) = 3
B) = 3
Then perform an AND operation and find how many bits are on in both = 1
Then just subtract this from the totals to find the number of bits on in one but not the other.
So how can I perform an AND like operation on two strings of 0's and 1's to find the number of common 1's?
Check out the BITAND function.
The BITAND function treats its inputs and its output as vectors of bits; the output is the bitwise AND of the inputs.
However, according to the documentation, this only works for 2^128
You should move the SELECT out of the loop. I'm pretty sure you're spending 99% of the time selecting 1 bit 500 times where you could select 500 bits in one go and then loop through the string:
DECLARE
l_structure_a LONG;
l_structure_b LONG;
var_na VARCHAR2(1);
var_nb VARCHAR2(1);
BEGIN
SELECT MAX(decode(structure_id, 1, fingerprint)),
MAX(decode(structure_id, 2, fingerprint))
INTO l_structure_a, l_structure_b
FROM qsar_kb.fingerprint
WHERE project_id = 1
AND structure_id IN (1,2);
LOOP
var_na := substr(l_structure_a, var_fragment_id, 1);
var_nb := substr(l_structure_b, var_fragment_id, 1);
-- Test for both bits the same
IF var_na > 0 AND var_nb > 0 THEN
var_tally := var_tally + 1;
END IF;
-- Test for bit in A on and B off
IF var_na > var_nb THEN
var_tna := var_tna + 1;
END IF;
-- Test for bit in B on and A off.
IF var_nb > var_na THEN
var_tnb := var_tnb + 1;
END IF;
var_fragment_id := var_fragment_id + 1;
EXIT WHEN var_fragment_id > var_maxfragment_id;
END LOOP;
END;
Edit: You could also do it in a single SQL statement:
SQL> WITH DATA AS (
2 SELECT '101010' fingerprint,1 project_id, 1 structure_id FROM dual
3 UNION ALL SELECT '011001', 1, 2 FROM dual),
4 transpose AS (SELECT ROWNUM fragment_id FROM dual CONNECT BY LEVEL <= 1000)
5 SELECT COUNT(CASE WHEN var_na = 1 AND var_nb = 1 THEN 1 END) nb_1,
6 COUNT(CASE WHEN var_na > var_nb THEN 1 END) nb_2,
7 COUNT(CASE WHEN var_na < var_nb THEN 1 END) nb_3
8 FROM (SELECT to_number(substr(struct_a, fragment_id, 1)) var_na,
9 to_number(substr(struct_b, fragment_id, 1)) var_nb
10 FROM (SELECT MAX(decode(structure_id, 1, fingerprint)) struct_a,
11 MAX(decode(structure_id, 2, fingerprint)) struct_b
12 FROM DATA
13 WHERE project_id = 1
14 AND structure_id IN (1, 2))
15 CROSS JOIN transpose);
NB_1 NB_2 NB_3
---------- ---------- ----------
1 2 2
I'll sort of expand on the answer from Lukas with a little bit more information.
A little bit of an internet search revealed code from Tom Kyte (via Jonathan Lewis) to convert between bases. There is a function to_dec
which will take a string and convert it to a number. I have reproduced the code below:
Convert base number to decimal:
create or replace function to_dec(
p_str in varchar2,
p_from_base in number default 16) return number
is
l_num number default 0;
l_hex varchar2(16) default '0123456789ABCDEF';
begin
for i in 1 .. length(p_str) loop
l_num := l_num * p_from_base + instr(l_hex,upper(substr(p_str,i,1)))-1;
end loop;
return l_num;
end to_dec;
Convert decimal to base number:
create or replace function to_base( p_dec in number, p_base in number )
return varchar2
is
l_str varchar2(255) default NULL;
l_num number default p_dec;
l_hex varchar2(16) default '0123456789ABCDEF';
begin
if ( trunc(p_dec) <> p_dec OR p_dec < 0 ) then
raise PROGRAM_ERROR;
end if;
loop
l_str := substr( l_hex, mod(l_num,p_base)+1, 1 ) || l_str;
l_num := trunc( l_num/p_base );
exit when ( l_num = 0 );
end loop;
return l_str;
end to_base;
This function can be called to convert the string bitmap into a number which can then be used with bitand. An example of this would be:
select to_dec('101010', 2) from dual
Oracle only really provides BITAND
(and BIT_TO_NUM
which isn't really relevant here` as a way of doing logical operations but the operations required here are (A AND B), (A AND NOT B) and (NOT A AND B). So we need a var of converting A to NOT A. A simple way of doing this is to use translate.
So.... the final outcome is:
select
length(translate(to_base(bitand(data_A, data_B),2),'10','1')) as nb_1,
length(translate(to_base(bitand(data_A, data_NOT_B),2),'10','1')) as nb_2,
length(translate(to_base(bitand(data_NOT_A, data_B),2),'10','1')) as nb_3
from (
select
to_dec(data_A,2) as data_A,
to_dec(data_b,2) as data_B,
to_dec(translate(data_A, '01', '10'),2) as data_NOT_A,
to_dec(translate(data_B, '01', '10'),2) as data_NOT_B
from (
select '101010' as data_A, '011001' as data_B from dual
)
)
This is somewhat more complicated than I was hoping when I started writing this answer but it does seem to work.
Can be done pretty simply with something like this:
SELECT utl_raw.BIT_AND( t.A, t.B ) SET_IN_A_AND_B,
length(replace(utl_raw.BIT_AND( t.A, t.B ), '0', '')) SET_IN_A_AND_B_COUNT,
utl_raw.BIT_AND( t.A, utl_raw.bit_complement(t.B) ) ONLY_SET_IN_A,
length(replace(utl_raw.BIT_AND( t.A, utl_raw.bit_complement(t.B) ),'0','')) ONLY_SET_IN_A_COUNT,
utl_raw.BIT_AND( t.B, utl_raw.bit_complement(t.A) ) ONLY_SET_IN_A,
length(replace(utl_raw.BIT_AND( t.B, utl_raw.bit_complement(t.A) ),'0','')) ONLY_SET_IN_A_COUNT
FROM (SELECT '1111100000111110101010' A, '1101011010101010100101' B FROM dual) t
Make sure your bit string has an even length (just pad it with a zero if it has an odd length).
精彩评论