开发者

PL/SQL Comparing bit strings, I need an AND operation?

I currently have a script that calculates the tanimoto coefficient o开发者_StackOverflow中文版n the fingerprints of a chemical library. However during testing I found my implementation could not be feasibly scaled up due to my method of comparing the bit strings (It just takes far too long). See below. This is the loop I need to improve. I've simplified this so it is just looking at two structures the real script does permutations about the dataset of structures, but that would over complicate the issue I have here.

LOOP
-- Find the NA bit
SELECT SUBSTR(qsar_kb.fingerprint.fingerprint, var_fragment_id ,1) INTO var_na FROM qsar_kb.fingerprint where project_id = 1 AND structure_id = 1;

-- FIND the NB bit
SELECT SUBSTR(qsar_kb.fingerprint.fingerprint, var_fragment_id ,1) INTO var_nb FROM qsar_kb.fingerprint where project_id = 1 AND structure_id = 2;

-- Test for both bits the same
IF var_na > 0 AND var_nb > 0 then
var_tally := var_tally + 1;
END IF;

-- Test for bit in A on and B off
IF var_na > var_nb then
var_tna := var_tna + 1;
END IF

-- Test for bit in B on and A off.
IF var_nb > var_na then
var_tnb := var_tnb + 1;
END IF;

var_fragment_id := var_fragment_id + 1;
EXIT WHEN var_fragment_id > var_maxfragment_id;
END LOOP;

For a simple example

Structure A = '101010'

Structure B = '011001'

In my real data set the length of the binary is 500 bits and up. I need to know:

1)The number of bits ON common to Both

2)The number of bits ON in A but off in B

3)The number of bits ON in B but off in B

So in this case

1) = 1

2) = 2

3) = 2

Ideally I want to change how I'm doing this. I don't want to be steeping though each bit in each string its just too time expensive when I scale the whole system up with thousands of structures each with fingerprint bit strings in the length order of 500-1000

My logic to fix this would be to:

Take the total number of bits ON in both

A) = 3

B) = 3

Then perform an AND operation and find how many bits are on in both = 1

Then just subtract this from the totals to find the number of bits on in one but not the other.

So how can I perform an AND like operation on two strings of 0's and 1's to find the number of common 1's?


Check out the BITAND function.

The BITAND function treats its inputs and its output as vectors of bits; the output is the bitwise AND of the inputs.

However, according to the documentation, this only works for 2^128


You should move the SELECT out of the loop. I'm pretty sure you're spending 99% of the time selecting 1 bit 500 times where you could select 500 bits in one go and then loop through the string:

DECLARE
   l_structure_a LONG;
   l_structure_b LONG;
   var_na        VARCHAR2(1);
   var_nb        VARCHAR2(1);
BEGIN
   SELECT MAX(decode(structure_id, 1, fingerprint)), 
          MAX(decode(structure_id, 2, fingerprint))
     INTO l_structure_a, l_structure_b
     FROM qsar_kb.fingerprint
    WHERE project_id = 1
      AND structure_id IN (1,2);
   LOOP
      var_na := substr(l_structure_a, var_fragment_id, 1);
      var_nb := substr(l_structure_b, var_fragment_id, 1);

      -- Test for both bits the same
      IF var_na > 0 AND var_nb > 0 THEN
         var_tally := var_tally + 1;
      END IF;   
      -- Test for bit in A on and B off
      IF var_na > var_nb THEN
         var_tna := var_tna + 1;
      END IF;   
      -- Test for bit in B on and A off.
      IF var_nb > var_na THEN
         var_tnb := var_tnb + 1;
      END IF;

      var_fragment_id := var_fragment_id + 1;
      EXIT WHEN var_fragment_id > var_maxfragment_id;
   END LOOP;
END;

Edit: You could also do it in a single SQL statement:

SQL> WITH DATA AS (
  2     SELECT '101010' fingerprint,1 project_id, 1 structure_id FROM dual
  3     UNION ALL SELECT '011001', 1, 2 FROM dual),
  4  transpose AS (SELECT ROWNUM fragment_id FROM dual CONNECT BY LEVEL <= 1000)
  5  SELECT COUNT(CASE WHEN var_na = 1 AND var_nb = 1 THEN 1 END) nb_1,
  6         COUNT(CASE WHEN var_na > var_nb THEN 1 END) nb_2,
  7         COUNT(CASE WHEN var_na < var_nb THEN 1 END) nb_3
  8    FROM (SELECT to_number(substr(struct_a, fragment_id, 1)) var_na,
  9                 to_number(substr(struct_b, fragment_id, 1)) var_nb
 10             FROM (SELECT MAX(decode(structure_id, 1, fingerprint)) struct_a,
 11                           MAX(decode(structure_id, 2, fingerprint)) struct_b
 12                      FROM DATA
 13                     WHERE project_id = 1
 14                       AND structure_id IN (1, 2))
 15            CROSS JOIN transpose);

      NB_1       NB_2       NB_3
---------- ---------- ----------
         1          2          2


I'll sort of expand on the answer from Lukas with a little bit more information.

A little bit of an internet search revealed code from Tom Kyte (via Jonathan Lewis) to convert between bases. There is a function to_dec which will take a string and convert it to a number. I have reproduced the code below:

Convert base number to decimal:

create or replace function to_dec( 
  p_str in varchar2, 
  p_from_base in number default 16) return number
is
    l_num   number default 0;
    l_hex   varchar2(16) default '0123456789ABCDEF';
begin
    for i in 1 .. length(p_str) loop
        l_num := l_num * p_from_base + instr(l_hex,upper(substr(p_str,i,1)))-1;
    end loop;
    return l_num;
end to_dec;

Convert decimal to base number:

create or replace function to_base( p_dec in number, p_base in number ) 
return varchar2
is
    l_str   varchar2(255) default NULL;
    l_num   number  default p_dec;
    l_hex   varchar2(16) default '0123456789ABCDEF';
begin
    if ( trunc(p_dec) <> p_dec OR p_dec < 0 ) then
        raise PROGRAM_ERROR;
    end if;
    loop
        l_str := substr( l_hex, mod(l_num,p_base)+1, 1 ) || l_str;
        l_num := trunc( l_num/p_base );
        exit when ( l_num = 0 );
    end loop;
    return l_str;
end to_base;

This function can be called to convert the string bitmap into a number which can then be used with bitand. An example of this would be:

select to_dec('101010', 2) from dual

Oracle only really provides BITAND (and BIT_TO_NUM which isn't really relevant here` as a way of doing logical operations but the operations required here are (A AND B), (A AND NOT B) and (NOT A AND B). So we need a var of converting A to NOT A. A simple way of doing this is to use translate.

So.... the final outcome is:

select 
  length(translate(to_base(bitand(data_A, data_B),2),'10','1')) as nb_1,
  length(translate(to_base(bitand(data_A, data_NOT_B),2),'10','1'))  as nb_2,
  length(translate(to_base(bitand(data_NOT_A, data_B),2),'10','1'))  as nb_3
from (
  select 
    to_dec(data_A,2) as data_A, 
    to_dec(data_b,2) as data_B,
    to_dec(translate(data_A, '01', '10'),2) as data_NOT_A, 
    to_dec(translate(data_B, '01', '10'),2) as data_NOT_B
  from (
    select '101010' as data_A, '011001' as data_B from dual
  )
)

This is somewhat more complicated than I was hoping when I started writing this answer but it does seem to work.


Can be done pretty simply with something like this:

SELECT utl_raw.BIT_AND( t.A, t.B )                                                 SET_IN_A_AND_B,
       length(replace(utl_raw.BIT_AND( t.A, t.B ), '0', ''))                       SET_IN_A_AND_B_COUNT,
       utl_raw.BIT_AND( t.A, utl_raw.bit_complement(t.B) )                         ONLY_SET_IN_A,
       length(replace(utl_raw.BIT_AND( t.A, utl_raw.bit_complement(t.B) ),'0','')) ONLY_SET_IN_A_COUNT,
       utl_raw.BIT_AND( t.B, utl_raw.bit_complement(t.A) )                         ONLY_SET_IN_A,
       length(replace(utl_raw.BIT_AND( t.B, utl_raw.bit_complement(t.A) ),'0','')) ONLY_SET_IN_A_COUNT       
  FROM (SELECT '1111100000111110101010' A, '1101011010101010100101' B FROM dual) t

Make sure your bit string has an even length (just pad it with a zero if it has an odd length).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜