How can I quickly run a function over an array of every possible array of length L with given possible elements?
I have a function calc_dG that, for any array corresponding to a short DNA sequence (3 to 15 bases or so), gives me the binding energy of that sequence. Actually, it's just an array lookup: nndG is an array of binding energies for adjacent pairs of bases, so the binding energies can be calculated with nndG[4*S[:-1]+S[1:]] using an a,g,c,t -> 0,1,2,3 numeric encoding of sequences. This means the energies of many sequences can be calculated at once very quickly in numpy.
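For concreteness, here is a minimal sketch of that lookup applied to a whole batch of sequences at once (the nndG values below are placeholders, not real nearest-neighbour energies):

import numpy as np

# Hypothetical 16-entry table: one energy per ordered pair of bases.
nndG = np.arange(16, dtype=float)

def calc_dG(seqs):
    # seqs: (N, L) integer array using a,g,c,t -> 0,1,2,3;
    # each adjacent pair (seqs[:, i], seqs[:, i+1]) indexes into nndG.
    return nndG[4 * seqs[:, :-1] + seqs[:, 1:]].sum(axis=1)

print(calc_dG(np.array([[0, 1, 2, 3], [3, 2, 1, 0]])))  # agct, tcga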
I need to find, for a length L, every sequence that both fits some template and results in a binding energy value in a certain range.
This is very easy to do with iterators: just iterate through every possible array input, calculate the binding energy, and record the arrays that fall in the range. This, however, is far too slow when implemented in Python (for length 15 with 4 possible values per element, there are 4**15 ≈ 10^9 possible arrays). I could use Weave or some other way of dropping down to C, but I'd prefer to find an array-based solution that is simple and fast.
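The naive version being dismissed here would look roughly like this (using the calc_dG sketch above, with hypothetical bounds lo and hi):

import itertools
import numpy as np

def slow_search(L, lo, hi):
    hits = []
    for tup in itertools.product(range(4), repeat=L):  # 4**L iterations in pure Python
        S = np.array(tup)
        dG = nndG[4 * S[:-1] + S[1:]].sum()
        if lo <= dG <= hi:
            hits.append(S)
    return hits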
For example, if every element has the same possible values (e.g., [0,1,2,3]), then an array holding every possible length-L 1D array over those values can be generated with lambda L: indices(repeat([4],L)).reshape((L,-1)).transpose(); then I can just do calc_dG(result), and use result[results that are in the desired range] to get the arrays I want as a final result. This is much faster than using Python iterators, and likely almost as fast, if not faster, than using C iterators. Unfortunately, it doesn't work for arbitrary templates, and for longer sequences it will run out of memory, as it has to store every possible array in memory before calculating any values.
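Spelled out with hypothetical bounds lo and hi, that enumerate-then-filter approach looks like this (nndG as sketched above):

import numpy as np

L = 8  # small enough that all 4**L rows fit in memory
result = np.indices(np.repeat([4], L)).reshape((L, -1)).T  # shape (4**L, L): every sequence
dG = nndG[4 * result[:, :-1] + result[:, 1:]].sum(axis=1)  # vectorized calc_dG over all rows
lo, hi = 10.0, 12.0  # hypothetical target range
hits = result[(dG >= lo) & (dG <= hi)]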
Is there some way to do all of this without resorting to C?
If I understand your problem correctly, you are maximizing a function f(i_1, i_2, ..., i_n) over integers in the set {0, 1, 2, 3}.
You can use a combination of iteration and vectorized indexing.
import numpy as np
import itertools

def cartesian_chunked(n, n_items=4, chunk_dim=3):
    # Yield all n_items**n length-n index arrays in chunks of n_items**chunk_dim rows.
    # The first p = n - chunk_dim positions are iterated in Python; the last
    # q = chunk_dim positions are enumerated in one vectorized np.indices call.
    if n > chunk_dim:
        p = n - chunk_dim
        q = chunk_dim
        outer = itertools.product(*([range(n_items)] * (n - chunk_dim)))
    else:
        p = 0
        q = n
        def outer_iter():
            yield ()
        outer = outer_iter()

    chunk = np.zeros([n_items**q, n], dtype=int)
    chunk[:, p:] = np.indices(np.repeat([n_items], q)).reshape(q, -1).T
    for seq in outer:
        chunk[:, :p] = seq  # overwrite the prefix columns; the buffer is reused
        yield chunk

def compute_energy(indices):
    # Toy stand-in for calc_dG: each element contributes independently.
    base_energies = np.array([-1, 4, 8, 2.4])
    return base_energies[indices].sum(axis=1)
max_energy = -np.inf  # start below any reachable energy
max_config = None

# try out 4**10 ~ 1e6 combinations, in chunks of 4**8
for chunk in cartesian_chunked(n=10, n_items=4, chunk_dim=8):
    energies = compute_energy(chunk)
    j = np.argmax(energies)
    if energies[j] > max_energy:
        max_energy = energies[j]
        max_config = chunk[j].copy()  # copy! the chunk buffer is modified in place

print(max_energy)
print(max_config)
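To get back to the original question (collect every sequence whose energy falls in a range, rather than the single maximizer), one sketch is to filter each chunk and stack the survivors; lo and hi are hypothetical bounds, and compute_energy stands in for the real calc_dG lookup:

lo, hi = 20.0, 30.0  # hypothetical target range
hits = []
for chunk in cartesian_chunked(n=10, n_items=4, chunk_dim=8):
    dG = compute_energy(chunk)
    mask = (dG >= lo) & (dG <= hi)
    if mask.any():
        hits.append(chunk[mask].copy())  # copy: the chunk buffer is reused between yields
hits = np.vstack(hits) if hits else np.empty((0, 10), dtype=int)

A fixed template can be enforced the same way, by masking on the relevant columns of each chunk before (or after) the energy test.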