Bitwise permutation of multiple 64-bit values in parallel / combined

This question is NOT about "how do I do a bitwise permutation". We know how to do that; what we are looking for is a faster way with fewer CPU instructions, inspired by the bitslice implementation of S-boxes in DES.

To speed up some cipher code we want to reduce the number of permutation calls. The main cipher functions do multiple bitwise permutations based on lookup arrays; the permutation operations themselves are only bit shifts.

Our basic idea is to take multiple input values that need the same permutation and shift them in parallel: for example, if input bit 1 must be moved to output bit 6, do that move for all values in a single operation.

Is there any way to do this? We have no example code right now, because there is absolutely no idea how to accomplish this in a performant way.

The maximum value size we have on our platforms is 128 bits; the longest input value is 64 bits. The code must therefore be faster than doing the whole permutation 128 times.
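To make the idea concrete, here is a minimal sketch (not the asker's code; the type and function names are illustrative) of the "shift several values in parallel" trick, assuming a compiler that provides a 128-bit integer type such as GCC/Clang's unsigned __int128. Two 64-bit blocks are packed into one 128-bit word, and a single mask-and-shift moves input bit 1 to output bit 6 in both blocks at once:

```c
#include <stdint.h>

/* Hedged sketch: pack two 64-bit blocks into one 128-bit word
 * (GCC/Clang extension type) and move input bit 1 to output bit 6
 * in BOTH blocks with a single mask-and-shift. */
typedef unsigned __int128 u128;

static u128 pack(uint64_t lo, uint64_t hi)
{
    return ((u128)hi << 64) | (u128)lo;
}

static u128 move_bit_1_to_6(u128 x)
{
    /* the mask selects bit 1 of each 64-bit lane */
    u128 mask = ((u128)1 << 65) | ((u128)1 << 1);
    return (x & mask) << 5;  /* bit 1 -> bit 6 in both lanes at once */
}
```

Each source bit still needs its own mask-and-shift, but every one of them now operates on two (or more) blocks, halving the per-block instruction count.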

EDIT

Here is a simple 8-bit example of a permutation:

+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | <= Bits
+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | <= Input
+---+---+---+---+---+---+---+---+
| 3 | 8 | 6 | 2 | 5 | 1 | 4 | 7 | <= Output
+---+---+---+---+---+---+---+---+

The cipher makes use of multiple input keys. It's a block cipher, so the same pattern must be applied to all 64-bit blocks of the input.

As the permutations are the same for each input block, we want to process multiple input blocks in one step and combine the operations for multiple input sequences: instead of moving one bit 128 times per call, move 128 bits in one operation.

EDIT2

We can NOT use threads, as we have to run the code on embedded systems without threading support. For the same reason we have no access to external libraries, and we have to keep it plain C.

SOLUTION

After testing and playing with the given answers, we have done it the following way:

  • We put the single bits of 128 64-bit values into a uint128_t[64] array, so that word i collects bit i of all 128 values.
  • The permutation itself then just copies pointers.
  • After everything is done, we revert the first operation and get the 128 permuted values back.

Yes, it really is that simple. We tested this approach early in the project, but it was too slow; it turned out we had a bug in the test code.
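The steps above can be sketched in plain C. This is a hedged illustration, not the original code: it uses 64 blocks and uint64_t slice words instead of the 128-block uint128_t variant described above, and the transpose is written naively (it is its own inverse, so the same function converts in both directions):

```c
#include <stdint.h>

/* Sketch for 64 blocks at a time (the uint128_t variant for 128 blocks
 * works the same way). slice[i] collects bit i of every block. The bit
 * transpose is an involution: applying it twice restores the blocks. */
void to_slices(const uint64_t block[64], uint64_t slice[64])
{
    for (int i = 0; i < 64; ++i) {
        uint64_t s = 0;
        for (int b = 0; b < 64; ++b)
            s |= ((block[b] >> i) & 1) << b;
        slice[i] = s;
    }
}

/* Permuting all 64 blocks is now 64 whole-word copies: output bit
 * perm[i] of every block receives input bit i of every block. */
void permute_slices(const uint64_t in[64], uint64_t out[64],
                    const int perm[64])
{
    for (int i = 0; i < 64; ++i)
        out[perm[i]] = in[i];
}
```

With the data in slice form, a full 64-bit permutation of all 64 blocks costs 64 word copies instead of 64*64 single-bit moves; the transpose in and out is the price paid once per batch.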

Thank you all, for the hints and the patience.


You could make Stan's bit-by-bit code faster by using eight look-up tables mapping bytes to 64-bit words. To process a 64-bit word of input, split it into eight bytes, look up each byte in a different table, and OR the results. On my computer the latter is 10 times faster than the bit-by-bit approach for 32-bit permutations. Obviously, if your embedded system has little cache, then 16 kB of look-up tables may be a problem. If you process 4 bits at a time, you only need 16 look-up tables of 16*8 = 128 bytes each, i.e. 2 kB of look-up tables.

EDIT: The inner loop could look something like this:

#include <stdint.h>
#include <stddef.h>

/* map[b][v] holds the OR of output bits produced when input byte b has
 * the value v; building the tables is a one-time cost. */
void permute(uint64_t* input, uint64_t* output, size_t n, uint64_t map[8][256])
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t* p = (uint8_t*)(input+i);
        output[i] = map[0][p[0]] | map[1][p[1]] | map[2][p[2]] | map[3][p[3]]
            | map[4][p[4]] | map[5][p[5]] | map[6][p[6]] | map[7][p[7]];
    }
}
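For completeness, the eight tables can be generated once from a bit-level permutation array. A sketch under the assumption that perm[i] gives the output position of input bit i, and that the byte split in permute() above is little-endian (input byte b holds input bits 8*b .. 8*b+7):

```c
#include <stdint.h>

/* Sketch: build the eight byte-to-word tables from a bit-level
 * permutation. perm[i] is the output position of input bit i (0..63).
 * Assumes the little-endian byte split used by the permute() loop,
 * i.e. input byte b holds input bits 8*b .. 8*b+7. */
void build_tables(const int perm[64], uint64_t map[8][256])
{
    for (int b = 0; b < 8; ++b) {
        for (int v = 0; v < 256; ++v) {
            uint64_t out = 0;
            for (int bit = 0; bit < 8; ++bit) {
                if (v & (1 << bit))
                    out |= (uint64_t)1 << perm[8 * b + bit];
            }
            map[b][v] = out;
        }
    }
}
```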


I think you might be looking for a bit-slicing implementation. This is how the fastest DES-cracking implementations work. (Or it was before SSE instructions existed, anyway.)

The idea is to write your function in a "bit-wise" manner, representing each output bit as a Boolean expression over the input bits. Since each output bit depends only on the input bits, any function can be represented this way, even things like addition, multiplication, or S-box lookups.

The trick is to use the actual bits of a single register to represent a single bit from multiple input words.

I will illustrate with a simple four-bit function.

Suppose, for example, you want to take four-bit inputs of the form:

x3 x2 x1 x0

...and for each input, compute a four-bit output:

x2 x3 x2^x3 x1^x2

And you want to do this for, say, eight inputs. (OK, for four bits a lookup table would be fastest; this is just to illustrate the principle.)

Suppose your eight inputs are:

A = a3 a2 a1 a0
B = b3 b2 b1 b0
...
H = h3 h2 h1 h0

Here, a3 a2 a1 a0 represent the four bits of the A input, etc.

First, encode all eight inputs into four bytes, where each byte holds one bit from each of the eight inputs:

X3 =  a3 b3 c3 d3 e3 f3 g3 h3
X2 =  a2 b2 c2 d2 e2 f2 g2 h2
X1 =  a1 b1 c1 d1 e1 f1 g1 h1
X0 =  a0 b0 c0 d0 e0 f0 g0 h0

Here, a3 b3 c3 ... h3 are the eight bits of X3; it consists of the high bits of all eight inputs. X2 is the next bit from all eight inputs. And so on.

Now to compute the function eight times in parallel, you just do:

Y3 = X2;
Y2 = X3;
Y1 = X2 ^ X3;
Y0 = X1 ^ X2;

Now Y3 holds the high bits from all eight outputs, Y2 holds the next bit from all eight outputs, and so on. We just computed this function on eight different inputs using only four machine instructions!

Better yet, if our CPU is 32-bit (or 64-bit), we could compute this function on 32 (or 64) inputs, still using only four instructions.

Encoding the input and decoding the output to/from the "bit slice" representation takes some time, of course. But for the right sort of function, this approach offers massive bit-level parallelism and thus a massive speedup.

The basic assumption is that you have many inputs (like 32 or 64) on which you want to compute the same function, and that the function is neither too hard nor too easy to represent as a bunch of Boolean operations. (Too hard makes the raw computation slow; too easy makes the time dominated by the bit-slice encoding/decoding itself.) For cryptography in particular, where (a) the data has to go through many "rounds" of processing, (b) the algorithm is often in terms of bits munging already, and (c) you are, for example, trying many keys on the same data... It often works pretty well.
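As a concrete (hypothetical) rendering of the four-bit example above in C: encode eight 4-bit inputs into four slice bytes, evaluate the function once on the slices, then decode the eight outputs:

```c
#include <stdint.h>

/* Sketch of the 4-bit example above: eight 4-bit inputs are transposed
 * into four slice bytes; x[bit] collects that bit of all eight inputs. */
void slice_encode(const uint8_t in[8], uint8_t x[4])
{
    for (int bit = 0; bit < 4; ++bit) {
        uint8_t s = 0;
        for (int k = 0; k < 8; ++k)
            s |= ((in[k] >> bit) & 1) << k;
        x[bit] = s;
    }
}

/* The function itself, evaluated on all eight inputs at once:
 * y3 = x2, y2 = x3, y1 = x2^x3, y0 = x1^x2. */
void f_sliced(const uint8_t x[4], uint8_t y[4])
{
    y[3] = x[2];
    y[2] = x[3];
    y[1] = x[2] ^ x[3];
    y[0] = x[1] ^ x[2];
}

/* Transpose back: out[k] reassembles the four output bits of input k. */
void slice_decode(const uint8_t y[4], uint8_t out[8])
{
    for (int k = 0; k < 8; ++k) {
        uint8_t v = 0;
        for (int bit = 0; bit < 4; ++bit)
            v |= ((y[bit] >> k) & 1) << bit;
        out[k] = v;
    }
}
```

The four XOR/assignment instructions in f_sliced do the real work; the encode and decode transposes are the fixed overhead the bit-parallelism has to amortize.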


It seems difficult to do the permutation in only one call. A special case of your problem, reversing the bits of an integer, needs more than one 'call' (what do you mean by call?). See Bit Twiddling Hacks by Sean Eron Anderson for details on this example.

If your mapping pattern is not complicated, maybe you can find a fast way to calculate the answer. :) However, I don't know whether you will like this direct way:

#include <stdio.h>

unsigned char mask[8];

//map bit to position
//0 -> 2
//1 -> 7
//2 -> 5
//...
//7 -> 6
unsigned char map[8] = {
    2,7,5,1,4,0,3,6
};


int main()
{
    int i;

    //input:
    //--------------------
    //bit 7 6 5 4 3 2 1 0
    //--------------------
    //val 0 0 1 0 0 1 1 0
    //--------------------
    unsigned char input = 0x26;

    //so the output should be 0xA1:
    //    1 0 1 0 0 0 0 1
    unsigned char output;

    for(i=0; i<8; i++){ //initialize mask once
        mask[i] = 1<<i;
    }

    //do permutation
    output = 0;
    for(i=0; i<8; i++){
        output |= (input&mask[i])?mask[map[i]]:0;
    }

    printf("output=%x\n", output);
    return 0;
}


Your best bet would be to look into some type of threading scheme. Either use a message-passing system where you send each block to a fixed set of worker threads, or set up a pipeline of non-locking single-producer/single-consumer queues that perform multiple shifts in a "synchronous" manner. I say "synchronous" because a pipeline on a general-purpose CPU is not a truly synchronous pipeline operation like you would have on a fixed-function device; rather, for a given "slice" of time, each thread works on one stage of the multi-stage problem, and you "stream" the source data into and out of the pipeline.
