Is there any good way to "encode" binary data as plausible madeup words and back again?

To give you a very simple and bad example: the data is split into 4-bit groups. The 16 possible values correspond to the first 16 consonants. You add a random vowel to make it pronounceable. So "08F734F7" can become "ba lo ta ku fo go ta ka". You can join some syllables and add punctuation and capitalization, and it can become "Balo ta kufogo, Taka?", which looks like a plausible language.
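For concreteness, here is a minimal sketch of that toy scheme (the function name and the vowel rule are made up for illustration; decoding looks only at the consonants, so any vowel rule works):

#include <cstddef>
#include <cstdint>
#include <string>

std::string NibblesToSyllables(const uint8_t *data, std::size_t len)
{
  const char consonants[] = "bcdfghjklmnpqrst";  // the first 16 consonants
  const char vowels[]     = "aeiou";
  std::string out;
  for (std::size_t i = 0; i < len; ++i) {
    int hi = data[i] >> 4, lo = data[i] & 0x0F;  // two 4-bit values per byte
    out += consonants[hi]; out += vowels[(2 * i) % 5];     out += ' ';
    out += consonants[lo]; out += vowels[(2 * i + 1) % 5]; out += ' ';
  }
  return out;  // {0x08, 0xF7, ...} -> "ba le ta ki ..." (vowels will differ)
}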

Just to make it clear, I'm not trying to protect the binary data.

I want to use this after I compress and encrypt my (UTF-8) plain text diary. The resulting binary data should look pretty random. I need to convert this data into a plausible-looking language and be able to revert it back. I'm going to print the "language" on paper and make a custom book.

So what I'm looking for is the best method of converting random data to readable, plausible words. By best I mean the highest bits-to-letters ratio (while still making it look like a real language). In my example it's exactly 2 bits per letter, or 4 letters per byte.


FASCINATING question!

My best solution so far encodes 12 bits in 2 to 4 characters, giving 3 to 6 bits per letter. (Friday is not a good day to do the necessary maths on the uneven distribution of word lengths, so I haven't worked out the average bits per letter).

The idea is to work with "phonemes" that start with one or two consonants and end with one or two vowels. There are 21 consonants, and I feel that each one could be followed by an h, l, r, w or y and still look reasonable. So your phoneme starts with one of 126 consonant parts - b, bh, bl, br, bw, by, c, ch, cl, cr, ..., z, zh, zl, zr, zw, zy (admittedly, things like yy and zl look a bit weird, but it is a foreign language after all :) )

126 is so tantalisingly close to 128 that we could add t' and b' (for example) for the last two values - giving us a dictionary of 128 values, to store 7 bits. You could even replace yy with d' and zl with p' or whatever.

Similarly, the vowel portion could be a single vowel or a pair of vowels. I have eliminated aa, ii and uu because they look too weird to me (personal preference) even though they do occur in some real words (who decided "continuum" should be spelt that way anyway!). So that gives 27 possible vowel parts: a, e, i, o, u, ae, ai, ao, ..., ue, ui, uo.

27 is close to 32, so throw in 5 values using accented vowels (é, â, etc.). That gives us 5 bits with the added benefit of some sparse accenting.

So that's 12 bits in 2, 3 or 4 letters.
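As a rough sketch, the two tables could be built like this (which gaps get the apostrophe forms and which five accented vowels to use are my arbitrary choices):

#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

std::vector<std::string> makeOnsets() {
    const std::string consonants = "bcdfghjklmnpqrstvwxyz"; // the 21 consonants
    const std::string followers  = "hlrwy";
    std::vector<std::string> onsets;
    for (char c : consonants) {
        onsets.push_back(std::string(1, c));      // b, c, d, ...
        for (char f : followers)
            onsets.push_back({c, f});             // bh, bl, br, bw, by, ...
    }
    onsets.push_back("t'");                       // pad 126 up to 128
    onsets.push_back("b'");
    return onsets;                                // 128 entries -> 7 bits
}

std::vector<std::string> makeNuclei() {
    const std::string vowels = "aeiou";
    std::vector<std::string> nuclei;
    for (char v : vowels) nuclei.push_back(std::string(1, v));
    for (char a : vowels)
        for (char b : vowels)
            if (!(a == b && (a == 'a' || a == 'i' || a == 'u')))
                nuclei.push_back({a, b});         // vowel pairs minus aa, ii, uu
    for (const char* s : {"é", "â", "ô", "û", "ï"})
        nuclei.push_back(s);                      // pad 27 up to 32 -> 5 bits
    return nuclei;
}

// One phoneme per 12 bits: the low 7 bits pick the onset, the next 5 the vowels.
std::string encodePhoneme(unsigned bits12) {
    static const std::vector<std::string> onsets = makeOnsets();
    static const std::vector<std::string> nuclei = makeNuclei();
    return onsets[bits12 & 0x7F] + nuclei[(bits12 >> 7) & 0x1F];
}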

For more fun, if the next bit is a 1, insert a space 90% of the time (at random), or a punctuation mark the other 10% - but if the next bit is a 0, don't insert anything - just start the next phoneme. Capitalise the first letter after punctuation.

That should give you something like:

Bwaijou t'ei plo ku bhaproti! Llanoi proimlaroo jaévli.

Maybe someone can take it further.


Summary


This solution will let you use any of a large number of languages, including pronounceable nonsense, with a customizable efficiency. You can even create something that looks like grammatically correct but meaningless English or French (or, worse, something that shifts between the two like a drunken polyglot). The basic idea is to use your data to continually select paths from a context-free grammar until you run out of data.

Details


Add a string to the end of your input that doesn't occur anywhere inside it ("This is the end of my input, thank you very much" would be very unlikely to occur in a string of encrypted text, for example). You can do this without the marker string, but it makes things easier.

Treat your input as one very long integer, encoded low bit first. Obviously your machine won't be able to process such a big integer at once; every time the high byte becomes zero, just strip off the next byte's worth of values from your file and multiply them in.
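As a minimal sketch, assuming you simply load the whole file at once (the incremental byte-feeding above is an optimization on top of this), the big integer only needs to support two operations:

#include <cstdint>
#include <vector>

struct BigN {
    std::vector<uint8_t> bytes;               // bytes[0] is least significant

    bool positive() const {
        for (uint8_t b : bytes) if (b) return true;
        return false;
    }

    // n = n div p, returning n mod p (p is a small production count)
    unsigned divmod(unsigned p) {
        unsigned rem = 0;
        for (size_t i = bytes.size(); i-- > 0; ) {
            unsigned cur = rem * 256 + bytes[i];  // long division, high byte first
            bytes[i] = (uint8_t)(cur / p);
            rem = cur % p;
        }
        return rem;
    }
};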

Create your language as a context-free grammar. To avoid forgetting what the encoding is, you can print the grammar at the end of your book. Avoid ambiguity: if your grammar is ambiguous, you won't be able to decode. This is not hard; essentially, don't use the same terminal in two places, ensure that the concatenation of two terminals cannot form another terminal, and ensure that, reading the output, you can tell where the formatting symbols went.

Now, to take an integer and turn it into language, use the following pseudo-code that uses n to choose which production to take.

cur = grammar.root          // cur is a list of tokens
n = my input as one big integer
while (n > 0 || cur != grammar.root) {
    if (cur.first.isTerminalSymbol) {
        output cur.first
        cur.pop_first
        if (cur.isEmpty) {
            cur = grammar.root
        }
    } else {
        sym = cur.pop_first                        // take the non-terminal off first
        p = grammar[sym].numberOfProductions
        t = n mod p                                // t = low-order digit, base p
        n = n div p                                // shift right in base p
        cur.push_first(grammar[sym].production[t]) // splice in the chosen production's tokens
    }
}

To decode, use a standard parser generator like GNU bison, which should also help you avoid creating an ambiguous grammar.

Run the parser on the input. Now start n at 0. You can read off the production number taken at each step from the syntax tree generated by the parser. Then multiply n by the number of productions and add the production number to get the value of n after that particular input (working from the last production taken back to the first, since encoding peeled off the low-order digits first). As you fill up the lower byte of your machine word, shift it off into your decoded file. When you read your unique end phrase, stop decoding.
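A sketch of that reconstruction, under the assumption just stated that the recorded (production count, choice) pairs are replayed from the last production back to the first:

#include <cstdint>
#include <utility>
#include <vector>

// Multiply the byte-vector integer (low byte first) by p and add t.
void mulAdd(std::vector<uint8_t>& n, unsigned p, unsigned t) {
    unsigned carry = t;
    for (uint8_t& b : n) {
        unsigned cur = b * p + carry;
        b = (uint8_t)(cur & 0xFF);
        carry = cur >> 8;
    }
    while (carry) { n.push_back((uint8_t)(carry & 0xFF)); carry >>= 8; }
}

// Rebuild the input from the (production count, choice) pairs recorded
// while parsing, replayed in reverse order.
std::vector<uint8_t> rebuild(const std::vector<std::pair<unsigned, unsigned>>& choices) {
    std::vector<uint8_t> n;                   // ends up low byte first
    for (auto it = choices.rbegin(); it != choices.rend(); ++it)
        mulAdd(n, it->first, it->second);
    return n;                                 // write to the decoded file as-is
}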

Example 1


This will be clearer with an example or three.

My example simple language is as follows (non-terminals are capitalized). Note that because the terminals are large compared with the depth of the tree, it is not very efficient, but I think that having more terminals or making them shorter can give you any efficiency you want (up to the bits wasted per character by using n bits per character). One possible in-memory representation is sketched after the list.

  • S -> __capital Noun I-Verb Punct | __capital Noun T-Verb Noun Punct
  • Noun -> joe | sally | spot | the car | the cookie | the hose
  • I-Verb -> runs | lives | jumps | flies
  • T-Verb -> jumps over | eats | grows | spins
  • Punct -> . | ! | ?
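As a hypothetical in-memory shape for that grammar (the type names are mine; a token counts as terminal iff it has no entry in the map):

#include <map>
#include <string>
#include <vector>

using Token = std::string;
using Production = std::vector<Token>;

std::map<Token, std::vector<Production>> grammar = {
    {"S",      {{"__capital", "Noun", "I-Verb", "Punct"},
                {"__capital", "Noun", "T-Verb", "Noun", "Punct"}}},
    {"Noun",   {{"joe"}, {"sally"}, {"spot"}, {"the car"}, {"the cookie"}, {"the hose"}}},
    {"I-Verb", {{"runs"}, {"lives"}, {"jumps"}, {"flies"}}},
    {"T-Verb", {{"jumps over"}, {"eats"}, {"grows"}, {"spins"}}},
    {"Punct",  {{"."}, {"!"}, {"?"}}},
};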

You could just as easily do this with syllables as an expansion of verbs and nouns. You could also include noun-phrases and verb phrases to have adjectives etc. in your language. You will probably also want paragraph and chapter symbols that break down into appropriate subunits with formatting. The number of alternate choices at a certain level of the tree determines the average number of bits encoded by each symbol. __capital is an example of a formatting symbol that, on output, capitalizes the next word.

So, imagine that my input becomes the number 77. Then I would encode it as follows:

S goes to two things. 77 % 2 = 1. Quotient: 77 / 2 = 38.

Now our number is 38 and we are expanding __capital, Noun, T-Verb, Noun, Punct

First word is __capital which is a terminal symbol. Output __capital (setting the print routine to capitalize the next word).

Now expanding Noun. Noun has 6 options. 38 % 6 = 2. 38 / 6 = 6. We choose spot

Now expanding spot, T-Verb, Noun, Punct. Spot is terminal. Output spot. The printer, being in capital mode, writes "Spot" to the output file.

Now expanding T-Verb. Our number is 6. T-Verb has 4 options. 6 % 4 = 2. 6 / 4 = 1. So we choose "grows". In the next step we output grows to our file since it is a terminal.

Now expanding Noun, Punct. Noun has 6 options. Our number is 1. 1 % 6 = 1. 1 / 6 = 0. So we choose "sally", which is output on the next step.

Finally we are expanding Punct, which has 3 options. Our number is 0 (and will stay that way forever - this is why you append an end-of-text string to your input; otherwise your decoding would end with an infinite string of zeroes). We choose ".", which is output.

Now the current string to expand is empty so we set it back to the root "S". But since n is 0, the algorithm terminates.

Thus 77 has become "Spot grows sally."

Example 2


Things get more efficient if I replace my terminals with:

  • I-Verb -> IVS _space | IVS I-Verb
  • IVS -> IVSS Vowel
  • IVSS -> w | r
  • Vowel -> a | e | i | o | u | y
  • T-Verb -> TVS _space | TVS T-Verb
  • TVS -> TVSS Vowel
  • TVSS -> p | s
  • Noun -> NS _space | NS
  • NS -> NSS Vowel
  • NSS -> j | v

77 yields "Jo papa ja." under this encoding (and is really encoded by just the "Jo " and the fact that papa has 2 syllables. The extra would be a very small fraction in any book-length file.)

Example 3


Your example "08F734F7" would be "1000111101110011010011110111" in binary, which is "1110111100101100111011110001" when reversed and that is, 250793713 in decimal.

If I run that through the more compact grammar, I get:

250793713 % 2 = 1 n=125396856, S-> __capital Noun T-Verb Noun Punct
125396856 % 2 = 0 n=62698428,  Noun->NS _space-> NSS Vowel _space
62698428 % 2 = 0  n=31349214,  NSS->j
31349214 % 6 = 0  n=5224869,   Vowel->a
5224869 % 2 = 1   n=2612434,   T-Verb->TVS T-Verb->TVSS Vowel T-Verb
2612434 % 2 = 0   n=1306217,   TVSS->p
1306217 % 6 = 5   n=217702,    Vowel->y
217702 % 2 = 0    n=108851,    T-Verb->TVSS Vowel _space
108851 % 2 = 1    n=54425,     TVSS->s
54425 % 6 = 5     n=9070,      Vowel->y
9070 % 2 = 0      n=4535,      Noun->NSS Vowel _space
4535 % 2 = 1      n=2267       NSS->v
2267 % 6 = 5      n=377        Vowel->y
377 % 3 = 2       n=125        Punct->?
125 % 2 = 1       n=62         S->__capital Noun T-Verb Noun Punct
62 % 2 = 0        n=31         Noun->NSS Vowel _space
31 % 2 = 1        n=15         NSS->v
15 % 6 = 3        n=2          Vowel->o
2 % 2 = 0         n=1          T-Verb->TVSS Vowel _space
1 % 2 = 1         n=0          TVSS->p
                  n=0          Vowel _space Noun Punct -> "a ja." 

This yields: "Ja pysy vy? Vo pa ja." from 08F734F7 (note that my print routine removes spaces before punctuation)


This is an old question, but very interesting.

Once I wanted to do a similar conversion, but with other goals. GUIDs (UUIDs) are usually not eye-friendly, so I had to convert them to plausible words. The final system was based on the occurrence of an English letter after the two preceding ones. The table was built from a corpus of English sentences, and combinations that occurred too rarely were excluded. So the final table contained lines looking like

  ...
  (key:'_t'; next:'aehioruwy'),
  (key:'_u'; next:'lmnprst'),
  (key:'_w'; next:'aehiory'),
  (key:'ab'; next:'abeilorsuwy'),
  (key:'ac'; next:'_acehikloqrtuy'),
  ...

containing about 200-300 lines, where 'next' lists all the letters that can appear after the 'key' letters ('_' is the beginning or end of a word, depending on whether it appears in the key or in next).

The conversion process takes the current value, divides it by length(next), takes the remainder as the index of the next 'plausible' letter, and the quotient becomes the new current value. To avoid overly long words, there was a trick for explicitly ending words, used symmetrically when encoding and decoding (a code sketch follows at the end of this answer). This system could produce, for example, sequences like these (the input for each is a 128-bit guid/uuid):

Furepas_Wann_Hunkare_Rylacid_Makinuag_Dreem

Togo_Ragam_Omb_Bonsbe_Gonn_Eclecki_Op

or if we take some widely used guids, for example MS IWebBrowser2 {D30C1661-CDAF-11D0-8A3E-00C04FC9E26E}

Lakar_Rupplex_Waylagit_Munghim_Paddato_Molu 

("Lakar Rupplex" is a good human name for a browser, isn't it?)

As for density, this system gave about 3 bits per letter.
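A minimal sketch of the scheme as described, using a 64-bit value for brevity (a 128-bit guid needs a small bignum) and a tiny made-up stand-in table; the real 200-300 line table has an entry for every letter pair that can actually occur:

#include <cstdint>
#include <map>
#include <string>

std::string wordify(uint64_t n) {
    static const std::map<std::string, std::string> table = {
        {"__", "bdfghklmnprstw"},   // letters that may start a word
        {"_b", "aeiloru"},          // after a word-initial 'b'
        {"ba", "cdklmnrst_"},       // after "ba"; '_' here ends the word
        // ... stand-in; the real table covers every reachable pair
    };
    std::string out, key = "__";
    while (n > 0) {
        const std::string& next = table.at(key);
        char c = next[n % next.size()];  // remainder picks the letter
        n /= next.size();                // quotient becomes the new value
        if (c == '_') { out += ' '; key = "__"; }          // word boundary
        else          { out += c;   key = key.substr(1) + c; }
    }
    return out;
}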


I personally would use C++. For a program that does what you describe, I would make something like this:

void JumbleData(const void *src, int srcLen, char *dest)
{
  for(int i = 0; i < srcLen; i++)
  {
    unsigned char data = *((const unsigned char*)src + i);
    unsigned char lower  = data & 0x0F;
    unsigned char higher = (data & 0xF0) >> 4;

    *dest = 'a' + lower;  dest++;   // low nibble  -> one of 'a'..'p'
    *dest = 'a' + higher; dest++;   // high nibble -> one of 'a'..'p'
  }
  *dest = '\0';                     // terminate the output string
}

This breaks the src data into 4-bit sections, adds each to 'a' and puts it in the destination. You can then go through and add extra letters in between, but only as long as you have a sure way of reversing the process.

To make it a little less obvious I would use more than 4 bits at a time, but not an even 8 either. Here is an example with using chunks of 6 bits:

void AddData(char* &dest, unsigned char data);

void JumbleData(const void *src, int srcLen, char *dest)
{
  // Note: assumes srcLen is a multiple of 3.
  for(int i = 0; i < srcLen; i += 3)
  {
    unsigned char data0 = *((const unsigned char*)src+i);
    unsigned char data1 = *((const unsigned char*)src+i+1);
    unsigned char data2 = *((const unsigned char*)src+i+2);

    unsigned char chunk0 = data0 & 0x3F;                         // bits 0-5 of byte 0
    unsigned char chunk1 = (data0 >> 6) | ((data1 & 0x0F) << 2); // bits 6-7 of byte 0, 0-3 of byte 1
    unsigned char chunk2 = (data1 >> 4) | ((data2 & 0x03) << 4); // bits 4-7 of byte 1, 0-1 of byte 2
    unsigned char chunk3 = data2 >> 2;                           // bits 2-7 of byte 2

    AddData(dest, chunk0);
    AddData(dest, chunk1);
    AddData(dest, chunk2);
    AddData(dest, chunk3);
  }
}

void AddData(char* &dest, unsigned char data)
{
  const char vowels[] = {'a', 'e', 'i', 'o'};
  char letter1 = 'a' + (data & 0x0F);         // bits 0-3 -> 'a'..'p'
  char letter2 = vowels[(data & 0x0C) >> 2];  // bits 2-3 again, as a vowel
  char letter3 = 'n' + ((data & 0x30) >> 4);  // bits 4-5 -> 'n'..'q' (the original
                                              // (data & 0x3C) >> 2 could run past 'z')
  *dest = letter1;
  dest++;
  *dest = letter2;
  dest++;
  *dest = letter3;
  dest++;
  *dest = ' ';
  dest++;
}

This would make a jumble of 3 letter words for each chunk of 6 bits.
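With the letter3 masking fixed as above, a hypothetical inverse for each 3-letter word (letter1 carries bits 0-3 and letter3 carries bits 4-5, so the vowel in the middle is redundant and can be ignored):

unsigned char DecodeWord(const char *word)
{
  unsigned char low  = word[0] - 'a';   // bits 0-3
  unsigned char high = word[2] - 'n';   // bits 4-5
  return (unsigned char)(low | (high << 4));
}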


You could do a simple substitution algorithm with a fixed conversion table that varies based on the place (power) of the digit in the original number. Weight the values in the conversion tables so that vowels and certain consonants are more common. Pick a base large enough to give variety across the places. E.g. (hex-based data):

value | place  
      | 0 1 2 ...
------|------ - - -
  0   | a a a ...
  1   | e e e
  2   | i i i
  3   | o o q
  4   | u u w
  5   | y q r
  6   | q w f
  7   | w r g
  8   | r f h
  9   | t g j
  A   | p h k
  B   | s j c
  C   | d k v
  D   | f l b
  E   | g z n
  F   | h x m ...

(This could also be done with simple well chosen formulas for each column...)

So,

B4B => "suc"
3AA => "ohk"
F62 => "iwm"
...

Extend this out to enough columns to control the distribution of all the characters well. If your source data does not have a cleanly random distribution you may want to mix up the order from column to column as well. Notice how some characters exist in every column, and some only exist once. Also the vowel to consonant frequency can be tweaked by changing the average ratio in every column.

Take large fixed size chunks of the data and run them through the converter, then apply a spacing/punctuation/capitalization algorithm.

(There is no guarantee that you won't end up with an all-consonant or extremely low-vowel-count word, but you can have the capitalization algorithm render it in all caps so it looks like an acronym/initialism.)
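A minimal sketch of the three columns shown in the table above, reading hex digits left to right (columns wrap back to place 0 past the last one; digits assumed uppercase):

#include <cstddef>
#include <string>

std::string Substitute(const std::string& hex)
{
  static const char* table[3] = {
    "aeiouyqwrtpsdfgh",   // place 0
    "aeiouqwrfghjklzx",   // place 1
    "aeiqwrfghjkcvbnm",   // place 2
  };
  std::string out;
  for (std::size_t i = 0; i < hex.size(); ++i) {
    char d = hex[i];
    int v = (d <= '9') ? d - '0' : d - 'A' + 10;  // hex digit -> 0..15
    out += table[i % 3][v];
  }
  return out;   // Substitute("B4B") == "suc", Substitute("3AA") == "ohk"
}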


Read here http://email.about.com/cs/standards/a/base64_encoding.htm

Base64 encoding takes three bytes, each consisting of eight bits, and represents them as four printable characters in the ASCII standard. It does that in essentially two steps.

The first step is to convert three bytes to four numbers of six bits. Each character in the ASCII standard consists of seven bits. Base64 only uses 6 bits (corresponding to 2^6 = 64 characters) to ensure encoded data is printable and humanly readable. None of the special characters available in ASCII are used. The 64 characters (hence the name Base64) are 10 digits, 26 lowercase characters, 26 uppercase characters as well as '+' and '/'.

If, for example, the three bytes are 155, 162 and 233, the corresponding (and frightening) bit stream is 100110111010001011101001, which in turn corresponds to the 6-bit values 38, 58, 11 and 41.
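That first step, shown as a small illustration on the example bytes:

#include <cstdint>

// Three 8-bit bytes become four 6-bit values; for 155, 162, 233
// the outputs are 38, 58, 11 and 41.
void SplitToSixBits(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t out[4])
{
  out[0] = b0 >> 2;                         // top 6 bits of b0      -> 38
  out[1] = ((b0 & 0x03) << 4) | (b1 >> 4);  // 2 bits of b0, 4 of b1 -> 58
  out[2] = ((b1 & 0x0F) << 2) | (b2 >> 6);  // 4 bits of b1, 2 of b2 -> 11
  out[3] = b2 & 0x3F;                       // bottom 6 bits of b2   -> 41
}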
