开发者

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.

My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.

Considering that I want to restrict my source data to valid, printable ASCII; how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?

Here is how I imagine this working (though you may have a better way):

  1. This source code will take a given String as input
  2. The bytes representation of that 开发者_Go百科string will be taken (UTF8, ASCII, you decide)
  3. Some magic happens - (this is the part I need your help on)
  4. The resulting bytes will be converted into an int or long (no decimal points)
  5. The number will be converted into a corresponding character using this utility http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651

(note that utility will be used to enforce the constraint is that the "final" Unicode name must not include the following characters '/', '\', '#', '?' or '%')

Background

The Microsoft Azure Table has an API that accepts Unicode data for the storage or property names. This is a schema-free database (so columns can be created ad-hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.

In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.


These are just some ideas.

Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.

If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points are noncharacters in the Unicode Standard, such as 0xFFFF, and others are invalid characters, such as 0xD800.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜