String encoding of primitive types preserving lexicographic order
Does anyone know of a library for encoding a number of primitiv开发者_JAVA技巧e types (like integers, floats, strings, etc) into a string but preserving the lexicographical order of the types?
Ideally, I'm looking for a C++ library, but other languages are fine too. Also, one can assume that the format does not need to be encoded in the string itself (that is, if it's int64/string/float then the encoded string does not need to encode this information, only encoding the data is enough).
Take a look at this paper ("Efficient Lexicographic Encoding of Numbers") which shows how to represent any numeric type as a string such the lexicographic order of the strings is the same as the numerical order of the underlying numbers. It copes with arbitrary length numbers.
http://www.zanopha.com/docs/elen.pdf
I had the problem of converting integers and longs to strings which preserve ordering. And since I was working in Java, I only had signed types.
My algorithm was very simple:
- Flip the sign bit (
toEncode ^ Long.MAX_VALUE
for longs) otherwise negative numbers are greater than positive numbers. - Do a modified base64 encoding of the bytes. Unfortunately, the normal base64 encoding doesn't preserve ordering; the special characters (
+
and/
) are after the numbers which are after the characters. This is completely backwards from ASCII. My modified encoding simply uses the ASCII ordering. (To make it clear it wasn't normal base64, I changed the special chars to-
and_
with~
as the padding. These are still useable within an URL, which was another a constraint I had.)
BTW ... In Amazon Web Service's SimpleDB, all data are stored as strings. Its select comparators use lexicographic ordering. AWS provides utility functions to encode various types. For example, integers are encoded knowing the range of the integers apriori and adjusting via zero-padding and offsets (e.g. for negative integers). You could of course give it the worst possible range.
See "Query 201: Tips and Tricks for Amazon SimpleDB Query" - http://aws.amazon.com/articles/1232
http://typica.s3.amazonaws.com/com/xerox/amazonws/sdb/DataUtils.html
Just write numeric values in a fixed column width with leading zeros, and strings as normal. So like this:
0.1 -> 0000000.1000000
123 -> 0000123.0000000
foo -> foo
X -> X
Then you can sort as text (e.g. Unix sort
without -n
). How about that?
精彩评论