A two way String hash function
I want to get a unique numeric representation of a String. I know there are lots of ways of doing this; my question is which one you think is best. I don't want negative numbers, so the hashCode() function in Java is not ideal. I could override it, but I'd rather not, since I'm not confident about it and don't want to accidentally break something.
My Strings are all semantic-web URIs. The reason for the numeric representation is that when I display the data for a URI on a page, I need something to pass into the query string or put into various fields in my JavaScript. The URI itself is too unwieldy and looks bad when you have a URI as a value inside another URI.
Basically I want to have a class called Resource which will look like this:
class Resource {
    private int id;
    private String uri;
    private String value; // the label or human-readable name

    // ... other code/getters/setters here

    public int getId() {
        // pass the URI in; the original call had no argument
        id = stringToIntFunction(uri);
        return id;
    }

    private int stringToIntFunction(String uri) {
        // do magic here
    }
}
Can you suggest a function that would do this if:
- It had to be two way, that is, you could also recover the original string from the numeric value
- It doesn't have to be two way
Also are there other issues that are important that I am not considering?
If you want it to be reversible, you're in trouble. Hashes are designed to be one-way.
In particular, given that an int has 32 bits of information, and a char has 16 bits of information, requiring reversibility means you can only have strings of zero, one or two characters (and even that's assuming that you're happy to encode "" as "\0\0" or something similar). That's assuming you don't have any storage, of course. If you can use storage, then just store numbers sequentially... something like:
private int stringToIntFunction(String uri) {
    Integer existingId = storage.get(uri);
    if (existingId != null) {
        return existingId.intValue();
    }
    return storage.put(uri);
}
Here storage.put() would increase a counter internally, store the URI as being associated with that counter value, and return it. My guess is that that's not what you're after though.
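The storage idea above could be sketched roughly as follows. This is only an in-memory illustration: the class name UriRegistry and its methods idFor and uriFor are made up for this example, and a real application would probably back it with a database so the ids survive a restart.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: hands out sequential, non-negative ids and maps both ways.
final class UriRegistry {
    private final Map<String, Integer> idsByUri = new HashMap<>();
    private final List<String> urisById = new ArrayList<>();

    // Returns the id already assigned to this URI, or assigns the next free one.
    synchronized int idFor(String uri) {
        Integer existing = idsByUri.get(uri);
        if (existing != null) {
            return existing;
        }
        int id = urisById.size(); // the list index doubles as the counter
        urisById.add(uri);
        idsByUri.put(uri, id);
        return id;
    }

    // Recovers the URI for an id that was issued earlier.
    synchronized String uriFor(int id) {
        return urisById.get(id);
    }
}
```

Because the id is just a position in the list, the mapping is trivially two-way, and the numbers stay small and non-negative.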
Basically, to perform a reversible encryption, I'd use a standard encryption library having converted the string to a binary format first (e.g. using UTF-8). I would expect the result to be a byte[].
If it doesn't have to be reversible, I'd consider just taking the absolute value of the normal hashCode() result (but mapping Integer.MIN_VALUE to something specific, as its absolute value can't be represented as an int).
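A minimal sketch of that non-reversible option might look like this. The class name NonNegativeHash is invented for the example, and mapping Integer.MIN_VALUE to 0 is an arbitrary choice; any fixed non-negative value would do.

```java
// Hypothetical helper: a non-negative variant of String.hashCode().
final class NonNegativeHash {
    static int of(String s) {
        int h = s.hashCode();
        // Math.abs(Integer.MIN_VALUE) overflows and stays negative,
        // so that one value is mapped explicitly (here, arbitrarily, to 0).
        return h == Integer.MIN_VALUE ? 0 : Math.abs(h);
    }
}
```

Note that this is still only a 31-bit hash, so two different URIs can end up with the same value.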
Hashes are one way only (that's part of the reason they have a fixed length regardless of the input size). If you need two-way, you're looking at something like Base64 encoding.
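As a rough illustration of the Base64 route, assuming Java 8 or later (for java.util.Base64), the URL-safe variant avoids '+', '/' and '=' so the token can go straight into a query string. The class name UriCodec is made up for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: reversible, URL-safe text token for a URI (not a number).
final class UriCodec {
    static String encode(String uri) {
        return Base64.getUrlEncoder()
                .withoutPadding() // drop the trailing '=' characters
                .encodeToString(uri.getBytes(StandardCharsets.UTF_8));
    }

    static String decode(String token) {
        byte[] bytes = Base64.getUrlDecoder().decode(token);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The token is about a third longer than the original URI, so this trades readability for safety in a query string rather than for brevity.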
Why can't you have negative numbers? Where do the URIs come from? Are they in a database? Why not use the Database Key ID? If they are not in a database, can you generate them for the user given a set of variables/parameters? (So the query string only contains things like foo=1&bar=two and you generate the URL on the Server or JavaScript side)
Given all the remarks made above (a hash function is one way), I would go for one of 2 possible solutions:
- Use some encryption or encoding function to get a long string representing your URL. You'll get something like param=456ab894ce897b98f (this could be longer or shorter depending on the URL). See DES encryption for instance, or base64url.
- Keep track of the URLs in a database (could be also a simple file-based database such as SQLite). Then you'll effectively have an uint <=> URL equivalence.
"Unique representation" implies that the Java-supplied String.hashCode() would be useless: with only 32 bits, you'd soon come across two URIs that shared the same hash code.
Any two-way scheme is going to result in an unwieldy string - unless you store the URIs in a database and use the record ID as your unique identifier.
As far as one-way goes, an MD5 hash would be far less likely to collide than the simple hash code (though still by no means guaranteed unique), but it might be verging on "unwieldy" depending on your definition!
Q1: If you want to recover the string from the number then you could use:
1a: an encryption of the string, which is going to be the same size, or longer, unless you zip the string first. This will give an array of random looking bytes, which could be displayed as Base-64.
1b: a database, or a map, and the number is the index of the string in the map/database.
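For option 1a, the "zip first, then show as Base-64" part could look something like the sketch below. It deliberately leaves out the encryption step, so anyone can decode the token; the class name CompressedToken is invented for the example, and java.util.Base64 assumes Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: compress a string, then render it as URL-safe Base64.
final class CompressedToken {
    static String encode(String s) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(s.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return Base64.getUrlEncoder().withoutPadding().encodeToString(out.toByteArray());
    }

    static String decode(String token) {
        Inflater inflater = new Inflater();
        inflater.setInput(Base64.getUrlDecoder().decode(token));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid token");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("token is not valid compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For short URIs the compressed form may be no smaller than the input, so it is worth measuring on real data before relying on it.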
Q2: The string does not have to be recoverable.
Various ideas are possible here. You can display the hash in hex or in Base-64 to avoid negative signs. The only non-alphanumeric characters in Base-64 are '+', '/' and '='. For an almost unique hash you will need something of cryptographic size, MD5 (128 bits), SHA-1 (160 bits) or SHA-2 (256 or 512 bits).
An MD5 hash looks like "d131dd02c5e6eec4693d9a0698aff95c" in hex; the larger the hash the less likely a collision is.
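A hex MD5 string of that kind can be produced with the standard MessageDigest class. This is a minimal sketch; the class name Md5Hex is invented for the example, and swapping "MD5" for "SHA-256" would give a longer hash (the "%032x" width would then need to become "%064x").

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: one-way hash of a string, shown as 32 hex digits.
final class Md5Hex {
    static String of(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            // Signum 1 keeps the value non-negative; %032x preserves leading zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Hex output contains only 0-9 and a-f, so there is no sign to worry about and nothing that needs escaping in a query string.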
rossum