
URL shortening algorithm

This is not strictly about URL shortening, but that is my use case, so let's treat it as such. The steps for URL shortening are:

  1. Take the full URL
  2. Generate a unique short string to be the key for the URL
  3. Store the URL and the key in a database (a key-value store would be a perfect match here)

Now, about the second point. Here's what I've come up with:

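import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.util.UUID;
import org.apache.commons.codec.binary.Base64;
import org.apache.commons.lang.StringUtils; // Commons Lang 2.x; lang3 also provides StringUtils.left

ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
UUID uuid = UUID.randomUUID();
dos.writeLong(uuid.getMostSignificantBits()); // writes the upper 8 bytes of the UUID
// note: standard Base64 can emit '+' and '/', which need escaping inside a URL
String encoded = new String(Base64.encodeBase64(baos.toByteArray()), "ISO-8859-1");
String shortUrlKey = StringUtils.left(encoded, 6); // returns the leftmost 6 characters
// check if exists in database, repeat until it does not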

Is this good enough?


For a file upload application I wrote, I needed this functionality too. Having read this SO article, I decided to stick with plain random numbers and check whether they already exist in the DB.

So your approach is similar to what I did.
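
The "random key, retry on collision" idea described in this answer can be sketched roughly as follows. This is only an illustration under assumptions: the class name `RandomKeyShortener`, the 62-character alphabet, and the in-memory `HashMap` standing in for the database are all made up for the example and are not taken from the application mentioned above.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: random 6-character keys with a retry on collision.
final class RandomKeyShortener {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int KEY_LENGTH = 6;

    private final SecureRandom random = new SecureRandom();
    // Assumption: an in-memory map stands in for the real key -> URL table.
    private final Map<String, String> store = new HashMap<>();

    /** Draws random keys until an unused one is found, then stores the URL under it. */
    String shorten(String fullUrl) {
        while (true) {
            String key = randomKey();
            // putIfAbsent returns null only when the key was not present before.
            if (store.putIfAbsent(key, fullUrl) == null) {
                return key;
            }
        }
    }

    /** Returns the stored URL, or null if the key is unknown. */
    String resolve(String key) {
        return store.get(key);
    }

    private String randomKey() {
        StringBuilder sb = new StringBuilder(KEY_LENGTH);
        for (int i = 0; i < KEY_LENGTH; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RandomKeyShortener shortener = new RandomKeyShortener();
        String key = shortener.shorten("http://example.com/some/long/path");
        System.out.println(key + " -> " + shortener.resolve(key));
    }
}
```

With a real database, the existence check and the insert would have to be one atomic step (for example a unique constraint on the key column plus a retry when the insert fails), otherwise two concurrent requests could claim the same key.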


Well, what do you mean by URL shortening?

There are very different techniques. Most websites, AFAIK, simply put the database primary key (possibly in some encoded form) at a position in the URL where a regular expression can parse it, and fill the rest with keywords.

Example from Amazon: http://www.amazon.de/Bauknecht-WA-PLUS-614-Waschmaschine/dp/B003V1JDU8/

You can enter anything in place of the product name; only the id at the end matters.

However, you may want to keep your links clean: check whether the requested URL is the correct one and, if a wrong URL turns up, either do a 301 redirect to the real URL or declare a canonical URL.
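
A rough sketch of that idea in Java, under assumptions: the regular expression is modelled only on the example URL above, and `CanonicalPath`, `extractId`, `redirectTarget`, and the `canonicalSlug` parameter are hypothetical names. In a real application the canonical keyword part would be looked up by id, and the returned path would be sent as the `Location` of a 301 response.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: parse the id out of a keyword-decorated path and
// decide whether a redirect to the canonical form is needed.
final class CanonicalPath {
    // Matches paths shaped like /Any-Keywords/dp/B003V1JDU8/ (group 1: keywords, group 2: id).
    private static final Pattern PRODUCT_PATH =
            Pattern.compile("^/([^/]*)/dp/([A-Z0-9]+)/?$");

    /** Returns the id embedded in the path, or empty if the path has a different shape. */
    static Optional<String> extractId(String path) {
        Matcher m = PRODUCT_PATH.matcher(path);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    /**
     * Returns the canonical path to redirect to when the requested path differs from it,
     * or empty when the path is already canonical or does not match at all.
     */
    static Optional<String> redirectTarget(String path, String canonicalSlug) {
        Matcher m = PRODUCT_PATH.matcher(path);
        if (!m.matches()) {
            return Optional.empty();
        }
        String canonical = "/" + canonicalSlug + "/dp/" + m.group(2) + "/";
        return path.equals(canonical) ? Optional.empty() : Optional.of(canonical);
    }

    public static void main(String[] args) {
        String slug = "Bauknecht-WA-PLUS-614-Waschmaschine";
        System.out.println(extractId("/anything/dp/B003V1JDU8/"));
        System.out.println(redirectTarget("/anything/dp/B003V1JDU8/", slug));
    }
}
```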

However:

If you want to do something like TinyURL, my answer is a definite no.

It's not good enough.

Well it depends.

It's not "secure". It would be pretty easy to guess URLs. A better approach would be using some cryptographic function like SHA-1/MD5.

When it comes to collisions I can't really tell. A GUID is designed so that collisions are practically impossible over its full length, but you are only using the first 6 characters of the encoded value. I don't know what exactly they represent in the algorithm. But it's definitely not optimal.
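
To put a rough number on the collision question: 6 Base64 characters carry 36 bits, and for a version-4 UUID such as the one returned by `UUID.randomUUID()` the leading 36 bits of the most significant half are all random (the version marker sits further down). Under the assumption that the keys are therefore uniformly random over 64^6 values, the usual birthday approximation gives an estimate. The class and method names below are made up for this sketch.

```java
// Hypothetical sketch: birthday-bound estimate for 6-character Base64 keys.
final class CollisionEstimate {
    /** Number of distinct keys of the given length over an alphabet of the given size. */
    static double keySpace(int alphabetSize, int length) {
        return Math.pow(alphabetSize, length);
    }

    /**
     * Approximate probability that n uniformly random keys contain at least one
     * duplicate: 1 - exp(-n(n-1) / (2 * keySpace)).
     */
    static double collisionProbability(long n, double keySpace) {
        double exponent = (double) n * (n - 1) / (2.0 * keySpace);
        return -Math.expm1(-exponent);
    }

    public static void main(String[] args) {
        double space = keySpace(64, 6); // 6 Base64 characters = 36 bits
        for (long n : new long[] {1_000, 100_000, 1_000_000}) {
            System.out.printf("%,d keys: %.4f%n", n, collisionProbability(n, space));
        }
    }
}
```

Because the code in the question retries whenever a key already exists, a collision only costs an extra database lookup; the estimate mainly shows how quickly those retries become routine as the table grows.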

Why, however, don't you just use the database's auto-incrementing primary key? If security is important, you also definitely have to go with more than 6 characters.
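
One common way to turn an auto-incrementing key into a short string is to write the number in base 62. The following is a minimal sketch of that, not something taken from this answer; the class name `Base62` and the digit order are assumptions. Six base-62 characters cover ids up to 62^6 - 1, which is about 56.8 billion.

```java
// Hypothetical sketch: base-62 encoding of a non-negative database id.
final class Base62 {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int BASE = 62;

    /** Encodes a non-negative id as a base-62 string. */
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % BASE)));
            id /= BASE;
        }
        // Digits were produced least significant first.
        return sb.reverse().toString();
    }

    /** Decodes a base-62 string back to the id; rejects unknown characters and overflow. */
    static long decode(String key) {
        if (key.isEmpty()) {
            throw new IllegalArgumentException("key must not be empty");
        }
        long id = 0;
        for (int i = 0; i < key.length(); i++) {
            int digit = ALPHABET.indexOf(key.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + key.charAt(i));
            }
            // The exact-arithmetic helpers throw ArithmeticException instead of wrapping around.
            id = Math.addExact(Math.multiplyExact(id, (long) BASE), (long) digit);
        }
        return id;
    }

    public static void main(String[] args) {
        String key = encode(125L);
        System.out.println(key + " -> " + decode(key));
    }
}
```

Keys produced this way are sequential and therefore easy to enumerate, which is the security concern raised in this answer.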

On a project I did I used something like

/database-primary-key/hash-of-primary-key-with-some-token-or-client-information/

This way I could look up the primary key directly in the database, which is the fastest possible lookup, and the hash let me verify that the link had not simply been found by brute force. In my case the hash was the SHA-1 sum of the client's secret token and the primary key.
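
A sketch of how such a link could be built and checked in Java follows. The answer does not say how the token and the key were combined, so the `token:id` concatenation, the hex encoding, and the names `SignedLink`, `pathFor`, and `verify` are all assumptions made for illustration. A keyed construction such as `javax.crypto.Mac` with `HmacSHA1` would be the more conventional choice for this kind of check.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: /primary-key/sha1(token:primary-key)/ links.
final class SignedLink {
    /** Hex-encoded SHA-1 of "secretToken:id" (the separator is an assumption). */
    static String hashFor(long id, String secretToken) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(
                    (secretToken + ":" + id).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Builds a path of the form /id/hash/. */
    static String pathFor(long id, String secretToken) {
        return "/" + id + "/" + hashFor(id, secretToken) + "/";
    }

    /** Returns the primary key if the path is well formed and the hash matches, else -1. */
    static long verify(String path, String secretToken) {
        String[] parts = path.split("/"); // "/42/abc/" -> ["", "42", "abc"]
        if (parts.length != 3 || !parts[0].isEmpty()) {
            return -1;
        }
        long id;
        try {
            id = Long.parseLong(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
        if (id < 0) {
            return -1;
        }
        byte[] expected = hashFor(id, secretToken).getBytes(StandardCharsets.US_ASCII);
        byte[] actual = parts[2].getBytes(StandardCharsets.US_ASCII);
        // MessageDigest.isEqual avoids leaking the position of the first mismatch.
        return MessageDigest.isEqual(expected, actual) ? id : -1;
    }

    public static void main(String[] args) {
        String path = pathFor(42L, "client-secret");
        System.out.println(path + " -> " + verify(path, "client-secret"));
    }
}
```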
