URL shortener: best encoding method?
I'm creating a link shortening service and I'm using base64 encoding/decoding of an incremented ID field to create my urls. A url with the ID "6" would be: http://mysite.com/Ng==
I need to also allow users to create a custom url name, like http://mysite.com/music
Here's my (possibly faulty) approach so far. Help in fixing it would be appreciated.
When someone creates a new link:
- I get the largest link ID from the database (it's not auto incremented)
- Increment the ID by 1
- Generate a short URL code (http://website.com/[short url name]) by base64_encoding that ID
- Insert into links table: id, short_url_code, destination_url
When someone creates a new link and passes a custom short URL:
- My plan was base64_decode their custom string and use that as the link ID, but I didn't realize that you can't just base64_decode any alphanumeric string and turn it into a number.
Is there a better encoding method that will let me turn any number into a short string, and any string into a number, so I can always lookup short urls (whether custom or autogenerated) by turning the name into a number and querying for a link with an ID equal to that number?
First and foremost, make sure you have uniqueness constraints in place on the ID and short_url_code columns.
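As a minimal sketch of what those constraints buy you, the snippet below uses an in-memory SQLite database. The table and column names follow the question; the choice of SQLite and the example.com URLs are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema: the primary key makes id unique, and UNIQUE does the
# same for short_url_code.
conn.execute("""
    CREATE TABLE links (
        id INTEGER PRIMARY KEY,
        short_url_code TEXT NOT NULL UNIQUE,
        destination_url TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO links VALUES (1, 'music', 'http://example.com/a')")

# A second row reusing the same short code is rejected by the database itself.
try:
    conn.execute("INSERT INTO links VALUES (2, 'music', 'http://example.com/b')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

With the constraints in place, the application never has to check for duplicates before inserting; it only has to react when an insert fails.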
When someone creates a new link:
1. Get the next largest link ID from the database (for performance reasons you should really, REALLY use autoincrement or SEQUENCE, depending on what your RDBMS offers; otherwise go ahead and select MAX(ID)+1)
2. Generate a short URL code (http://website.com/[short url name]) from ID using base64_encode or any other custom or standard encoding scheme
3. Insert into the links table: ID, short_url_code, destination_url
4. If the insert fails because of a constraint violation, go back to step 1 to try a new ID; you may have had a violation because:
   - the same ID has already been used (i.e. inserted) in parallel by another thread/process etc. (this will not happen if you used autoincrement or SEQUENCE, and may happen quite often otherwise), and/or
   - the same short_url_code has already been used as a custom URL (this will happen very seldom unless someone is trying to cause trouble on your site)
5. If the insert succeeded, commit and return the short URL to the user
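The steps for an auto-generated link could be sketched roughly as follows. This assumes SQLite and the MAX(ID)+1 fallback rather than autoincrement. The helper names encode_id and create_link, the retry limit, and the choice to skip ahead by one ID per retry are all hypothetical and not part of the original answer.

```python
import base64
import sqlite3

# Assumed schema: both id and short_url_code carry uniqueness constraints.
SCHEMA = """
CREATE TABLE IF NOT EXISTS links (
    id INTEGER PRIMARY KEY,
    short_url_code TEXT NOT NULL UNIQUE,
    destination_url TEXT NOT NULL
)
"""

def encode_id(link_id: int) -> str:
    """URL-safe base64 of the ID's raw bytes, with '=' padding stripped."""
    n_bytes = max(1, (link_id.bit_length() + 7) // 8)
    raw = link_id.to_bytes(n_bytes, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def create_link(conn: sqlite3.Connection, destination_url: str,
                max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        # Step 1: next candidate ID. Adding `attempt` skips ahead on retries,
        # so a code clash with a custom URL does not repeat forever.
        row = conn.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM links").fetchone()
        next_id = row[0] + attempt
        # Step 2: derive the short code from the ID.
        code = encode_id(next_id)
        try:
            # Steps 3 and 5: insert; `with conn` commits on success and
            # rolls back if the insert raises.
            with conn:
                conn.execute(
                    "INSERT INTO links (id, short_url_code, destination_url) "
                    "VALUES (?, ?, ?)",
                    (next_id, code, destination_url),
                )
        except sqlite3.IntegrityError:
            # Step 4: the ID or the code is already taken; try again.
            continue
        return code
    raise RuntimeError("could not allocate a unique link ID")
```

With an empty table, the first call returns the code for ID 1. If a custom URL already occupies the code the next ID would produce, the loop moves on to the following ID instead of failing.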
When someone creates a new link and passes a custom short URL:
1. Perform the same step 1 as above
2. Instead of generating the short URL part from ID as in step 2 above, use the custom short_url_code provided by the user
3. Perform the same step 3 as above
4. If the insert failed because of:
   - a constraint violation on ID: go back to step 1 to try a new ID
   - a constraint violation on short_url_code: return an error to the user asking them to pick a different custom URL, as the short URL they provided has already been used
5. Perform the same step 5 as above
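The custom-URL variant could look something like the sketch below, again assuming SQLite. The names create_custom_link and CodeTakenError are hypothetical. Rather than parsing the database's error message to find out which constraint failed, this sketch simply checks afterwards whether the requested code already exists, which is one possible approach among several.

```python
import sqlite3

# Assumed schema: both id and short_url_code carry uniqueness constraints.
SCHEMA = """
CREATE TABLE IF NOT EXISTS links (
    id INTEGER PRIMARY KEY,
    short_url_code TEXT NOT NULL UNIQUE,
    destination_url TEXT NOT NULL
)
"""

class CodeTakenError(Exception):
    """Raised when the requested custom short code is already in use."""

def create_custom_link(conn: sqlite3.Connection, custom_code: str,
                       destination_url: str, max_attempts: int = 5) -> int:
    for _ in range(max_attempts):
        # Step 1: next candidate ID (MAX(ID)+1 fallback).
        next_id = conn.execute(
            "SELECT COALESCE(MAX(id), 0) + 1 FROM links"
        ).fetchone()[0]
        try:
            # Steps 2, 3 and 5: store the user's code as-is and commit.
            with conn:
                conn.execute(
                    "INSERT INTO links (id, short_url_code, destination_url) "
                    "VALUES (?, ?, ?)",
                    (next_id, custom_code, destination_url),
                )
        except sqlite3.IntegrityError:
            # Step 4: work out which constraint was violated.
            taken = conn.execute(
                "SELECT 1 FROM links WHERE short_url_code = ?", (custom_code,)
            ).fetchone()
            if taken:
                raise CodeTakenError(
                    f"'{custom_code}' is already in use, pick another"
                ) from None
            continue  # the ID clashed with a parallel insert; retry
        return next_id
    raise RuntimeError("could not allocate a unique link ID")
```

Note that lookups then go through short_url_code for both kinds of link, so a custom name never has to be decoded into a number.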
base64 can be used to make short URLs, but it can also make the URL longer. For instance, the base64_encode of the string "1" is 'MQ==', which is 4 times the size. Base64 output always comes in padded blocks of 4 characters, each carrying at most 3 input bytes, which is not ideal for short URLs.
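To make the size difference concrete, here is a small comparison between base64 of the ID's decimal string and a plain base-62 conversion of the integer. Base 62 is not mentioned in the original answer; it is swapped in here only as one common alternative, and the alphabet order and function names are assumptions.

```python
import base64
import string

# Assumed alphabet: 0-9, a-z, A-Z (62 URL-safe symbols, no padding needed).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Write a non-negative integer in base 62."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def base62_decode(s: str) -> int:
    """Inverse of base62_encode; raises ValueError on unknown characters."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(base64.b64encode(b"1").decode())        # MQ==
print(base62_encode(1))                       # 1
print(base64.b64encode(b"1000000").decode())  # 12 characters long
print(base62_encode(1_000_000))               # 4c92
```

Decoding works for any string drawn from the alphabet, but leading "0" characters are lost on a round trip and characters outside the alphabet cannot be decoded at all, which is one more reason to look custom names up by the stored short_url_code instead.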
If size is the most important factor, then you may be able to produce the shortest URLs by relying on internationalization, i.e. using non-ASCII characters in the short code.
Percent-encoding those characters can make a URI rather long (up to 9 ASCII characters for a single Unicode character), but the intention is that browsers only need to display the decoded form, and many protocols can send UTF-8 without the %HH escaping.
Keep in mind that browsers work quite well with UTF-8, and Twitter will have no trouble with these URLs.
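As a quick illustration of the %HH escaping mentioned above, the snippet below percent-encodes a single non-ASCII character with Python's standard library. The euro sign is an arbitrary example character chosen here, not one taken from the answer.

```python
from urllib.parse import quote, unquote

code = "€"             # one visible character, three bytes in UTF-8
escaped = quote(code)  # percent-encode the UTF-8 bytes

print(escaped)       # %E2%82%AC
print(len(escaped))  # 9

# Decoding restores the original single character.
print(unquote(escaped) == code)
```

So the short code is one character when displayed, but nine ASCII characters wherever the escaped form is used.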