URL shortener with no database
I'd like to write a URL shortener that doesn't have to use a database. Instead, to have as few moving parts as possible, the script would just create a unique hash for my URL based on an algorithm (like md5, except an md5 would be too long). I'm not really sure how I'd go about doing this. Any advice?
If it matters, I'd prefer to write this in Ruby.
What you need is a way to compress and decompress a string, where the resulting compressed version is a string too. This is nearly impossible to do usefully, because a URL is already very short. Encoding and lossless compression always add some overhead, which for most URLs results in a string that is larger than the original.
For very long URLs, however, it may work.
So, in the end, you will almost always need a lookup-table in storage (database).
Base64 is the most logical encoding. On its own, however, Base64 returns strings longer than the original for short inputs (which URLs generally are), mostly due to padding. So we'll also try zlib to compress the string first.
require "uri"
require "base64"
require "zlib"
shortner_url = URI.parse("https://s.to")
long = "https://stackoverflow.com/questions/4818429/url-shortener-with-no-database"
url = URI.parse(long)
stripped = url.host + url.path
stripped.length #=> 66
# Let's see that Base64 on its own does not shorten the url.
encoded = Base64.encode64(stripped)
encoded.length #=> 90
# So, using zlib. To compress.
compressed = Zlib::Deflate.deflate(stripped)
encoded = Base64.encode64(compressed)
encoded.length #=> 94
# It became worse.
# Now, with a long URL (they can be much longer still), as a one-liner; to keep it simple we omit the stripping part:
long = "http://www.thelongestlistofthelongeststuffatthelongestdomainnameatlonglast.com/wearejustdoingthistobestupidnowsincethiscangoonforeverandeverandeverbutitstilllookskindaneatinthebrowsereventhoughitsabigwasteoftimeandenergyandhasnorealpointbutwehadtodoitanyways.html"
long.length #=> 263
Base64.encode64(Zlib::Deflate.deflate(long)).length #=> 228
# In order to turn this into a valid short URL, however, we need `urlsafe_encode64()`
shortner_url.path = "/" + Base64.urlsafe_encode64(Zlib::Deflate.deflate(long))
shortner_url.to_s #=> "https://s.to/eJxNjkEWwyAIRG-U7HsbElFpEPIE68vti6t2BcwbZn51v1_7PufcvCKrFDRnMtf8u81HzuA_IWkDEoGG4EtiMN9ObftE6Pgey0FSvK6gIx7GTUl0GsmJSz1Biqpk7fjBDpL-xjGcopKYWfWyiySBRBFJABw9UnB9xaWj1LDCQWUGAQYzBVLECPbyxFLBJDqA7-DxSJ5YIbkGnoM8Ex7bqjf-AiodbYM="
shortner_url.to_s.length #=> 237 WE SAVED 26 characters!
Note on stripping: you can remove 'https://'. A real implementation would then need to prepend something to the string to record whether the original was https or http: '1' + result for https, '0' + result for http. Another "hack" would be to make the URL-shortening service use http for http URLs and https for https URLs.
If you always have the same domain, you can discard the domain part too.
If you have a lot of slashes, or other repeating characters such as a dash, the compression works better.
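As a rough sketch of those stripping notes combined (assuming the shortener only ever sees plain host + path URLs, with no query strings or ports; shorten/expand are hypothetical helper names, not part of any library):

require "uri"
require "base64"
require "zlib"

# Prepend a one-character flag so the scheme can be restored later.
def shorten(url)
  uri = URI.parse(url)
  flag = uri.scheme == "https" ? "1" : "0"
  Base64.urlsafe_encode64(Zlib::Deflate.deflate(flag + uri.host + uri.path))
end

def expand(token)
  stripped = Zlib::Inflate.inflate(Base64.urlsafe_decode64(token))
  scheme = stripped[0] == "1" ? "https" : "http"
  scheme + "://" + stripped[1..-1]
end

expand(shorten("https://example.com/some/long/path")) #=> "https://example.com/some/long/path"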
You could do this with several of the string manipulation tools available, transforming a URL into something obscured; however, as you noted in your question, the URLs you get from doing this would be longer than is typical for a URL shortener.
URLs don't compress very well.
Ultimately, if you're after a short link, you simply need to generate a suitably legible unique code (try to omit similar letters/numbers such as zero and 'o', in case some poor bugger actually has to type it in) and associate that code with the original URL in some form of store.
Whilst I can understand why you don't want to use a database, in many ways it's the perfect form of storage, especially if you look at one of the dedicated key/value stores such as Cassandra, Redis, MongoDB, etc. (That said, a simple "traditional" SQL database may be an easy first step if you're in unfamiliar territory.)
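For illustration, a minimal sketch of that approach, using an in-memory Hash as a stand-in for whatever store you pick (the alphabet, code length, and helper names are just assumptions, not a fixed scheme):

# Alphabet without easily-confused characters (no 0/O, 1/l/I).
ALPHABET = ("a".."z").to_a + ("A".."Z").to_a + ("2".."9").to_a - %w[l o I O]
STORE = {}  # stand-in for Redis/MongoDB/SQL/etc.

def shorten(url)
  code = Array.new(6) { ALPHABET.sample }.join
  code = Array.new(6) { ALPHABET.sample }.join while STORE.key?(code)
  STORE[code] = url
  code
end

def resolve(code)
  STORE[code]
end

code = shorten("https://stackoverflow.com/questions/4818429/url-shortener-with-no-database")
resolve(code) #=> the original URL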
You won't be able to resolve the original URL from a hash code without looking it up in some kind of database.
About the only thing you can do without a database is compress the URL and then decompress it when you resolve the URL.
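A minimal round trip of that compress-then-decompress idea, using zlib and URL-safe Base64 as in the answer above (the variable names are only for illustration):

require "base64"
require "zlib"

url   = "https://stackoverflow.com/questions/4818429/url-shortener-with-no-database"
token = Base64.urlsafe_encode64(Zlib::Deflate.deflate(url))
back  = Zlib::Inflate.inflate(Base64.urlsafe_decode64(token))
back == url #=> true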
Strictly speaking, I guess you could just hash the URL. But of what possible value would that be if you are not able to resolve it back to the original URL?