
Typical URL lengths for storage calculation purposes (URL-shortener)

After reading several of the hits from a quick Google search, it seems there is little consistency when it comes to determining average URL length.

I know IE has a maximum URL length of 2083 characters (from here) - so I have a good maximum to work with.

My concern is that I am writing a URL-shortener in PHP (similar to some other questions on SO), and want to make sure I am not likely to exceed the storage capability of the server hosting it.

If all URLs were at the IE maximum, then 2^32 of them won't fit comfortably anywhere: it'd take roughly 2KB x 4 billion ~= 8TB of storage, which is an unrealistic expectation.

Without adding a trimming function (i.e., purging "old" shortened URLs), what is the safest way to calculate storage usage of the app?

Is ~34 characters a safe guess? If so, then a fully-populated database (using an int type for the primary key, so 2^32 rows) would chew about 146GB of space for the URLs alone, or about 292GB if that is doubled to leave room for any metadata that may need to be stored.
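The arithmetic above can be checked with a few lines of code. This is only a rough sketch in Python (the shortener itself is in PHP): it assumes one byte per URL character and ignores index and row overhead, and the names `estimate_bytes` and `metadata_factor` are made up for illustration.

```python
ROWS = 2**32  # number of rows addressable by a 32-bit integer key

def estimate_bytes(avg_url_bytes, rows=ROWS, metadata_factor=1):
    """Raw storage for the URL column, scaled by a factor for metadata."""
    return rows * avg_url_bytes * metadata_factor

def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return n_bytes / 10**9

if __name__ == "__main__":
    for label, avg, factor in [
        ("34-char guess", 34, 1),
        ("34-char guess, doubled for metadata", 34, 2),
        ("every URL at the IE maximum", 2083, 1),
    ]:
        print(f"{label}: {to_gb(estimate_bytes(avg, metadata_factor=factor)):,.1f} GB")
```

Swapping in a different average length shows how sensitive the total is to that one guess, which is why the average matters more than the maximum here.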

What is the best-guess for an application such as this?


This is probably unknowable without indexing the entire Internet, but according to an analysis by Kelvin Tan on a dataset of 6,627,999 unique URLs from 78,764 unique domains, the mean URL length is 76.97 characters:

Mean: 76.97

Standard Deviation: 37.41

95th% confidence interval: 157

99.5th% confidence interval: 218
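To turn these figures into a storage estimate, a rough sketch in Python follows. It assumes one byte per character, a table of 2^32 rows as in the question, and that URL lengths are independent of each other; the function name `total_bytes_estimate` is hypothetical.

```python
import math

ROWS = 2**32      # one row per value of a 32-bit key (from the question)
MEAN_LEN = 76.97  # mean URL length from the analysis quoted above
SD_LEN = 37.41    # standard deviation from the same analysis

def total_bytes_estimate(rows, mean_len, sd_len):
    """Return (expected total bytes, standard deviation of that total).

    Assumes independent URL lengths, so the variance of the sum is
    rows * sd_len**2 and its standard deviation is sqrt(rows) * sd_len.
    """
    expected = rows * mean_len
    spread = math.sqrt(rows) * sd_len
    return expected, spread

expected, spread = total_bytes_estimate(ROWS, MEAN_LEN, SD_LEN)
print(f"expected total: {expected / 1e9:.1f} GB")
print(f"std dev of total: {spread / 1e6:.2f} MB")
```

Under those assumptions the total comes out around 330GB, and its spread is tiny next to that, so for capacity planning the mean is the number to get right. The long tail mostly matters for choosing the column width.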


I'm not sure what is typical, but of 11,000 URLs in our request database, the average length is 62 characters. There are hundreds of URLs with several hundred characters. The longest is a Google Translate link at 1689 characters.

top 10 len(producturl):
1689
792
707
693
647
606
574
569
562
560

sample url 647 characters:

http://www.amazon.co.jp/%E9%AD%94%E7%95%8C%E6%88%A6%E8%A8%98%E3%83%87%E3%82%A3%E3%82%B9%E3%82%AC%E3%82%A4%E3%82%A24-%E5%88%9D%E5%9B%9E%E9%99%90%E5%AE%9A%E7%89%88-%E5%A0%95%E5%A4%A9%E4%BD%BF%E3%83%95%E3%83%AD%E3%83%B3-%E3%83%97%E3%83%AD%E3%83%80%E3%82%AF%E3%83%88%E3%82%B3%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D%E7%89%B9%E8%A3%BD%E3%82%AB%E3%83%BC%E3%83%89-%E3%83%88%E3%83%AC%E3%83%BC%E3%83%87%E3%82%A3%E3%83%B3%E3%82%B0%E3%82%AB%E3%83%BC%E3%83%89%E3%80%8C%E3%83%B4%E3%82%A1%E3%82%A4%E3%82%B9%E3%82%B7%E3%83%A5%E3%83%B4%E3%82%A1%E3%83%AB%E3%83%84%E3%80%8D%E9%99%90%E5%AE%9APR%E3%82%AB%E3%83%BC%E3%83%89%E4%BB%98%E3%81%8D/dp/B0043RT8UO/ref=pd_rhf_p_t_1

P.S. For estimating purposes, you should extrapolate from a real dataset, after using the standard deviation to throw out the outliers (lengths far from the mean) that could distort your mean.
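One way to do that, sketched in Python: drop every length more than k standard deviations from the mean, then average what is left. The cutoff of k = 2 and the function name `mean_without_outliers` are assumptions for illustration, and the sample lengths are invented, not taken from our database.

```python
import statistics

def mean_without_outliers(lengths, k=2.0):
    """Mean of the lengths lying within k standard deviations of the raw mean."""
    mu = statistics.mean(lengths)
    sigma = statistics.pstdev(lengths)  # population standard deviation
    kept = [n for n in lengths if abs(n - mu) <= k * sigma]
    return statistics.mean(kept)

# Invented sample: nine ordinary URLs and one very long one.
sample = [60] * 9 + [1689]
print("raw mean:", statistics.mean(sample))
print("trimmed mean:", mean_without_outliers(sample))
```

A single pass like this only removes the most extreme values. For heavily skewed data the median, or a percentile, may be a sturdier summary than any trimmed mean.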


From RFC 2068 section 3.2.1:

The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).

Note: Servers should be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations may not properly support these lengths.

Although IE (and probably most other browsers) supports much longer URI lengths, I don't believe most forms or client-side apps rely on anything above 255 bytes working. Your server logs should provide some statistics about what kind of URLs you are seeing.
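As a rough sketch of pulling those statistics out of a log, the Python below assumes access-log lines in Common Log Format, where the request appears as a quoted "METHOD URI HTTP/x.y" string. The helper names `uri_lengths` and `share_over` are hypothetical, and the lengths measured are request URIs (path plus query string), not full URLs.

```python
import re

# Matches the quoted request line, e.g. "GET /some/path?x=1 HTTP/1.1"
REQUEST_RE = re.compile(
    r'"(?:GET|POST|HEAD|PUT|DELETE|OPTIONS|PATCH) (\S+) HTTP/[\d.]+"'
)

def uri_lengths(log_lines):
    """Return the length of the request URI for every line that parses."""
    lengths = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            lengths.append(len(match.group(1)))
    return lengths

def share_over(lengths, limit=255):
    """Fraction of URIs longer than `limit` bytes (0.0 for an empty list)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > limit) / len(lengths)
```

Reading a real log would be something like `uri_lengths(open("access.log"))`, where the file name is a placeholder. Passing the result to `share_over` then shows how often the 255-byte caution from the RFC note is actually exceeded.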
