开发者

SQL Design - How to store large amount of URLs

I'm writing an application that will have a SQL Server backend that will store (among other things) urls. URLS will be mapped to users, and some URLs may be common between different users. In absence of a true DBA, I'm trying to design a solution that can handle hundreds of thousands of URLs as efficiently as possible.

Ideas:

  1. Create table that simply has ID, URL

    Pro: simple, complete.

    CON: duplicate entries for a URL will exist wh开发者_如何学Pythonich will cause the table to be larger than it needs to be.

  2. Break up the user and URLs into separate tables. One table containing USER ID, and URL ID . Another table with URL ID and URL itself.

    Pro: single URL in the system, seems more "enterprisey"

    Con: must join two tables when trying to pull back results, and not really sure what the benefit of this approach is?

  3. Expand on the 2 idea, except REALLY break it up. So have a table for domain, another for path/query string. Then, user table would have userid, domain ID, path ID.

    Pro: urls could share data even if it was unrelated (meaning, cnn.com/helloworld and nbc.com/helloworld would have different domain ids, but same path ids.. seems this could be useful when running metrics later?

    Con: Seems like a nightmare from a performance perspective (again, because joins would be necessary to pull a URL.

Any thoughts?


I would do the following in my design:

UserId  UrlId
1       1
2       2
1       1

UrlId  Url
1      http://www.google.com
2      http://www.yahoo.com

Storing your URLs in a seperate table and only creating a new entry in the URL table, if an exact match does not already exist. If you have a lot of common URLs, this will save some space. You could take it a step farther and add a third table as you mentioned, e.g.

UrlPathId  UrlId  UrlPath
1          1      /shopping

...and then tieing the UrlPathId to the User table. And perhaps even further:

UrlPathId  UrlId  UrlQueryString
1          1      ?product=speakers

...and again, referencing this from your User table.


It sounds like you are describing a many to many relationship between users and URL's.

I would highly suggest ruling out option 1. Not only will this increase size, but because if you need to update a URL or a User, you'll have to do it every time that it's duplicated, instead of once.

Choosing between 2 and 3 is more difficult, because it depends much more on how this is going to be used. #2 is a lot more simplistic, and is still normalized. The features in #3 don't seem to outweigh the complexity to me, so personally I'd pick #2.

Edit: Upon seeing George's answer, I completely agree with the first section.


Are you really that short on space? Unless you need to treat URLs as an object in their own right I would just go for option 1 and cover it with indexes if you have specific performance requirements on URLs alone.

See my other comment here on dealing with orphan URLs.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜