SQL Design - How to store large amount of URLs
I'm writing an application that will have a SQL Server backend that will store (among other things) urls. URLS will be mapped to users, and some URLs may be common between different users. In absence of a true DBA, I'm trying to design a solution that can handle hundreds of thousands of URLs as efficiently as possible.
Ideas:
Create table that simply has ID, URL
Pro: simple, complete.
CON: duplicate entries for a URL will exist wh开发者_如何学Pythonich will cause the table to be larger than it needs to be.Break up the user and URLs into separate tables. One table containing
USER ID
, andURL ID
. Another table withURL ID
andURL
itself.Pro: single URL in the system, seems more "enterprisey"
Con: must join two tables when trying to pull back results, and not really sure what the benefit of this approach is?Expand on the 2 idea, except REALLY break it up. So have a table for domain, another for path/query string. Then,
user
table would haveuserid, domain ID, path ID
.Pro: urls could share data even if it was unrelated (meaning,
cnn.com/helloworld
andnbc.com/helloworld
would have different domain ids, but same path ids.. seems this could be useful when running metrics later?Con: Seems like a nightmare from a performance perspective (again, because joins would be necessary to pull a URL.
Any thoughts?
I would do the following in my design:
UserId UrlId
1 1
2 2
1 1
UrlId Url
1 http://www.google.com
2 http://www.yahoo.com
Storing your URLs in a seperate table and only creating a new entry in the URL table, if an exact match does not already exist. If you have a lot of common URLs, this will save some space. You could take it a step farther and add a third table as you mentioned, e.g.
UrlPathId UrlId UrlPath
1 1 /shopping
...and then tieing the UrlPathId to the User table. And perhaps even further:
UrlPathId UrlId UrlQueryString
1 1 ?product=speakers
...and again, referencing this from your User table.
It sounds like you are describing a many to many relationship between users and URL's.
I would highly suggest ruling out option 1. Not only will this increase size, but because if you need to update a URL or a User, you'll have to do it every time that it's duplicated, instead of once.
Choosing between 2 and 3 is more difficult, because it depends much more on how this is going to be used. #2 is a lot more simplistic, and is still normalized. The features in #3 don't seem to outweigh the complexity to me, so personally I'd pick #2.
Edit: Upon seeing George's answer, I completely agree with the first section.
Are you really that short on space? Unless you need to treat URLs as an object in their own right I would just go for option 1 and cover it with indexes if you have specific performance requirements on URLs alone.
See my other comment here on dealing with orphan URLs.
精彩评论