Database table PK
I have a table where I store comments for user users. I will have 100 Million+ comments.
2 ways I can create it:
Option 1: user name and comment id as PK. That way all comments are stored physically by user name and comment id.
CREATE TABLE [dbo].[Comments](
[user] [varchar](20) NOT NULL,
[com_id] [int] IDENTITY(1,1) NOT NULL,
[com_posted_by] [varchar](20) NOT NULL,
[com_posted_on] [smalldatetime] NOT NULL CONSTRAINT DEFAULT (getdate()),
[com_text] [nvarchar](225) COLLATE NOT NULL,
CONSTRAINT [PK_channel_comments] PRIMARY KEY CLUSTERED
([channel] ASC, [com_id] ASC) WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]) ON [PRIMARY]
Pros: My query will be get all or top 10 comments for a user order by comment_id DESC. This is SEEK
Option 2: I can make the comment id as the PK. That will store the comments sorted by the comment id, not user name.
Cons: Getting latest top 10 comments of a given user is not a seek anymore as data not sto开发者_运维问答red by user (ie. not sorted by user). So I have to create other index to improve the query performance.
Which way is best way to proceed? How about insertion and deletion? These operations are allowed. But read is frequent.
User can't modify their comments.
I tested both tables with 1.1M rows. Here is the result:
table_name rows reserved data index_size unused
comments2 1079892 99488 KB 62824 KB 36576 KB 88 KB (PK: com_id Second Index on (user_name, com_id))
comments1 1079892 82376 KB 82040 KB 328 KB 8 KB (PK: user_name, no other indices)
--------------------------------------------------------------------
diff: same rows 17112KB -19216KB 36,248KB 80KB
So the table with com_id as PK is using 36MB extra disk space just for the 2 index The select top query on both table using SEEK, but table with com_id as PK is slower But insertion is slightly faster when I have com_id as PK
Any comments?
I would use the Comment ID as the Primary Key for the table. If you are going to have a lot of queries that use the Comment ID and the User name, its probably simpler just to add an Index on those fields.
I would not use User name in a PK as it may change, creating cascade update issues later. Also, concatenating those two into the PK creates a large(r) PK that might have to be passed to other tables as a FK. I try to keep PK that appear as FKs as small as possible, unless I know I will want all the PK of the contributing tables in one large key for speed of query. Comment id should be fine. You may need to create an additional index for fast searching on comment id and user name. Will you be doing more insertions/updates or queries? if query intensive, then the index is not an issue.
Are you sure that you have that CREATE TABLE statement correct? You're using [Channel] in the PK definition, and I don't see that as a column. Did you mean [User].
Do you have a user table someplace? If so, you might save a lot of overhead by keying that on an integer value and putting UserID into the comments table, rather than User.
I would PK on the CommentID and then add a non-clustered index on [UserID, CommentID]. That gives you immediate access to a comment by ID (for deleting, etc) without having to involve the UserID value in the WHERE clause; and it provides quick access to the user's comments. I do not, however, tend to work with table of the size you anticipate.
As a rule of thumb, always choose the narrowest PK. Then, to improve performance, you may want to use an integer based User_id, instead of a varchar, and add an index for both columns.
The best approach will depends on the number of users, if you have just a few users the commet_id user_id pk could be better (additionally, parttition by user would be an option); in the other hand, if the number of users are high, a combined Pk will be useless.
My initial approach would be to make CommentID alone the PK, maybe in descending order so you don't have to do any reordering on select. Then put an index on UserID.
If you use the concatenated key, consider switching CommentID to desc.
精彩评论