SQL - How to tag data?
I need to store short tags (A01, B34, etc.) in a SQL table and make sure they're indexed. Creating an INT column for each letter of the alphabet is not an option, because an entry can have multiple 'A' tags, for example.
At first I stored them as one long space-separated string (for example "A01 B34"). But searching that requires a LIKE '%...%' query, which does a full table scan and ignores any indexes, so I'm looking for alternatives.
I now use SQLite FTS (full-text search) to find these tags, but it requires a separate table to store the tags in, fetching results with JOIN queries, and all kinds of other overhead I'd rather avoid.
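For reference, this is roughly what my current setup looks like (a minimal sketch with made-up table names, assuming FTS5):

    -- the main table holding the short strings
    CREATE TABLE entry (id INTEGER PRIMARY KEY, value TEXT);
    -- a separate FTS5 table holding only the tags
    CREATE VIRTUAL TABLE entry_tags USING fts5(tags);

    -- rowids have to be kept in sync manually on insert
    INSERT INTO entry (id, value) VALUES (1, 'some string');
    INSERT INTO entry_tags (rowid, tags) VALUES (1, 'A01 B34');

    -- every search needs a JOIN back to the main table
    SELECT e.* FROM entry e
    JOIN entry_tags ON entry_tags.rowid = e.id
    WHERE entry_tags MATCH 'A01';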
My requirements are pretty simple: I need to store millions of short strings, each with their own tags, and do simple searches for these tags.
Is my current approach (doing FTS on the tags) the fastest? Or is it better to use a NoSQL database for this kind of data?
I'll share how I did this at my previous startup, on the Pageflakes Community site. At Pageflakes, user-created content is tagged. You can see an example here:
http://www.pageflakes.com/Community/Content/Flakes.aspx?moduleKey=4796
Each widget and pagecast has a collection of tags. When someone searches, we give the tags the highest priority, then the title, then the description of the item.
Assuming you have a Content table like this:
Content (id, title, description)
First of all, you need to create a table for all unique tags.
Tag (id, name (unique), countOfItems)
Then you need to map the tag to content rows.
TagContentMap (tagid, contentid)
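In concrete terms, the schema might look like this (a sketch; the column types, IDENTITY keys, and index choices are my assumptions, not taken from the real Pageflakes schema):

    CREATE TABLE Content (
        id INT IDENTITY PRIMARY KEY,
        title NVARCHAR(200),
        description NVARCHAR(2000)
    );

    CREATE TABLE Tag (
        id INT IDENTITY PRIMARY KEY,
        name NVARCHAR(50) UNIQUE,      -- enforces tag uniqueness
        countOfItems INT DEFAULT 0
    );

    CREATE TABLE TagContentMap (
        tagid INT REFERENCES Tag(id),
        contentid INT REFERENCES Content(id),
        PRIMARY KEY (tagid, contentid) -- also serves as the tag-lookup index
    );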
You will now ask: for each new piece of content, do I have to insert into three tables? Not always. You insert into the Tag table only when you have a new tag. Most of the time, people choose existing tags, and after a couple of months of tagging, users will have exhausted the unique tags; from then on, 99% of the time they pick an existing tag. So that removes one insert for you, leaving only one additional insert, as the sketch below shows.
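The write path then looks something like this (a T-SQL-flavored sketch; @tag and @contentid are placeholders, and it ignores the race between the existence check and the insert):

    -- rare case: create the tag only if it's new
    IF NOT EXISTS (SELECT 1 FROM Tag WHERE name = @tag)
        INSERT INTO Tag (name, countOfItems) VALUES (@tag, 0);

    -- the one insert you always do: map the tag to the content
    INSERT INTO TagContentMap (tagid, contentid)
    SELECT id, @contentid FROM Tag WHERE name = @tag;

    -- keep the per-tag usage counter current
    UPDATE Tag SET countOfItems = countOfItems + 1 WHERE name = @tag;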
Moreover, insert volume is always significantly lower than select volume; most likely you will have 99% reads and 1% writes.
Without these two tables, you can never build a UI where users click on a tag and see all the content tagged with it. If you have no need for such a feature, then of course you can just add a "tags" column to the Content table itself and store the tags in comma-delimited format.
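With the map table in place, that tag-click page is a simple join (a sketch against the tables above):

    -- everything tagged 'A01', served by the TagContentMap primary key
    SELECT c.id, c.title
    FROM Tag t
    JOIN TagContentMap m ON m.tagid = t.id
    JOIN Content c ON c.id = m.contentid
    WHERE t.name = 'A01';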
Now the most important point: how to produce the best search results. On the Content table we have a varchar field called "SearchData". It is populated first with the tag names, then the title, then the description. So,
SearchData = comma-delimited tag names + newline + title + newline + description.
Then you use SQL Server's full-text indexing to index the SearchData column only, not any other field in the Content table.
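On SQL Server, the setup looks roughly like this (a sketch; the catalog name and the assumed primary-key index name PK_Content are illustrative):

    -- one-time setup: full-text catalog plus an index on SearchData only
    CREATE FULLTEXT CATALOG ContentCatalog;

    CREATE FULLTEXT INDEX ON Content (SearchData)
        KEY INDEX PK_Content ON ContentCatalog;

    -- one predicate searches tags, title, and description together
    SELECT id, title
    FROM Content
    WHERE CONTAINS(SearchData, 'A01');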
Does this work for you?
You do not give us a lot of details to go on, but your design seems to be all wrong. It is not in third normal form.
@Joshua, please google the term "normalization". Currently your data is denormalized. Denormalization is a legitimate option, but only after normalizing first, and then only as a kind of performance hack. As it stands, your design seems wrong.
For example, instead of one table you should have three:
some_records (id, column1, ..., columnN)
tags (id, title)
some_records_tags (some_record_id, tag_id)
It's a classic design pattern in relational databases, and NoSQL is not needed here.
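As a side benefit, the same mapping table gives you per-tag counts (e.g. for a tag cloud) almost for free; a sketch against the tables above:

    -- how many records carry each tag, most popular first
    SELECT t.title AS tag, COUNT(*) AS uses
    FROM tags t
    JOIN some_records_tags srt ON srt.tag_id = t.id
    GROUP BY t.id, t.title
    ORDER BY uses DESC;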
As other users have pointed out, the data is not well normalized. I'll assume this is intentional and that there is some very large size requirement (hundreds of GB or TB) or a huge throughput requirement you haven't mentioned. But before you start down any path, you should understand exactly what your requirements are: how often you write versus read, what the latency requirements are for writes and reads, and you have to include index maintenance in your calculations.
If you have a significant performance requirement, you might try building a near-line index system on top of what you currently have. I've used this technique in the past for systems with large throughput requirements. The idea is that you make writes as small and quick as possible, and create a batch process that comes back later and copies the data into a secondary search table, in a form that can actually be searched. The benefit is that your writes complete quickly, and if you choose your clustered index well, the reads for the batch processing can be done very efficiently. In addition, you can segment the data onto different servers as necessary to support higher search throughput. The major drawback is that updates are not instantaneously reflected in search results.
If you write into a table like:
data (id binary(6), ..., timestamp datetime, tags varchar(256))
and have a secondary search table:
search (tag char(3), dataId binary(6))
You can create a batch process that comes along, takes the last BATCH_SIZE (maybe 1000) records, splits the tags column on spaces, and inserts/deletes the tags into/from the search table; a sketch follows below. You keep a variable or row somewhere with the last timestamp value you've processed, and start from there at the next batch interval. Finally, if deletes are important, each batch interval will need to find the set of records no longer present in the data table. Alternatively, you could use a tombstone table if your data table is too large, or you can perform deletes concurrently against data and search if they happen infrequently enough.
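Here is a sketch of that batch step in SQLite-flavored SQL, since the question uses SQLite (the :lastTimestamp parameter is a placeholder, and the recursive-CTE split is just one way to do it; newer SQL Server versions could use STRING_SPLIT instead):

    -- explode the space-separated tags of the next batch of rows
    WITH RECURSIVE split(dataId, tag, rest) AS (
        -- seed: the next BATCH_SIZE rows after the last processed timestamp
        SELECT id, '', tags || ' '
        FROM (SELECT id, tags FROM data
              WHERE timestamp > :lastTimestamp
              ORDER BY timestamp LIMIT 1000)
        UNION ALL
        -- peel off one tag per step
        SELECT dataId,
               substr(rest, 1, instr(rest, ' ') - 1),
               substr(rest, instr(rest, ' ') + 1)
        FROM split
        WHERE rest <> ''
    )
    INSERT INTO search (tag, dataId)
    SELECT tag, dataId FROM split WHERE tag <> '';
    -- afterwards, persist max(timestamp) of this batch as the next :lastTimestamp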
Things to watch out for with batch processing are making the batch size too big and taking table locks when updating the search table. You also have to watch out for duplicate timestamps. And of course, when writing or updating the data table, you must always update the timestamp.