
Dealing with a large number of text strings

While it is running, my project collects a large number of text blocks (about 20K in a typical run; the largest I have seen is about 200K of them) in a short span of time and stores them in a relational database. Each text block is relatively small, averaging about 15 short lines (about 300 characters). The current implementation is C# (VS2008) on .NET 3.5, and the backend DBMS is Microsoft SQL Server 2005.

Performance and storage are both important concerns for the project, but performance takes priority over storage. I am looking for answers to these questions:

  • Should I compress the text before storing it in the DB, or let SQL Server worry about compacting the storage?
  • Do you know the best compression algorithm/library to use in this context for the best performance? Currently I just use the standard GZip in the .NET Framework.
  • Do you know any best practices for dealing with this? I welcome outside-the-box suggestions as long as they are implementable in the .NET Framework. (It is a big project and this requirement is only a small part of it.)

EDITED: I will keep adding to this to clarify points raised

  • I don't need text indexing or searching on these text blocks. I just need to be able to retrieve them at a later stage, by primary key, for display as a text block.
  • I have a working solution implemented as described above, and SQL Server has no issue at all handling it. The program will run quite often and needs to work with a large data context, so you can imagine the size will grow very rapidly; hence every optimization I can make will help.


The strings are, on average, 300 characters each. That's either 300 or 600 bytes, depending on Unicode settings. Let's say you use a varchar(4000) column, so each string takes (on average) 300 bytes.

Then you have up to 200,000 of these to store in a database.

That's less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I'd call a "medium" database.

At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven't mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems - and even then it will likely be the result of something else, like a poor indexing strategy.

And compressing certain kinds of data, especially very small amounts (and 300 bytes is definitely small), can sometimes yield worse results: you could end up with "compressed" data that is actually larger than the original. I'm guessing that most of the time, the compressed size will be very close to the original size.
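
To see how little there is to gain, a quick check along these lines (a minimal sketch; the sample text and class name are made up for illustration) compresses a typical block with the framework's GZipStream and prints both sizes:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class CompressionSizeCheck
{
    static void Main()
    {
        // Build a typical text block: ~15 short lines of sample data.
        var sb = new StringBuilder();
        for (int i = 0; i < 15; i++)
            sb.AppendLine("short line of sample text " + i);
        byte[] original = Encoding.UTF8.GetBytes(sb.ToString());

        byte[] compressed;
        using (var output = new MemoryStream())
        {
            // The GZipStream must be closed to flush its final block.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(original, 0, original.Length);
            compressed = output.ToArray();
        }

        // The GZip header and footer alone cost ~18 bytes, so on a payload
        // this small the savings (if any) tend to be marginal.
        Console.WriteLine("original: {0} bytes, compressed: {1} bytes",
            original.Length, compressed.Length);
    }
}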

SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you're on SQL Server 2005. So no: definitely don't bother trying to compress individual values or rows; it's not going to be worth the effort and may actually make things worse.


If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as detailed here: http://msdn.microsoft.com/en-us/library/cc280449.aspx

As an example, you can create a compressed table like this:

CREATE TABLE T1
(c1 int, c2 nvarchar(50))
WITH (DATA_COMPRESSION = PAGE);

If you can't use compression in the database, your strings (at no more than ~300 chars each) are unfortunately not going to be worthwhile to compress using something like System.IO.Compression. I suppose you could try it, though.
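
If you do try it, remember you will also need the reverse step when you read the rows back for display. A roundtrip helper along these lines is enough to benchmark it against storing the plain text (a sketch only; the class and method names are illustrative, and the compressed bytes would go in a varbinary column):

using System.IO;
using System.IO.Compression;
using System.Text;

static class TextBlockCompressor
{
    // Compress one text block for storage in a varbinary column.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
                gzip.Write(raw, 0, raw.Length);
            return output.ToArray();
        }
    }

    // The extra step paid on every retrieval for display.
    public static string Decompress(byte[] stored)
    {
        using (var input = new MemoryStream(stored))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
            return reader.ReadToEnd();
    }
}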


Compression consumes resources, and it will typically hurt performance in a scenario like this, where a significant share of the time is just local communication and processing.


I'm not entirely clear on what you are asking.

In regard to performance: if you compress the strings in memory before storing them in the database, your program is going to be slower than if you just stuff the data straight into the table and let SQL Server worry about it later. The trade-off is that the database will be larger, but 1 TB hard drives are cheap, so is storage really that big a deal?

Based on your numbers (200K × 300 bytes), you are only talking about roughly 60 MB. That is not a very large dataset. Have you considered using the Bulk Copy feature in ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all of your data goes into one table, this should be a good fit.

This would be an alternative to having something like Entity Framework generate what is essentially 200K individual INSERT statements.

UPDATE: Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
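
As a rough sketch of what that looks like (the table name dbo.TextBlocks, its single Body column, and the connection string are assumptions for illustration, not part of the question):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class TextBlockLoader
{
    // Bulk-insert the collected blocks instead of issuing ~200K single INSERTs.
    public static void Save(string connectionString, IEnumerable<string> blocks)
    {
        // Stage the rows in memory; the column name must match the target table.
        var table = new DataTable();
        table.Columns.Add("Body", typeof(string));
        foreach (string block in blocks)
            table.Rows.Add(block);

        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.TextBlocks"; // assumed table name
            bulk.ColumnMappings.Add("Body", "Body");
            bulk.BatchSize = 5000; // send rows in chunks rather than one giant batch
            bulk.WriteToServer(table);
        }
    }
}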


I wouldn't worry about compressing them. For strings this size (300 characters or so), it's going to be more of a headache than it's worth. Compressing strings takes time (no matter how small they are), and SQL Server 2005 has no native way of doing it, which means you would have to write something yourself. Doing it in the application will hurt your performance; you could write a CLR routine to do it in the database instead, but either way it is still an extra step every time the compressed string is actually used by your application (or any other that uses it, for that matter).

Space in a database is cheap, so you aren't really saving much by compressing all the strings. Your bigger problem is going to be keeping a large number of strings in your application's memory. If you routinely go back to the database to load some of them, rather than trying to cache all of them at the same time, I wouldn't worry about it unless you actually start seeing problems.


Sounds like you would benefit from using Large-Value Data Types (varchar(max) and friends).

These data types will store up to 2^31-1 bytes of data.

If all of your strings are smallish, there are diminishing returns to be gained by compressing them. And without native SQL compression, they will not be searchable anyway once compressed.


It sounds like you are trying to solve a decidedly non-relational problem with a relational database. Why exactly are you using a database? It can be done, of course, but some problems just don't fit well. TFS shows that you can brute-force a problem into an RDBMS once you throw enough hardware at it, but that doesn't make it a good idea.
