
Blob data in huge SQL Server database

We generate 20,000,000 text files every year; the average size is approximately 250 KB each (35 KB zipped).

We must keep these files in some kind of archive for 10 years. There is no need to search inside the text files, but we must be able to find a single text file by searching on 5-10 metadata fields such as "productname", "creationdate", etc.

I'm considering zipping each file and storing the files in a SQL Server database, with 5-10 searchable (indexed) columns and a varbinary(MAX) column for the zipped file data.
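A minimal sketch of the kind of table I have in mind (table and column names are just placeholders):

    -- Metadata columns are indexed for lookup; the zipped file itself
    -- goes into the varbinary(max) column.
    CREATE TABLE dbo.ArchivedFile
    (
        FileId        bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ProductName   nvarchar(100)  NOT NULL,
        CreationDate  datetime2(0)   NOT NULL,
        -- ...a handful of further metadata columns...
        ZippedContent varbinary(max) NOT NULL
    );

    CREATE INDEX IX_ArchivedFile_ProductName  ON dbo.ArchivedFile (ProductName);
    CREATE INDEX IX_ArchivedFile_CreationDate ON dbo.ArchivedFile (CreationDate);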

The database will grow huge over the years: 20 million files at ~35 KB zipped is roughly 700 GB per year, so 5-10 TB over the full retention period. I therefore think we need to partition the data, for example by keeping one database per year.

I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the data, but it seems this is more suitable for blobs larger than 1 MB?
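For reference, this is roughly what the FILESTREAM variant would look like (the filegroup name is a placeholder, and FILESTREAM would have to be enabled on the instance and the database first):

    -- FILESTREAM requires a ROWGUIDCOL column; the blob bytes are then
    -- stored as files in the FILESTREAM filegroup rather than in the row.
    CREATE TABLE dbo.ArchivedFileFS
    (
        RowGuid       uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE
                          DEFAULT NEWSEQUENTIALID(),
        ProductName   nvarchar(100) NOT NULL,
        CreationDate  datetime2(0)  NOT NULL,
        ZippedContent varbinary(max) FILESTREAM NULL
    )
    FILESTREAM_ON FS_ArchiveGroup;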

Any other suggestions on how to manage such data volumes?


I'd say keeping the files in the filesystem would be a better idea, and you can keep the file name and path in the DB. Here's a similar question.


FILESTREAM is definitely more suited to larger blobs (750 KB-1 MB and up), as the overhead required to open the external file begins to hurt read and write performance, compared to varbinary(max) blob storage, for small files. If this is not so much of an issue (i.e., reads of blob data after the initial write are infrequent, and the blobs are effectively immutable), then it's definitely an option.

I would probably suggest keeping the files directly in a varbinary(max) column if you can guarantee they won't get much larger, but have this table stored in a separate filegroup using the TEXTIMAGE_ON option, which allows you to move the blob data to different storage from the rest of the metadata if necessary. Also, make sure to design your schema so the actual storage of blobs can be split over multiple filegroups, either using partitions or via some multiple-table scheme, so you can scale out to different disks in the future if necessary; a sketch of both follows below.
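A rough sketch of both ideas (all filegroup and object names here are placeholders that would have to be created first; note that for a partitioned table the varbinary(max) data follows the partition scheme automatically, so TEXTIMAGE_ON only applies in the non-partitioned case):

    -- Non-partitioned: metadata on PRIMARY, blob data on its own filegroup.
    CREATE TABLE dbo.ArchivedFile
    (
        FileId        bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ProductName   nvarchar(100)  NOT NULL,
        CreationDate  datetime2(0)   NOT NULL,
        ZippedContent varbinary(max) NOT NULL
    )
    ON [PRIMARY]
    TEXTIMAGE_ON BlobFileGroup;

    -- Partitioned by year: one filegroup per year, so older years can sit
    -- on cheaper disks. The partitioning column must be part of the
    -- clustered key.
    CREATE PARTITION FUNCTION pfArchiveYear (datetime2(0))
    AS RANGE RIGHT FOR VALUES ('2011-01-01', '2012-01-01');

    CREATE PARTITION SCHEME psArchiveYear
    AS PARTITION pfArchiveYear TO (FG2010, FG2011, FG2012);

    CREATE TABLE dbo.ArchivedFilePartitioned
    (
        FileId        bigint IDENTITY(1,1) NOT NULL,
        ProductName   nvarchar(100)  NOT NULL,
        CreationDate  datetime2(0)   NOT NULL,
        ZippedContent varbinary(max) NOT NULL,
        CONSTRAINT PK_ArchivedFilePartitioned
            PRIMARY KEY CLUSTERED (CreationDate, FileId)
    )
    ON psArchiveYear (CreationDate);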

Keeping the blobs directly associated with the SQL metadata, whether via FILESTREAM or direct varbinary(max) storage, has many advantages over dealing with filesystem/SQL inconsistencies, not least the ease of backup and other management operations.


I assume by "generated" you mean something like data being injected into document templates, so there's a lot of repeated text content, i.e. "boilerplate"?

20 million such "generated" files per year is ~55,000 per day, or ~2,300 per hour!

I would manage such a volume by not generating the text files in the first place, and instead storing database abstracts containing the data that are pumped into the generated text, so that you can reconstitute the full document if necessary; see the sketch below.
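A hypothetical sketch of what I mean (the names and the JSON encoding of the field values are just illustrative assumptions):

    -- Store only the template id plus the injected values; the full
    -- document is re-rendered from the template on demand.
    CREATE TABLE dbo.DocumentAbstract
    (
        DocumentId   bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        TemplateId   int           NOT NULL,  -- which boilerplate was used
        ProductName  nvarchar(100) NOT NULL,
        CreationDate datetime2(0)  NOT NULL,
        FieldValues  nvarchar(max) NOT NULL   -- injected data, e.g. as JSON
    );

Storing a handful of field values per document instead of 250 KB of rendered text could shrink the storage problem by orders of magnitude.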

If you mean something else by "generated" could you elaborate?
