Storing Scientific Data in a Relational Database
I want to store hierarchical, two-dimensional scientific datasets in a relational database (MySQL or SQLite). Each dataset contains a table of numerical data with an arbitrary number of columns. In addition, each dataset can have one or more children of the same type associated with a given row of its table. Each dataset typically has between 1 and 100 columns and between 1 and 1,000,000 rows. The database should be able to handle many datasets (>1000), and reading and writing data should be reasonably fast.
What would the best DB schema to store such kind of data? Is it reasonable to have a "master" table with the names, IDs and relations of individual datasets and in addition one table per dataset which contains the numerical values?
Is it reasonable to have a "master" table with the names, IDs and relations of individual datasets and in addition one table per dataset which contains the numerical values?
That's how I'd do it.
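To make the "master table plus one table per dataset" idea concrete, here is a minimal sketch using SQLite from Python. All table and column names here are illustrative, and the parent/child link (a child dataset pointing at a specific row of its parent's table) is one possible way to model the hierarchy described in the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Master table: one row per dataset, with an optional link tying a
# child dataset to a specific row of its parent's data table.
con.execute("""
    CREATE TABLE datasets (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        parent_id  INTEGER REFERENCES datasets(id),
        parent_row INTEGER          -- row in the parent's data table
    )
""")

def create_dataset(name, columns, parent_id=None, parent_row=None):
    """Register a dataset in the master table and create its own data table."""
    cur = con.execute(
        "INSERT INTO datasets (name, parent_id, parent_row) VALUES (?, ?, ?)",
        (name, parent_id, parent_row))
    ds_id = cur.lastrowid
    cols = ", ".join(f'"{c}" REAL' for c in columns)
    con.execute(f'CREATE TABLE "data_{ds_id}" (row_id INTEGER PRIMARY KEY, {cols})')
    return ds_id

ds = create_dataset("experiment_1", ["time", "voltage", "current"])
con.executemany(
    f'INSERT INTO "data_{ds}" (time, voltage, current) VALUES (?, ?, ?)',
    [(0.0, 1.2, 0.3), (0.1, 1.3, 0.4)])

# A child dataset attached to row 1 of experiment_1's table:
child = create_dataset("fit_results", ["param", "value"],
                       parent_id=ds, parent_row=1)
```

Note that the per-dataset table names have to be built with string formatting, since SQL parameters cannot stand in for identifiers — one of the costs of a schema determined at runtime.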
I'm not exactly sure how the 'arbitrary columns' requirement would work in practice, because data usually doesn't behave like that. Regardless, it sounds like storing it as (row, col, val) triples might work nicely.
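The (row, col, val) layout can be sketched as a single long-format table; the table and column names below are illustrative, assuming SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One generic table holds every value of every dataset; the number of
# logical columns per dataset is unconstrained by the schema.
con.execute("""
    CREATE TABLE values_long (
        dataset_id INTEGER NOT NULL,
        row        INTEGER NOT NULL,
        col        TEXT    NOT NULL,
        val        REAL,
        PRIMARY KEY (dataset_id, row, col)
    )
""")

con.executemany(
    "INSERT INTO values_long VALUES (?, ?, ?, ?)",
    [(1, 0, "time", 0.0), (1, 0, "voltage", 1.2),
     (1, 1, "time", 0.1), (1, 1, "voltage", 1.3)])

# Reassemble one logical row of dataset 1 as a dict:
row0 = dict(con.execute(
    "SELECT col, val FROM values_long WHERE dataset_id = 1 AND row = 0"))
```

The trade-off is the usual one for this kind of layout: maximum flexibility, but every value costs extra key storage, and reassembling wide rows requires pivoting.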
Honestly though, if you don't need to search through it (max, min, etc.), it might be better to use some kind of flat file.
An alternative setup that might be interesting is using SQLite, with a separate database file for each dataset, plus one master one.
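A rough sketch of that per-file setup, assuming made-up paths and table names: the master database records where each dataset file lives, and `ATTACH DATABASE` lets a query span the master and one dataset file when needed:

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

# Master database: one row per dataset, pointing at its own file.
master = sqlite3.connect(os.path.join(workdir, "master.db"))
master.execute("""
    CREATE TABLE datasets (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        path TEXT NOT NULL      -- SQLite file holding this dataset
    )
""")

def new_dataset(name, columns):
    """Create a dataset file and register it in the master database."""
    path = os.path.join(workdir, f"{name}.db")
    master.execute("INSERT INTO datasets (name, path) VALUES (?, ?)",
                   (name, path))
    ds = sqlite3.connect(path)
    cols = ", ".join(f'"{c}" REAL' for c in columns)
    ds.execute(f"CREATE TABLE data ({cols})")
    ds.commit()
    ds.close()
    return path

path = new_dataset("run_42", ["x", "y"])

# Cross-file access via ATTACH:
master.execute("ATTACH DATABASE ? AS run", (path,))
master.execute("INSERT INTO run.data VALUES (1.0, 2.0)")
```

One file per dataset keeps each dataset independently copyable and deletable, at the cost of managing many files and SQLite's limit on the number of simultaneously attached databases.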
Whatever you pick, how well it will work really depends on what you're going to do with the data.
You're going to end up trading off flexibility for performance, I think. You can hard-code your DB schema, which it sounds like you want to avoid but which would give you the best performance, or leave the schema determined at runtime and stored in a 'master' table, which increases your flexibility but reduces your ability to enforce referential integrity and set data types.
If you can spare the time, you could try both approaches for a while until you have enough information about which will perform better for your task.
It's hard to be specific without understanding the problem domain, but if your data is inherently relational, use a relational model. If your data is not inherently relational, I wouldn't try to force it into a relational model for the sake of it — the fact that all datasets happen to have an ID doesn't mean those IDs mean the same thing, or even that they are suitable for use as a primary key.
I'd suggest starting by having each data set in its own table (or tables if there are child records), and create a master table if you need to.
I'd share zebediah49's question on "are you really going to use a database for this? Wouldn't flat files be better?"
We store a bunch of data like this, each dataset in its own flat file. The header of the file contains enough information (timestamp, number of rows/columns, etc.) for the file to be read on its own. Metadata about each file is then kept in the database: at minimum the file location, but it can include other information about the data. For example, we aggregate the data into proxy variables that summarize the details at a high level. Typically this summary data is good enough, but when necessary we can read the file for all the details.
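The approach above can be sketched as follows. This is not the poster's actual format — the binary header layout (timestamp, row count, column count) and the summary columns are assumptions for illustration:

```python
import os
import sqlite3
import struct
import tempfile

workdir = tempfile.mkdtemp()

def write_dataset(path, timestamp, rows):
    """Write a self-describing flat file: fixed header, then raw doubles."""
    n_rows, n_cols = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(struct.pack("<dII", timestamp, n_rows, n_cols))  # 16-byte header
        for row in rows:
            f.write(struct.pack(f"<{n_cols}d", *row))

def read_dataset(path):
    """Read the header, then the full table of doubles."""
    with open(path, "rb") as f:
        timestamp, n_rows, n_cols = struct.unpack("<dII", f.read(16))
        data = [struct.unpack(f"<{n_cols}d", f.read(8 * n_cols))
                for _ in range(n_rows)]
    return timestamp, data

path = os.path.join(workdir, "scan_001.bin")
write_dataset(path, 1_700_000_000.0, [(0.0, 1.2), (0.1, 1.4)])

# The database stores only the location plus high-level summaries:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT, timestamp REAL, n_rows INT, max_val REAL)")
ts, data = read_dataset(path)
con.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
            (path, ts, len(data), max(max(r) for r in data)))
```

Queries against the summaries stay fast regardless of how large the flat files grow, and the files themselves remain readable without the database because of the self-describing header.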