
Is it possible to have an SQL table with over a million columns?

I am building a database for microarray data. Each patient sample has over 1,000,000 features and I would like to store the patient samples as rows in an SQL table with each feature as a column.

                 HuEX Microarray Data
+----+----------+----------+-----+------------------+
| ID | Feature1 | Feature2 | ... | Feature1,000,000 |
+----+----------+----------+-----+------------------+
| 1  |   2.3543 |  10.5454 | ... |          5.34333 |
| 2  |  13.4312 |   1.3432 | ... |         40.23422 |
+----+----------+----------+-----+------------------+

I know most relational database systems have limits on the number of columns in a table.

+------------+------------------------------------+
|       DBMS | Max columns per table              |
+------------+------------------------------------+
| SQL Server | 1,024 (30,000 with sparse columns) |
|      MySQL | 4,096 (65,535-byte row size limit) |
| PostgreSQL | 250 - 1,600                        |
|     Oracle | 1,000                              |
+------------+------------------------------------+

Obviously these limits are too low for my task. Is there any way to increase the number of columns an SQL table can have, or is there another DBMS that can handle such a high number of columns?

Update

Note: all of the columns will have values for all of the rows.


Don't.

Even if you can make it work, it will be very slow and unwieldy.

Instead, you should make a separate table with columns for PatientID, Feature, and Value.
This table would have one row for each cell in your proposed table.

It also makes it possible to add additional information about each patient-feature pair.
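A minimal sketch of that shape (the table and column names here are illustrative, not prescribed by the answer):

CREATE TABLE PatientFeature (
    PatientID INTEGER NOT NULL,
    Feature   INTEGER NOT NULL,
    Value     REAL    NOT NULL,
    -- room for extra per-pair information, e.g. a quality-control flag
    PRIMARY KEY (PatientID, Feature)
);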


You'd normally split (normalize) the tables:

Sample: ID, PatientID
Feature: ID, Name
SampleFeature: SampleID, FeatureID, value

SQL databases can't handle a lot of columns, but they can handle a lot of rows.
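Expressed as DDL, that split might look like this (a sketch: the table and column names follow the outline above, while the types and constraints are assumptions):

CREATE TABLE Sample (
    ID        INTEGER PRIMARY KEY,
    PatientID INTEGER NOT NULL
);

CREATE TABLE Feature (
    ID   INTEGER PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);

CREATE TABLE SampleFeature (
    SampleID  INTEGER NOT NULL REFERENCES Sample(ID),
    FeatureID INTEGER NOT NULL REFERENCES Feature(ID),
    Value     REAL    NOT NULL,
    PRIMARY KEY (SampleID, FeatureID)
);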


Try rearranging your table to:

CREATE TABLE MicroarrayData (
    SampleID  INTEGER,
    FeatureID INTEGER,
    Value     REAL,
    PRIMARY KEY (SampleID, FeatureID)
);
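For example, fetching a handful of features for one sample is then an ordinary indexed lookup (the IDs here are made up):

-- Look up selected features for sample 1.
SELECT FeatureID, Value
FROM   MicroarrayData
WHERE  SampleID = 1
  AND  FeatureID IN (42, 1001, 999999);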


This is actually a use case for an Entity-Attribute-Value (EAV) model, and it may actually be better suited to non-RDBMS/SQL solutions in some demanding environments. (Relational databases are workhorses, though ... might as well use one until it's provably not sufficient ;-)

From the Wikipedia article:

Entity-attribute-value model (EAV) is a data model to describe entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. In mathematics, this model is known as a sparse matrix.

Happy coding.


Well, going with the new information that this is a dense array of homogeneous numeric (double) values and that querying is important (that is, I will disregard de-normalization into blobs/XML and the use of special UDFs), I propose the following:

Split each result into multiple records, where each record is of the form:

ID, SEGMENT, Value_0, ..., Value_x, ...   // one value column for each x in [0, q)

The value of q is arbitrary, but it should be chosen based upon the particular database implementation (e.g. try to fit within the 8k record size in SQL Server) for performance/efficiency reasons.

Each result is then split across records such that the "absolute index" of a given feature is n = SEGMENT * q + x; feature n is therefore found in the record where SEGMENT = n / q (integer division), at position x = n % q. It then follows that the primary key is (ID, SEGMENT).

Querying is thus still easy -- the only change is converting a feature index to/from its segment -- with the only schema addition being the SEGMENT column (which may also participate in an index).

(A separate table can be used to map features to SEGMENT/x pairs if needed. In this way it is similar to an EAV model.)

Thus, while similar in some ways to a fully normalized form, this takes advantage of the packed/homogeneous/static-feature nature of the initial matrix to significantly reduce the number of records -- while 2 million records is an arguably small table and 20 million records is only a "mid-sized" table, 200 million records (the result of 200 chips x 1 million features per chip, if each feature becomes its own record) starts to become daunting. At the same complexity, a q of 200 would reduce the record count to a mere 1 million. (Each compacted record is also much more efficient in terms of its data-to-structure ratio.)
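As a toy sketch of that layout (using q = 4 only to keep the DDL short; in practice q would be chosen to pack each record near the DBMS row-size limit, and all names here are illustrative):

-- One record holds q = 4 consecutive feature values for a given result.
-- Feature n lives in the record where SEGMENT = n / 4, in column Value_(n % 4).
CREATE TABLE MicroarraySegment (
    ID      INTEGER NOT NULL,  -- result/sample ID
    SEGMENT INTEGER NOT NULL,  -- block index
    Value_0 REAL,
    Value_1 REAL,
    Value_2 REAL,
    Value_3 REAL,
    PRIMARY KEY (ID, SEGMENT)
);

-- Example: feature 10 of result 1 sits in segment 2 (10 / 4) at position 2 (10 % 4).
SELECT Value_2
FROM   MicroarraySegment
WHERE  ID = 1 AND SEGMENT = 2;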

Happy coding.


While the above is one tentative "what if" suggestion on my part, I would encourage exploring the problem more -- in particular, the exact data-access patterns required. I'm not sure this is a "typical" usage of a standard RDBMS and an RDBMS may not even be a good way to approach this problem.
