What is a columnar database?
I have been working with warehousing for a while now.
I am intrigued by Columnar Databases and the speed that they have to offer for data retrievals.
I have multi-part question:
- 开发者_如何学Go
- How do Columnar Databases work?
- How do they differ from relational databases?
How do columnar databases work? The defining concept of a column-store is that the values of a table are stored contiguously by column. Thus the classic supplier table from CJ Date's supplier and parts database:
SNO STATUS CITY SNAME
--- ------ ---- -----
S1 20 London Smith
S2 10 Paris Jones
S3 30 Paris Blake
S4 20 London Clark
S5 30 Athens Adams
would be stored on disk or in memory something like:
S1S2S3S4S5;2010302030;LondonParisParisLondonAthens;SmithJonesBlakeClarkAdams
This is in contrast to a traditional rowstore which would store the data more like this:
S120LondonSmith;S210ParisJones;S330ParisBlake;S420LondonClark;S530AthensAdams
From this simple concept flows all of the fundamental differences in performance, for better or worse, between a column-store and a row-store. For example, a column store will excel at doing aggregations like totals and averages, but inserting a single row can be expensive, while the inverse holds true for row-stores. This should be apparent from the above diagram.
How do they differ from relational databases? A relation database is a logical concept. A columnar database, or column-store, is a physical concept. Thus the two terms are not comparable in any meaningful way. Column- oriented DMBSs may be relational or not, just as row-oriented DBMS's may adhere more or less to relational principles.
How do Columnar Databases work?
Columnar database is a concept rather a particular architecture/implementation. In other words, there isn't one particular description on how these databases work; indeed, several are build upon traditional, row-oriented, DBMS, simply storing the info in tables with one (or rather often two) columns (and adding the necessary layer to access the columnar data in an easy fashion).
How do they differ from relational databases? They generally differ from traditional (row-oriented) databases with regards to ...
- performance...
- storage requirements ...
- ease of modification of the schema ...
...in specific use cases of DBMSes.
In particular they offer advantages in the areas mentioned when the typical use is to compute aggregate values on a limited number of columns, as opposed to try and retrieve all/most columns for a given entity.
Is there a trial version of a columnar database I can install to play around? (I am on Windows 7)
Yes, there are commercial, free and also open-source implementation of columnar databases. See the list at the end of the Wikipedia article for starter.
Beware that several of these implementations were introduced to address a particular need (say very small footprint, highly compressible distribution of data, or spare matrix emulation etc.) rather than provide a general purpose column-oriented DBMS per-se.
Note: The remark about the "single purpose orientation" of several columnar DBMSes is not a critique of these implementations, but rather an additional indication that such an approach for DBMSes strays from the more "natural" (and certainly more broadly used) approach to storing record entities. As a result, this approach is used when the row-oriented approach isn't satisfactory, and therefore and tends to
a) be targeted for a particular purpose
b) receive less resources/interest than work on "General Purpose", "Tried and Tested", tabular approach.
Tentatively, the Entity-Attribute-Value (EAV) data model, may be an alternative storage strategy which you may want to consider. Although distinct from the "pure" Columnar DB model, EAV shares several of the characteristics of Columnar DBs.
I would say the best candidate to understand about column oriented databases is to check HBase (Apache Hbase) . You an checkout the code and explore further to find out about the implementation .
Also, Columnar DBs have a built in affinity for data compression, and the loading process is unique. Here's an article I wrote in 2008 that explains a bit more.
You may also be interested in a new report from IDC's Carl Olofson on 3rd generation DBMS technology. It discusses columnar, et al. If you're not an IDC client you can get it free on our site. He's doing a webinar on June 16th, too (also on our site).
(BTW, one comment above lists asterdata but I don't think they are columnar.)
To understand what is column oriented database, it is better to contrast it with row oriented database.
Row oriented databases (e.g. MS SQL Server and SQLite) are designed to efficiently return data for an entire row. It does it by storing all the columns values of a row together. Row-oriented databases are well-suited for OLTP systems (e.g., retail sales, and financial transaction systems).
Column oriented databases are designed to efficiently return data for a limited number of columns. It does it by storing all of the values of a column together. Two widely used Column oriented databases are Apache Hbase and Google BigTable (used by Google for its Search, Analytics, Maps and Gmail). They are suitable for the big data projects. A column oriented database will excel at read operations on a limited number of columns, however write operation will be expensive compared to row oriented databases.
For more: https://en.wikipedia.org/wiki/Column-oriented_DBMS
Product information. This may help. These were to featured products on a Google search.
http://www.vertica.com/
http://www.paraccel.com/
http://www.asterdata.com/index.php
kx is another columnar database, for example used in the financial sector. The licence is somewhat $50K last time I checked, though. No optimisation needed, no index needed, because kx has powerful operators (matlab equivalents: .*
, kron
, bsxfun
, ...).
精彩评论