What is meant by sparse data/ datastore/ database?
Have been reading up on Hadoop and HBase lately, and came across this term-
HBase is an open-source, distributed, sparse, column-oriented store...
What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and hence, would like to know more开发者_JAVA技巧 about it.
In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being storage allocated for the intersection of a row and and a column).
This allows fixed length rows greatly improving read and write times. Variable length data types are handled with an analogue of pointers.
Sparse columns will incur a performance penalty and are unlikely to save you much disk space because the space required to indicate NULL is smaller than the 64-bit pointer required for the linked-list style of chained pointer architecture typically used to implement very large non-contiguous storage.
Storage is cheap. Performance isn't.
Sparse in respect to HBase is indeed used in the same context as a sparse matrix. It basically means that fields that are null are free to store (in terms of space).
I found a couple of blog posts that touch on this subject in a bit more detail:
http://blog.rapleaf.com/dev/2008/03/11/matching-impedance-when-to-use-hbase/
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
At the storage level, all data is stored as a key-value pair. Each storage file contains an index so that it knows where each key-value starts and how long it is.
As a consequence of this, if you have very long keys (e.g. a full URL), and a lot of columns associated with that key, you could be wasting some space. This is ameliorated somewhat by turning compression on.
See: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
for more information on HBase storage
There are two way of data storing in the tables it will be either Sparse data and Dense data. example for sparse data.
Suppose we have to perform a operation on a table containing sales data for transaction by employee between the month jan2015 to nov 2015 then after triggering the query we will get data which satisfies above timestamp condition if employee didnt made any transaction then the whole row will return blank
eg. EMPNo Name Product Date Quantity
1234 Mike Hbase 2014/12/01 1
5678
3454 Jole Flume 2015/09/12 3
the row with empno5678 have no data and rest of the rows contains the data if we consider whole table with blanks row and populated row then we can termed it as sparse data.
If we take only populated data then it is termed as dense data.
The best article I have seen, which explains many databases terms as well.
> http://jimbojw.com/#understanding%20hbase
精彩评论