
What is meant by sparse data / datastore / database?

I have been reading up on Hadoop and HBase lately, and came across this term:

HBase is an open-source, distributed, sparse, column-oriented store...

What do they mean by sparse? Does it have something to do with a sparse matrix? I am guessing it is a property of the type of data it can store efficiently, and hence would like to know more about it.


In a regular database, rows are sparse but columns are not. When a row is created, storage is allocated for every column, irrespective of whether a value exists for that field (a field being the storage allocated for the intersection of a row and a column).

This allows fixed-length rows, which greatly improves read and write times. Variable-length data types are handled with an analogue of pointers.
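To make the fixed-length idea concrete, here is a minimal sketch using Python's `struct` module. The record layout (a 4-byte employee number, an 8-byte name, a 4-byte quantity) and the helper names `pack_rows` and `read_row` are assumptions made up for illustration, not the format of any particular database.

```python
import struct

# Hypothetical fixed-width layout: 4-byte int, 8-byte name, 4-byte int.
# "<" means little-endian with no padding, so every record is 16 bytes.
RECORD = struct.Struct("<i8si")


def pack_rows(rows):
    """Pack (emp_no, name, quantity) tuples into one contiguous buffer."""
    return b"".join(
        RECORD.pack(emp_no, name.encode("ascii"), qty)
        for emp_no, name, qty in rows
    )


def read_row(buf, index):
    """Read row `index` directly: its offset is simply index * record size."""
    emp_no, name, qty = RECORD.unpack_from(buf, index * RECORD.size)
    return emp_no, name.rstrip(b"\x00").decode("ascii"), qty


# The middle row has no name and a zero quantity, yet it still occupies
# a full 16-byte slot, exactly like the populated rows.
buf = pack_rows([(1234, "Mike", 1), (5678, "", 0), (3454, "Jole", 3)])
```

Because every record has the same size, `read_row(buf, 2)` can jump straight to the third row without scanning the first two. The cost is that the empty row takes up just as much space as the others.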

Sparse columns will incur a performance penalty and are unlikely to save you much disk space, because the space required to indicate NULL is smaller than the 64-bit pointer required for the linked-list style of chained-pointer architecture typically used to implement very large non-contiguous storage.

Storage is cheap. Performance isn't.


Sparse, with respect to HBase, is indeed used in the same sense as in a sparse matrix. It basically means that fields that are null cost nothing to store (in terms of space).
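As a rough illustration of "null fields are free", here is a small sketch that models a sparse table as a nested dictionary. This is only a toy model, not how HBase is implemented; the row keys, column names and helper functions are made up for the example.

```python
# Toy model: row key -> {column name -> value}. Only cells that actually
# hold a value get an entry; a missing cell takes no space at all.
sparse_table = {
    "row1": {"info:name": "Mike", "info:product": "HBase"},
    "row3": {"info:name": "Jole"},
}

# Every column that could appear in this hypothetical table.
columns = ["info:name", "info:product", "info:quantity"]


def dense_cell_count(table, columns):
    """Cells a dense layout would allocate: one per (row, column) pair."""
    return len(table) * len(columns)


def sparse_cell_count(table):
    """Cells actually stored in the sparse model."""
    return sum(len(cells) for cells in table.values())


def get_cell(table, row, column):
    """Look up a cell; an absent cell simply reads back as None."""
    return table.get(row, {}).get(column)
```

Here a dense layout would reserve 6 cells (2 rows by 3 columns), while the sparse model stores only the 3 that have values. Reading a cell that was never written, such as `get_cell(sparse_table, "row3", "info:quantity")`, just returns `None`.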

I found a couple of blog posts that touch on this subject in a bit more detail:

http://blog.rapleaf.com/dev/2008/03/11/matching-impedance-when-to-use-hbase/

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable


At the storage level, all data is stored as key-value pairs. Each storage file contains an index so that it knows where each key-value starts and how long it is.

As a consequence of this, if you have very long keys (e.g. a full URL), and a lot of columns associated with that key, you could be wasting some space. This is ameliorated somewhat by turning compression on.
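A simplified sketch of why long keys add up: if each cell is stored as its own key-value entry, the row key is repeated once per column. The tuple layout and the names `to_key_values` and `key_bytes` are assumptions for illustration; a real HBase cell key also carries further fields such as a timestamp, which this sketch leaves out.

```python
def to_key_values(row_key, cells):
    """Flatten one logical row into (row_key, column, value) entries."""
    return [(row_key, column, value) for column, value in sorted(cells.items())]


def key_bytes(entries):
    """Total bytes spent on row keys alone across all entries."""
    return sum(len(row.encode("utf-8")) for row, _, _ in entries)


# Hypothetical row keyed by a full URL, with three columns.
row_key = "http://example.com/some/long/path"
cells = {"c1": "a", "c2": "b", "c3": "c"}

entries = to_key_values(row_key, cells)
overhead = key_bytes(entries)  # the row key is counted once per column
```

With three columns the URL is stored three times, so the key overhead grows with both key length and column count. Repeated keys like this are highly redundant, which is why compression helps.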

For more information on HBase storage, see: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html


There are two ways data can be stored in a table: as sparse data or as dense data. Here is an example of sparse data.

Suppose we run a query on a table containing sales transactions by employee for the period Jan 2015 to Nov 2015. The query returns the data that satisfies that date condition; if an employee did not make any transactions, the whole row comes back blank.

e.g.

 EMPNo  Name  Product  Date        Quantity
 1234   Mike  Hbase    2014/12/01  1
 5678
 3454   Jole  Flume    2015/09/12  3

The row with EMPNo 5678 has no data, while the rest of the rows contain data. If we consider the whole table, with both blank and populated rows, we can call it sparse data.

If we take only the populated rows, it is called dense data.


This is the best article I have seen; it explains many database terms as well.

> http://jimbojw.com/#understanding%20hbase
