
Optimal concurrency model for scalable databases and indexes?

I'm interested in concurrency techniques that are relatively easy to implement and suitable for scaling across multiple nodes.

Also, if you know some more advanced algorithms, please provide some info.

Hope this topic will be useful to others. Thanks!

Update:

I'm interested in NoSQL stores and their data models.


You need to think through exactly what you're looking for. Does "scalable" refer to the size of the DB? The number of readers? The number of simultaneous writers? The throughput of readers in the presence of not-inherently-conflicting writers? By "index", do you mean a traditional totally-ordered index like a B-tree, or will hashing do? Also, what degree of consistency do you want among readers and writers?

The previous comment suggests hashing, which is excellent if you don't need a B-tree-like ordered index, precisely because the hashing operation splits the keyspace into separate pieces, so that simultaneous operations avoid interacting. Range queries, on the other hand, want a tree index of some sort, which has problems because a single update can lock out all other queries while it adjusts the index. The traditional solution to this is multiversion concurrency control (MVCC): http://en.wikipedia.org/wiki/Multiversion_concurrency_control
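Here is a minimal sketch of that hash-partitioning idea in Python: the keyspace is split into fixed shards, each guarded by its own lock, so operations on keys in different shards never contend. The `ShardedStore` class and its parameters are invented for illustration, not taken from any particular database.

```python
import hashlib
import threading

class ShardedStore:
    """Toy key-value store whose keyspace is split into shards by hash.

    Keys that hash to different shards never contend for the same
    lock, which is the concurrency benefit described above.
    """

    def __init__(self, num_shards: int = 16):
        self.num_shards = num_shards
        self.shards = [{} for _ in range(num_shards)]
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _shard_of(self, key: str) -> int:
        # Stable hash: the same key always maps to the same shard,
        # and distinct keys spread roughly uniformly across shards.
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.num_shards

    def put(self, key: str, value) -> None:
        i = self._shard_of(key)
        with self.locks[i]:  # only one shard is locked, not the whole store
            self.shards[i][key] = value

    def get(self, key: str):
        i = self._shard_of(key)
        with self.locks[i]:
            return self.shards[i].get(key)
```

Note that a range scan over such a store would have to visit every shard, which is exactly the ordered-index trade-off described above.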


The ideal approach to maximizing concurrency in a database is to use a "sparsely populated" table with a hashed key. This allows instant retrieval of records by PK (or a surrogate that can get you quickly to the PK) and pretty much eliminates collisions. There is no need to read an index to determine where the record "lives" in the table, because its location can be computed from the hashed PK.

However, by maximizing concurrency in this way you are likely to suffer some other trade-off, such as the inability to quickly retrieve all records for, say, a particular zipcode, or the inability to order the table instantly by some date value. A quick fetch of all records for a given zipcode, or instant ordering of the rows by a date column, would typically require physically grouping those records so they are contiguous on disk, in order to avoid excessive disk thrashing. A sparsely populated table with a hashed key can involve significant disk thrashing when groups of records are fetched (e.g. all customers in New York), while it excels when an individual record is fetched (customer # 123456).
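As a sketch of the "location computed from the hashed PK" point, assuming a fixed-capacity table (the slot count and function name here are invented for illustration):

```python
import hashlib

TABLE_SLOTS = 1_000_003  # capacity of the sparse table (illustrative, prime)

def slot_for(pk: str) -> int:
    """Compute a record's home slot directly from its primary key.

    No index needs to be read: the record's location is a pure
    function of the hashed PK, so a single-record fetch is one seek.
    """
    digest = hashlib.sha1(pk.encode()).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SLOTS
```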

EDIT: I should add that in such hashed-key, sparsely populated databases, it is not unusual to find composite primary keys such as ZIPCODE*CUSTOMERNUMBER, so that all of the customers in a given zipcode end up stored in roughly the same region of the sparsely populated table; this is done to minimize thrashing when zipcode-driven reports are run. So there are ways to mitigate the adverse effects of the approach while preserving its exceptionally low collision rates and no-index-required record fetches.
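One way to realize that composite-key placement, sketched under the same assumptions (the region and slot counts are invented): the region is chosen from the zipcode alone, so one zipcode's customers stay roughly contiguous, while the full composite key spreads them out within the region.

```python
import hashlib

REGIONS = 1024           # coarse regions of the sparse table (illustrative)
SLOTS_PER_REGION = 4096  # slots inside each region (illustrative)

def composite_slot(zipcode: str, customer_no: str) -> int:
    """Place a ZIPCODE*CUSTOMERNUMBER record so that all customers in
    one zipcode land in the same region of the sparse table."""
    # Region is chosen from the zipcode alone, so a zipcode-driven
    # report reads one mostly contiguous region instead of thrashing.
    region = int.from_bytes(
        hashlib.sha1(zipcode.encode()).digest()[:4], "big") % REGIONS
    # Position within the region is chosen from the full composite key.
    offset = int.from_bytes(
        hashlib.sha1(f"{zipcode}*{customer_no}".encode()).digest()[:4],
        "big") % SLOTS_PER_REGION
    return region * SLOTS_PER_REGION + offset
```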

EDIT2: And let's say that you wanted to make the email address the PK, yet did not want the records of everyone from AOL or YAHOO or GOOGLE to be clustered in the same region of the sparsely populated table, creating "bulges" there, like an anaconda that had swallowed a pig. You can use a left-weighted primary-key hashing algorithm to put more emphasis on the values to the left of the @. But if you were to use a numeric sequential PK, you could use a right-weighted algorithm to eliminate such bulges.
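A sketch of one reading of that "left-weighted" idea (the function, weights, and table size are invented for illustration): hash the local part and the domain separately, and let the local part dominate the slot, so a shared domain cannot create a bulge. A right-weighted variant for sequential numeric PKs would do the mirror image, emphasizing the trailing digits.

```python
import hashlib

TABLE_SLOTS = 1_000_003  # capacity of the sparse table (illustrative)

def left_weighted_slot(email: str) -> int:
    """Toy 'left-weighted' placement: the local part (left of the '@')
    dominates the slot, so records sharing a very common domain do not
    cluster into one bulging region."""
    local, _, domain = email.partition("@")
    h_local = int.from_bytes(hashlib.sha1(local.encode()).digest()[:8], "big")
    h_domain = int.from_bytes(hashlib.sha1(domain.encode()).digest()[:8], "big")
    # The local part sets the high bits; the domain contributes only its
    # low byte, shifting the final slot by at most 255 positions.
    return (h_local * 256 + h_domain % 256) % TABLE_SLOTS
```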

