Searches (and general querying) with HBase and/or Cassandra (best practices?)

2022-12-26 05:23

I have a User model object with quite a few fields (properties, if you wish) in it. Say "firstname", "lastname", "city" and "year-of-birth". Each user also gets a "unique id".

I want to be able to search by them. How do I do that properly? How to do that at all?

My understanding (this will work for pretty much any key-value store; the key goes first, then the value):

u:123456789 = serialized_json_object

("u" as a simple prefix for user's keys, 123456789 is "unique id").

Now, since I want to be able to search by firstname and lastname, I can also store:

f:Steve = u:384734807,u:2398248764,u:23276263
f:Alex = u:12324355,u:121324334

So the key is "f", the prefix for firstnames, followed by "Steve", the actual firstname. For "f:Steve" we save as the value the ids of all users who are "Steves".

That makes every search very easy. Querying by a few fields (properties), say by firstname (i.e. "f:Steve") and lastname (i.e. "l:Anything"), is still easy: first get the list of user ids from "f:Steve", then the list from "l:Anything", intersect the two lists, and there you go.
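A minimal sketch of the scheme just described, assuming a plain Python dict in place of the real key-value store. The helper names `put_user` and `search`, and the choice to store id lists as comma-separated strings, are only for illustration:

```python
import json

# A plain dict stands in for the key-value store (assumption for illustration).
store = {}

# Hypothetical mapping from field name to index-key prefix.
PREFIXES = {"firstname": "f", "lastname": "l"}


def read_ids(key):
    """Return the set of user keys stored under an index key."""
    return set(store.get(key, "").split(",")) - {""}


def put_user(uid, user):
    """Save the user record and add its key to each per-field index entry."""
    store[f"u:{uid}"] = json.dumps(user)
    for field, prefix in PREFIXES.items():
        key = f"{prefix}:{user[field]}"
        ids = read_ids(key)
        ids.add(f"u:{uid}")
        store[key] = ",".join(sorted(ids))


def search(**criteria):
    """Intersect the id lists of every requested field=value pair."""
    result = None
    for field, value in criteria.items():
        ids = read_ids(f"{PREFIXES[field]}:{value}")
        result = ids if result is None else result & ids
    return sorted(result or [])


put_user(1, {"firstname": "Steve", "lastname": "Jones"})
put_user(2, {"firstname": "Steve", "lastname": "Smith"})
put_user(3, {"firstname": "Alex", "lastname": "Smith"})

matches = search(firstname="Steve", lastname="Smith")
```

Note that `put_user` does a read-modify-write on each index key, which is exactly where the atomicity and locking problems listed next come from.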

Problems (and there are quite a few):

1. Saving, updating, or deleting a user is a pain. It has to be an atomic and consistent operation. Also, if the size of a value is limited, then we are in (potential) trouble, and I really don't have an answer here. Only compressing the list of user ids? Not too cool, though.
2. What if we want to add a new field to search by, eventually? Say by "city". We certainly can do it the same way, "c:Los Angeles" = ..., "c:Chicago" = ..., but if we didn't foresee all those "search choices" from the very beginning, then we will have to create some nightly job or something to go through all existing User records and update those "c:CITY" entries for them... Quite a big job!
3. Problems with locking. User "u:123" updates his name to "Alex", and user "u:456" updates his name to "Alex". They both have to update "f:Alex" with their ids. That means either we get into an overwriting problem, or one update will wait for another (and imagine if there are many of them?!).

What's the best way of doing that? Keeping in mind that I want to search by many fields?

P.S. Please, the question is about HBase/Cassandra/NoSQL/key-value stores. Please, please: no advice to use MySQL and "read about" SELECTs, and to worry about scaling problems "later". There is a reason why I asked MY question exactly the way I did. :-)

Being able to query properties directly is one of the features you lose when moving away from SQL, so you need a way to maintain your own index to let you find records.

If your datastore does not have built-in indexing or atomic list operations, you will need to deal with the locking issues you mention. However, indexing doesn't necessarily need to be synchronous: maintain a queue of updated records to be reindexed, and you have a solution for your third problem (locking) that can be reused to solve the second (adding a new search field later) as well.
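A rough sketch of that queue idea, assuming in-memory dicts in place of the datastore and a single indexer draining the queue (so index writes never race each other). All names here (`save`, `drain_queue`, `indexed_fields`, and so on) are made up for the example:

```python
from collections import deque

records = {}            # uid -> dict of fields (stands in for the main datastore)
index = {}              # "field:value" -> set of uids
indexed_keys = {}       # uid -> index keys the indexer last wrote for that record
reindex_queue = deque() # uids waiting to be (re)indexed
indexed_fields = ["firstname"]


def save(uid, fields):
    """Write the record, then just enqueue it; no index access on this path."""
    records[uid] = dict(fields)
    reindex_queue.append(uid)


def drain_queue():
    """Single indexer: bring the index in line with each queued record."""
    while reindex_queue:
        uid = reindex_queue.popleft()
        record = records.get(uid, {})
        new_keys = {f"{field}:{record[field]}"
                    for field in indexed_fields if field in record}
        old_keys = indexed_keys.get(uid, set())
        for key in old_keys - new_keys:      # values the record no longer has
            index[key].discard(uid)
        for key in new_keys - old_keys:      # values that are new for the record
            index.setdefault(key, set()).add(uid)
        indexed_keys[uid] = new_keys


save("u:123", {"firstname": "Alex", "city": "Chicago"})
save("u:456", {"firstname": "Alex", "city": "Boston"})
drain_queue()

# Adding a search field later: register it and enqueue every existing record.
indexed_fields.append("city")
reindex_queue.extend(records)
drain_queue()
```

The trade-off is that the index lags slightly behind the records, so a search right after a write may miss it until the queue has been drained.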

If the index list for a particular value becomes too large for the system to handle in a single list, you can replace the list of users with a list of lists. However, if you have that many records with the same value, it probably isn't a particularly useful search criterion anyway.
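One way the list-of-lists idea might look, again with a dict standing in for the store. The `#n` chunk-key naming and the tiny `CHUNK_SIZE` are assumptions chosen to keep the example short:

```python
CHUNK_SIZE = 3   # deliberately tiny; a real limit would come from the store

store = {}       # stands in for the key-value store


def add_to_index(key, uid):
    """Append uid to the last chunk of key, opening a new chunk when it is full."""
    chunk_keys = store.setdefault(key, [])   # top-level value: list of chunk keys
    if chunk_keys and len(store[chunk_keys[-1]]) < CHUNK_SIZE:
        store[chunk_keys[-1]].append(uid)
    else:
        chunk_key = f"{key}#{len(chunk_keys)}"
        store[chunk_key] = [uid]
        chunk_keys.append(chunk_key)


def lookup(key):
    """Read every chunk in order and return the combined list of uids."""
    return [uid for chunk_key in store.get(key, []) for uid in store[chunk_key]]


for n in range(7):
    add_to_index("f:Steve", f"u:{n}")
```

No single value ever holds more than `CHUNK_SIZE` ids, at the cost of one extra read per chunk on lookup. This sketch does not de-duplicate ids.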

Another option that is useful in some cases is to use a separate system for the indexing. For example, you could set up Lucene to index the records in your main datastore.

I guess I would implement this as a MapReduce job that runs on a schedule. Each search word would be a row key with a lookup to the UID.

Rowkey: uid1
profile:firstName: Joe
profile:lastName: Doe
profile:nick: DoeMaster

Rowkey: uid2
profile:firstName: Jane
profile:lastName: Doe
profile:nick: SuperBabe

MapReduce indexes all searchable properties and adds them, with the search word as the row key:

Rowkey: Jane
lookup:uid: uid2

Rowkey: Doe
lookup:uid: uid2, uid1

Rowkey: DoeMaster
lookup:uid: uid1

..etc
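To make the two phases concrete, here is a small in-memory imitation in Python. It is not Hadoop or HBase code; `map_phase` and `reduce_phase` are hypothetical names, and the dicts only mimic the row layout shown above:

```python
from collections import defaultdict

# The two user rows from the example above, keyed by row key.
users = {
    "uid1": {"profile:firstName": "Joe", "profile:lastName": "Doe",
             "profile:nick": "DoeMaster"},
    "uid2": {"profile:firstName": "Jane", "profile:lastName": "Doe",
             "profile:nick": "SuperBabe"},
}


def map_phase(uid, columns):
    """Emit one (search word, uid) pair per searchable column value."""
    for value in columns.values():
        yield value, uid


def reduce_phase(pairs):
    """Group uids by search word and build one index row per word."""
    grouped = defaultdict(set)
    for word, uid in pairs:
        grouped[word].add(uid)
    return {word: {"lookup:uid": sorted(uids)} for word, uids in grouped.items()}


pairs = [pair for uid, columns in users.items() for pair in map_phase(uid, columns)]
index_table = reduce_phase(pairs)
```

After the run, `index_table` holds one row per search word, each with a `lookup:uid` list, matching the index rows sketched above.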

Now, if you need to update the index on the fly when a user changes, you would write the change directly to the index table by removing the uid from the old row key and adding it to the new one. In case two such updates happen at the same time, temporary locking could be implemented.
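Roughly, such an on-the-fly update could look like the sketch below. A process-local `threading.Lock` stands in for whatever row locking or compare-and-set the real store offers, and `move_uid` is a made-up helper name:

```python
import threading

# Index rows: search word -> set of uids (in-memory stand-in for the index table).
index_table = {"Jane": {"uid2"}, "Doe": {"uid1", "uid2"}}

# Stand-in for the store's own locking; only protects threads in this process.
index_lock = threading.Lock()


def move_uid(uid, old_word, new_word):
    """Remove uid from the old index row and add it to the new one as one step."""
    with index_lock:
        uids = index_table.get(old_word, set())
        uids.discard(uid)
        if not uids:
            index_table.pop(old_word, None)   # drop rows that became empty
        index_table.setdefault(new_word, set()).add(uid)


# uid2 changes last name from Doe to Smith.
move_uid("uid2", "Doe", "Smith")
```

Holding one lock around both writes keeps a concurrent reader in the same process from seeing the uid in neither row or in both.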

For users being removed, an additional attribute recording the state of the user could be used to filter them out of search results.
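As a small illustration, assuming a hypothetical `profile:state` column with the values "active" and "removed" (neither is prescribed by HBase):

```python
# User rows with an extra state column; the column name is an assumption.
users = {
    "uid1": {"profile:firstName": "Joe", "profile:state": "active"},
    "uid2": {"profile:firstName": "Jane", "profile:state": "removed"},
}

# The index still lists both users under "Doe".
index_table = {"Doe": ["uid2", "uid1"]}


def search(word):
    """Look up uids for a word, skipping users that are not marked active."""
    return [uid for uid in index_table.get(word, [])
            if users.get(uid, {}).get("profile:state") == "active"]


active_does = search("Doe")
```

The index itself is left untouched; stale entries can be cleaned up by the next scheduled indexing run.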

Adding an additional search word isn't very hard, since it's just a matter of which name:value pairs you want to index. You could also narrow the search further by adding a type attribute to your row key/keyword, e.g. boston - lookup:type: city.

The idea is to maintain your own row-key-based search index inside HBase.
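A sketch of the typed-keyword variant, with the type both prefixed to the row key and stored as a `lookup:type` column. The `type:value` key format and the lowercasing are assumptions for the example:

```python
index_table = {}   # in-memory stand-in for the index table


def index_value(uid, value_type, value):
    """Index a value under a row key that carries its type."""
    row_key = f"{value_type}:{value.lower()}"
    row = index_table.setdefault(
        row_key, {"lookup:type": value_type, "lookup:uid": []})
    if uid not in row["lookup:uid"]:
        row["lookup:uid"].append(uid)


def search(value, value_type):
    """Return the uids indexed for this value with the requested type only."""
    row = index_table.get(f"{value_type}:{value.lower()}")
    return list(row["lookup:uid"]) if row else []


index_value("uid1", "city", "Boston")
index_value("uid2", "lastName", "Boston")   # someone whose last name is Boston

city_hits = search("boston", "city")
```

With the type in the key, a search for the city Boston no longer returns users whose last name happens to be Boston.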
