What should I do to accommodate large-scale data storage and retrieval?
There are two columns in a table inside a MySQL database. The first column contains the fingerprint, while the second one contains the list of documents that have that fingerprint, much like an inverted index built by search engines. An example record from the table is shown below:
34 "doc1, doc2, doc45"
The number of fingerprints is very large (it can range up to trillions). The database basically supports two operations: inserting/updating a record, and retrieving a record by matching its fingerprint. The Python snippet for the table definition is:
self.cursor.execute("CREATE TABLE IF NOT EXISTS `fingerprint` (fp BIGINT, documents TEXT)")
And the snippet for the insert/update operation is:
if self.cursor.execute("UPDATE `fingerprint` SET documents=CONCAT(documents,%s) WHERE fp=%s", ("," + newDocId, thisFP)) == 0L:
    self.cursor.execute("INSERT INTO `fingerprint` VALUES (%s, %s)", (thisFP, newDocId))
The only bottleneck I have observed so far is the query time in MySQL. My whole application is web-based, so time is a critical factor. I have also thought of using Cassandra but have little knowledge of it. Please suggest a better way to tackle this problem.
Get a high-end database. Oracle has some offerings; SQL Server does too.
TRILLIONS of entries is well beyond the scope of a normal database. This is very high-end, very specialized territory, especially if you want decent performance. Also get the hardware for it - that means a decent mid-range server, 128+ GB of memory for caching, and either a decent SAN or a good enough DAS setup via SAS.
Remember, TRILLIONS means:
- 1000 GB of storage used for EVERY BYTE per record.
If the fingerprint is stored as an int64, that is 8000 GB of disc space for this data alone (worked out below). Or do you plan to run that from a small, cheap server with a couple of 2 TB discs? Good luck.
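As a quick sanity check on that figure (a sketch assuming one trillion fingerprints and decimal gigabytes):

entries = 10 ** 12            # one trillion fingerprints
bytes_per_fp = 8              # int64 fingerprint
print(entries * bytes_per_fp / 10 ** 9)  # 8000.0 GB for the fingerprint column alone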
That data structure isn't a great fit for SQL - the 'correct' design in SQL would be to have a row for each fingerprint/document pair, but querying would be impossibly slow unless you add an index that would take up too much space. For what you are trying to do, SQL adds a lot of overhead to support functions you don't need while not supporting the multi-value column that you do need (see the sketch below).
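For concreteness, a minimal sketch of that row-per-pair layout, reusing the question's cursor; the table name fingerprint_doc and the column sizes here are assumptions:

# One row per fingerprint/document pair instead of a comma-separated TEXT column.
self.cursor.execute(
    "CREATE TABLE IF NOT EXISTS `fingerprint_doc` ("
    " fp BIGINT,"
    " doc_id VARCHAR(64),"
    " PRIMARY KEY (fp, doc_id))")  # the index that makes lookups fast but costs space

# Insert one pair; IGNORE makes re-inserting an existing pair a no-op.
self.cursor.execute(
    "INSERT IGNORE INTO `fingerprint_doc` VALUES (%s, %s)", (thisFP, newDocId))

# Retrieve every document id sharing a fingerprint.
self.cursor.execute(
    "SELECT doc_id FROM `fingerprint_doc` WHERE fp=%s", (thisFP,))
docs = [row[0] for row in self.cursor.fetchall()]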
A Redis cluster might be a good fit - the atomic set operations should be perfect for what you are doing, and with the right virtual memory setup and consistent hashing to distribute the fingerprints across nodes it should be able to handle the data volume. The commands would then be
SADD fingerprint docid
to add or update the record, and
SMEMBERS fingerprint
to get all the document ids with that fingerprint.
SADD is O(1). SMEMBERS is O(n), but n is the number of documents in the set, not the number of documents/fingerprints in the system, so effectively also O(1) in this case.
The SQL insert you are currently using is O(n), with n being the very large total number of records, because the records are stored as an ordered list that must be reordered on insert, rather than as a hash table, which is constant time for both get and set.
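A minimal sketch of the same operations from Python using the redis-py client; the host, port, and the fp:<fingerprint> key prefix are assumptions for illustration:

import redis

r = redis.Redis(host="localhost", port=6379)

def add_document(fp, doc_id):
    # SADD: adding an already-present member is a no-op,
    # so insert and update collapse into one call.
    r.sadd("fp:%d" % fp, doc_id)

def get_documents(fp):
    # SMEMBERS returns every member of the set as raw bytes.
    return [m.decode() for m in r.smembers("fp:%d" % fp)]

add_document(34, "doc1")
add_document(34, "doc2")
print(get_documents(34))  # ['doc1', 'doc2'] (set order is not guaranteed)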
Greenplum data warehouse - free of charge (FOC), Postgres-driven. Good luck...