How to deal with large data sets for analytics, and varying numbers of columns?
I'm building an analytics system for a mobile application and have had some difficulty deciding how to store and process large amounts of data.
Each row will represent a 'view' (like a web page) and store some fixed attributes, like user agent and date. Additionally, each view may have a varying number of extra attributes, which relate to actions performed or content identifiers.
I've looked at Amazon SimpleDB, which handles the varying number of attributes well, but has no support for GROUP BY and doesn't seem to perform well when COUNTing rows either. Generating a monthly graph with 30 data points would require a query for each day per dataset.
MySQL handles the COUNT and GROUP modifiers much better, but additional attributes require storage in a link table and a JOIN to retrieve views whose attributes match a given value, which isn't very fast. MySQL 5.1's partitioning feature may help speed things up a bit.
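To make that concrete, the layout I'm describing is roughly the following (table and column names are illustrative only):

-- Fixed attributes live on the main table
CREATE TABLE views (
    id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_agent VARCHAR(255),
    created_at DATETIME NOT NULL
);

-- Varying extra attributes go in a link table, one row per key/value pair
CREATE TABLE view_attributes (
    view_id BIGINT UNSIGNED NOT NULL,
    `key`   VARCHAR(64) NOT NULL,
    `value` VARCHAR(255) NOT NULL,
    KEY (`key`, `value`),
    FOREIGN KEY (view_id) REFERENCES views (id)
);

-- Counting views where an extra attribute matches a value needs the JOIN
SELECT COUNT(*)
FROM views v
JOIN view_attributes a ON a.view_id = v.id
WHERE a.`key` = 'action' AND a.`value` = 'share';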
What I have gathered from a lot of reading and profiling queries on the aforementioned systems is that ultimately all of the data needs to be aggregated and stored in tables for quick report generation.
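For example, a daily roll-up table along these lines (again, names are illustrative only) would turn the monthly graph into a handful of reads instead of a query per day:

-- Pre-aggregated counts, filled once per day
CREATE TABLE daily_view_counts (
    report_date DATE NOT NULL,
    user_agent  VARCHAR(255) NOT NULL,
    view_count  INT UNSIGNED NOT NULL,
    PRIMARY KEY (report_date, user_agent)
);

INSERT INTO daily_view_counts (report_date, user_agent, view_count)
SELECT DATE(created_at), user_agent, COUNT(*)
FROM views
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at < CURDATE()
GROUP BY DATE(created_at), user_agent;

-- The monthly graph is then a trivial query over ~30 pre-aggregated rows
SELECT report_date, SUM(view_count)
FROM daily_view_counts
WHERE report_date >= CURDATE() - INTERVAL 30 DAY
GROUP BY report_date;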
Have I missed anything obvious in my research, and is there a better way to do this than using MySQL? It doesn't feel like the right tool for the job, but I can't find anything capable of both GROUP/COUNT queries and a flexible table structure.
This is a case where you want to store the data once and read it over and over. Further, I think you'd want the queries to be precomputed rather than calculated on every request.
My suggestion for you is to store your data in CouchDB for the following reasons:
- Its documents are schemaless, so each record can carry its own set of attributes
- Its view queries are precomputed
- Its map-reduce support lets your views handle GROUP BY-style aggregation
- It has a REST access model, which lets you connect from pretty much anything that can make HTTP requests
You may find this suggestion a little out there considering how new CouchDB is. However, I'd suggest you read up on it, because personally I think running a CouchDB database is sweet and lightweight, more so than MySQL.
Keeping it in MySQL: if the number of writes is limited / reads are more common, and the data is relatively simple (i.e. you can predict the possible characters), you could try using a text/blob column in the main table which is updated with comma-separated values or key/value pairs by an AFTER INSERT / UPDATE trigger on the join table. You keep the actual data in a separate table, so searching for MAXes / specific 'extra' attributes can still be done relatively fast, but retrieving the complete dataset for one of your 'views' becomes a single row read from the main table, which you can split into separate values in the script / application you're using, relieving much of the stress on the database itself.
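A minimal sketch of such a trigger, assuming a cache column on main_table and main_id / `key` / `value` columns on the join table (all names illustrative); it only covers inserts, so updates and deletes would still need a full rebuild of the cached string:

-- Append each newly inserted key/value pair to the cached string
CREATE TRIGGER join_table_after_insert
AFTER INSERT ON join_table
FOR EACH ROW
    UPDATE main_table
    SET cache = CONCAT_WS(';', cache, CONCAT(NEW.`key`, '=', NEW.`value`))
    WHERE id = NEW.main_id;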
The downside of this is a tremendous increase in the cost of updates / inserts in the join table: every alteration of data would require re-querying all related data for a record, plus a second write to the main table, something like:
-- Rebuild the cached attribute string for record 'foo'
UPDATE main_table
SET main_table.cache = (
    SELECT GROUP_CONCAT(CONCAT(join_table.`key`, '=', join_table.`value`) SEPARATOR ';')
    FROM join_table
    WHERE join_table.main_id = main_table.id
)
WHERE main_table.id = 'foo';
However, as analytics data usually trails somewhat, not every change necessarily has to trigger a cache update; a daily cron script filling the cache with yesterday's data could do.
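A sketch of what that nightly job could run, assuming a created_at column on main_table for picking out yesterday's records (names as above, purely illustrative):

-- Rebuild the cached attribute strings for yesterday's records only
UPDATE main_table
SET cache = (
    SELECT GROUP_CONCAT(CONCAT(join_table.`key`, '=', join_table.`value`) SEPARATOR ';')
    FROM join_table
    WHERE join_table.main_id = main_table.id
)
WHERE main_table.created_at >= CURDATE() - INTERVAL 1 DAY
  AND main_table.created_at < CURDATE();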