Best technique for processing a big database (parallel computing?)
I've been given the task of extracting all the Google Analytics data for our website, analysing it, loading it into a database, and then producing some reports on that data. The problem is that I end up with almost 2M records in a table. The reports consist of a few SUM() and AVG() queries which, as you can imagine, in some cases take too long (the worst case, with no date filter and wide-ranging criteria depending on the type of report, takes between 8 and 10 minutes). Given that users will have access to those reports, this time is unacceptable. Currently the project is using Postgres. I am more than aware that no RDBMS will handle this kind of data in under 10 seconds, especially running on a single machine.
The question is: what would be the best choice of software and architecture/technique to achieve good results?
I tried MongoDB, but map/reduce really isn't any faster on a single machine, given that it is single-threaded (for now, at least).
I know I am probably looking at a parallel system, but I'm still unsure. I feel comfortable using MongoDB, and I've read that they are improving a lot in terms of clustering and using multiple nodes, but I would love for someone to share other opinions and experience on the most efficient way of doing this, not only in terms of computing power but also in terms of price.
Thanks
P.S. In regard to some of the answers saying that Postgres should handle this amount of data, I am posting some more details:
Table structure:
-- Table: ga_content_tracking
-- DROP TABLE ga_content_tracking;
CREATE TABLE ga_content_tracking
(
  id serial NOT NULL,
  country character varying(255),
  city character varying(255),
  page_title character varying(255),
  page_path character varying(255),
  source character varying(255),
  referral_path character varying(255),
  date date NOT NULL,
  visits integer,
  pageviews integer,
  avgtime_on_site double precision,
  yacht_id integer,
  charter_listing boolean DEFAULT false,
  sales_listing boolean DEFAULT false,
  directory_listing boolean DEFAULT false,
  news_related boolean DEFAULT false,
  visitor_type character varying(30),
  organisation_id integer
)
WITH OIDS;
ALTER TABLE ga_content_tracking OWNER TO postgres;
-- Index: ga_content_tracking_charter_listing
-- DROP INDEX ga_content_tracking_charter_listing;
CREATE INDEX ga_content_tracking_charter_listing
ON ga_content_tracking
USING btree
(charter_listing);
-- Index: ga_content_tracking_country
-- DROP INDEX ga_content_tracking_country;
CREATE INDEX ga_content_tracking_country
ON ga_content_tracking
USING btree
(country);
-- Index: ga_content_tracking_dates
-- DROP INDEX ga_content_tracking_dates;
CREATE INDEX ga_content_tracking_dates
ON ga_content_tracking
USING btree
(date);
-- Index: ga_content_tracking_directory_listing
-- DROP INDEX ga_content_tracking_directory_listing;
CREATE INDEX ga_content_tracking_directory_listing
ON ga_content_tracking
USING btree
(directory_listing);
-- Index: ga_content_tracking_news_related
-- DROP INDEX ga_content_tracking_news_related;
CREATE INDEX ga_content_tracking_news_related
ON ga_content_tracking
USING btree
(news_related);
-- Index: ga_content_tracking_organisation_id
-- DROP INDEX ga_content_tracking_organisation_id;
CREATE INDEX ga_content_tracking_organisation_id
ON ga_content_tracking
USING btree
(organisation_id);
-- Index: ga_content_tracking_sales_listing
-- DROP INDEX ga_content_tracking_sales_listing;
CREATE INDEX ga_content_tracking_sales_listing
ON ga_content_tracking
USING btree
(sales_listing);
-- Index: ga_content_tracking_visitor_type
-- DROP INDEX ga_content_tracking_visitor_type;
CREATE INDEX ga_content_tracking_visitor_type
ON ga_content_tracking
USING btree
(visitor_type);
-- Index: ga_content_tracking_yacht_id
-- DROP INDEX ga_content_tracking_yacht_id;
CREATE INDEX ga_content_tracking_yacht_id
ON ga_content_tracking
USING btree
(yacht_id);
Example Query:
superyachts=# SELECT SUM( pageviews ) as cnt, SUM( visits ) as cnt1, AVG( avgtime_on_site ) as avg1 FROM ga_content_tracking gact WHERE TRUE AND ( yacht_id IN ( 7727, 7600, 2352, 7735, 7735, 3739, 7620, 7631, 7633, 7775, 3688, 7779, 3930, 2884, 2724, 2547, 3465, 2324, 4981, 2369, 7820, 4772, 7802, 7249, 4364, 7623, 7803, 7804, 7805, 7957, 7728, 7728, 7252, 8044, 8067, 8016, 8017, 8019, 2726, 2726, 2233, 4549, 6021, 8286, 4773, 8326, 8312, 4881, 8349, 2223, 4387, 2697, 6224, 5947, 4967, 3031, 7494, 7497, 3833, 6594, 6608, 3587, 6599, 3160, 4934, 3122, 4895, 3505, 4980, 8360, 2311, 4885, 2660, 5260, 2362, 2783, 2992, 3286, 3434, 4849, 4117, 2179, 5002, 2187, 5006, 2266, 4900, 4069, 6219, 2951, 3317, 3474, 6218, 3847, 4745, 6480, 4498, 6094, 6312, 6092, 7221, 7220, 2257, 4883, 6518, 2597, 4902, 2638, 2690, 4872, 2790, 6221, 2881, 2887, 3082, 3131, 3141, 3166, 3166, 4887, 4979, 3295, 4886, 4875, 6516, 5005, 3400, 3401, 4990, 3489, 5007, 4882, 4899, 5116, 4082, 4970, 3760, 2202, 2411, 2605, 6291, 6513, 7176, 3029, 8412, 2064, 7826, 4876, 4949, 3626, 5587, 8554, 2837, 5086, 5118, 2058, 4484, 4041, 2953, 8136, 2490, 3291, 4991, 2611, 3590 ) OR organisation_id = 1 ) ;
cnt | cnt1 | avg1
--------+-------+-----------------
640826 | 46418 | 34.408638690454
(1 row)
Time: 114652.334 ms
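(For reference, the plan behind that timing can be inspected by prefixing the query with EXPLAIN ANALYZE. The sketch below abbreviates the IN list to a handful of the ids above purely for readability; the real query would use the full list.)
-- Same aggregate, abbreviated id list, wrapped in EXPLAIN ANALYZE to see
-- where the time goes (sequential scan vs. index usage, etc.):
EXPLAIN ANALYZE
SELECT SUM( pageviews ) AS cnt, SUM( visits ) AS cnt1, AVG( avgtime_on_site ) AS avg1
FROM ga_content_tracking gact
WHERE yacht_id IN ( 7727, 7600, 2352 /* ... full id list as above ... */ ) OR organisation_id = 1;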
I am more than aware that no RDBMS will handle this kind of data in under 10 seconds, especially running on a single machine.
I'm not sure where you get your assumptions from...
test=# create table test (id int not null);
CREATE TABLE
test=# insert into test select i from generate_series(1,2000000) i;
INSERT 0 2000000
test=# alter table test add primary key (id);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "test_pkey" for table "test"
ALTER TABLE
test=# vacuum analyze test;
VACUUM
test=# \timing
Timing is on.
test=# select sum(id), avg(id) from test;
sum | avg
---------------+----------------------
2000001000000 | 1000000.500000000000
(1 row)
Time: 566.079 ms
test=# select sum(t1.id), avg(t1.id) from test t1 natural join test t2;
sum | avg
---------------+----------------------
2000001000000 | 1000000.500000000000
(1 row)
Time: 5891.536 ms
The above is PostgreSQL 9.1-beta running on a 5-year-old MacBook (and not even a Pro one, at that). As you can see, it'll swallow 2M rows in half a second, and join 2M x 2M rows in a bit under 6s.
Point is, Postgres will happily handle this kind of data in under 10s, even on a single machine.
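If you want a test that is closer in shape to ga_content_tracking, a rough sketch along the same lines would be the following (column names are copied from the posted DDL, but the value ranges and row distribution are entirely made up, so treat any timings you get with a grain of salt):
-- Synthetic table roughly shaped like ga_content_tracking; value ranges are invented.
CREATE TABLE ga_test AS
SELECT i AS id,
       (random() * 8000)::int AS yacht_id,
       (random() * 100)::int AS organisation_id,
       (random() * 50)::int AS pageviews,
       (random() * 10)::int AS visits,
       random() * 300 AS avgtime_on_site,
       date '2011-01-01' + (i % 365) AS date
FROM generate_series(1, 2000000) i;
CREATE INDEX ga_test_yacht_id ON ga_test (yacht_id);
CREATE INDEX ga_test_date ON ga_test (date);
VACUUM ANALYZE ga_test;
\timing
SELECT SUM(pageviews) AS cnt, SUM(visits) AS cnt1, AVG(avgtime_on_site) AS avg1
FROM ga_test
WHERE yacht_id IN (7727, 7600, 2352) OR organisation_id = 1;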
I agree with Denis that PostgreSQL shouldn't have much trouble with a couple million rows. But you can also add users to Google Analytics, and let them run against Google's servers.
The pretty colors might keep them out of your hair for a long time.