Best technique for processing a big database (parallel computing?)
I've been given the task of extracting all the Google Analytics data for our website, analysing it, loading it into a database, and then producing some reports on that data. The problem is that I end up with almost 2M records in a table. The reports consist of a few SUM() and AVG() queries which, as you can imagine, in some cases take too long (the worst case, with no date filter and wide-ranging criteria depending on the type of report, takes between 8 and 10 minutes). Given that users will have access to those reports, this time is unacceptable. Currently the project is using Postgres. I am more than aware that no RDBMS will handle this kind of data in under 10 seconds, especially running on a single machine.
The question is: what would be the best choice of software and architecture/technique to achieve good results?
I tried MongoDB, but map/reduce really isn't any faster on a single machine, given that it is single-threaded (for now, at least).
I know I am probably looking at a parallel system, but I'm still unsure. I feel comfortable using MongoDB, and I've read that they are improving a lot in terms of clustering and using multiple nodes, but I would love for someone to share other opinions and experience on the most efficient way of doing this, not only in terms of computing power but also in terms of price.
Thanks
P.S. In regard to some of the answers saying that Postgres should handle this amount of data, I am posting some more details:
Table structure:
-- Table: ga_content_tracking
-- DROP TABLE ga_content_tracking;
CREATE TABLE ga_content_tracking
(
  id serial NOT NULL,
  country character varying(255),
  city character varying(255),
  page_title character varying(255),
  page_path character varying(255),
  source character varying(255),
  referral_path character varying(255),
  date date NOT NULL,
  visits integer,
  pageviews integer,
  avgtime_on_site double precision,
  yacht_id integer,
  charter_listing boolean DEFAULT false,
  sales_listing boolean DEFAULT false,
  directory_listing boolean DEFAULT false,
  news_related boolean DEFAULT false,
  visitor_type character varying(30),
  organisation_id integer
)
WITH OIDS;
ALTER TABLE ga_content_tracking OWNER TO postgres;
-- Index: ga_content_tracking_charter_listing
-- DROP INDEX ga_content_tracking_charter_listing;
CREATE INDEX ga_content_tracking_charter_listing
ON ga_content_tracking
USING btree
(charter_listing);
-- Index: ga_content_tracking_country
-- DROP INDEX ga_content_tracking_country;
CREATE INDEX ga_content_tracking_country
ON ga_content_tracking
USING btree
(country);
-- Index: ga_content_tracking_dates
-- DROP INDEX ga_content_tracking_dates;
CREATE INDEX ga_content_tracking_dates
ON ga_content_tracking
USING btree
(date);
-- Index: ga_content_tracking_directory_listing
-- DROP INDEX ga_content_tracking_directory_listing;
CREATE INDEX ga_content_tracking_directory_listing
ON ga_content_tracking
USING btree
(directory_listing);
-- Index: ga_content_tracking_news_related
-- DROP INDEX ga_content_tracking_news_related;
CREATE INDEX ga_content_tracking_news_related
ON ga_content_tracking
USING btree
(news_related);
-- Index: ga_content_tracking_organisation_id
-- DROP INDEX ga_content_tracking_organisation_id;
CREATE INDEX ga_content_tracking_organisation_id
ON ga_content_tracking
USING btree
(organisation_id);
-- Index: ga_content_tracking_sales_listing
-- DROP INDEX ga_content_tracking_sales_listing;
CREATE INDEX ga_content_tracking_sales_listing
ON ga_content_tracking
USING btree
(sales_listing);
-- Index: ga_content_tracking_visitor_type
-- DROP INDEX ga_content_tracking_visitor_type;
CREATE INDEX ga_content_tracking_visitor_type
ON ga_content_tracking
USING btree
(visitor_type);
-- Index: ga_content_tracking_yacht_id
-- DROP INDEX ga_content_tracking_yacht_id;
CREATE INDEX ga_content_tracking_yacht_id
ON ga_content_tracking
USING btree
(yacht_id);
Example Query:
superyachts=# SELECT SUM( pageviews ) as cnt, SUM( visits ) as cnt1, AVG( avgtime_on_site ) as avg1 FROM ga_content_tracking gact WHERE TRUE AND ( yacht_id IN ( 7727, 7600, 2352, 7735, 7735, 3739, 7620, 7631, 7633, 7775, 3688, 7779, 3930, 2884, 2724, 2547, 3465, 2324, 4981, 2369, 7820, 4772, 7802, 7249, 4364, 7623, 7803, 7804, 7805, 7957, 7728, 7728, 7252, 8044, 8067, 8016, 8017, 8019, 2726, 2726, 2233, 4549, 6021, 8286, 4773, 8326, 8312, 4881, 8349, 2223, 4387, 2697, 6224, 5947, 4967, 3031, 7494, 7497, 3833, 6594, 6608, 3587, 6599, 3160, 4934, 3122, 4895, 3505, 4980, 8360, 2311, 4885, 2660, 5260, 2362, 2783, 2992, 3286, 3434, 4849, 4117, 2179, 5002, 2187, 5006, 2266, 4900, 4069, 6219, 2951, 3317, 3474, 6218, 3847, 4745, 6480, 4498, 6094, 6312, 6092, 7221, 7220, 2257, 4883, 6518, 2597, 4902, 2638, 2690, 4872, 2790, 6221, 2881, 2887, 3082, 3131, 3141, 3166, 3166, 4887, 4979, 3295, 4886, 4875, 6516, 5005, 3400, 3401, 4990, 3489, 5007, 4882, 4899, 5116, 4082, 4970, 3760, 2202, 2411, 2605, 6291, 6513, 7176, 3029, 8412, 2064, 7826, 4876, 4949, 3626, 5587, 8554, 2837, 5086, 5118, 2058, 4484, 4041, 2953, 8136, 2490, 3291, 4991, 2611, 3590 ) OR organisation_id = 1 ) ;
cnt | cnt1 | avg1
--------+-------+-----------------
640826 | 46418 | 34.408638690454
(1 row)
Time: 114652.334 ms
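(For reference, the plan behind that timing can be inspected by prefixing the query with EXPLAIN ANALYZE. The sketch below abbreviates the IN list to a handful of the ids above purely for readability; the real query would use the full list.)
-- Same aggregate, abbreviated id list, wrapped in EXPLAIN ANALYZE to see
-- where the time goes (sequential scan vs. index usage, etc.):
EXPLAIN ANALYZE
SELECT SUM( pageviews ) AS cnt, SUM( visits ) AS cnt1, AVG( avgtime_on_site ) AS avg1
FROM ga_content_tracking gact
WHERE yacht_id IN ( 7727, 7600, 2352 /* ... full id list as above ... */ ) OR organisation_id = 1;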
I am more than aware that no RDBMS will handle this kind of data in under 10 seconds, especially running on a single machine.
I'm not sure where you get your assumptions from...
test=# create table test (id int not null);
CREATE TABLE
test=# insert into test select i from generate_series(1,2000000) i;
INSERT 0 2000000
test=# alter table test add primary key (id);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "test_pkey" for table "test"
ALTER TABLE
test=# vacuum analyze test;
VACUUM
test=# \timing
Timing is on.
test=# select sum(id), avg(id) from test;
sum | avg
---------------+----------------------
2000001000000 | 1000000.500000000000
(1 row)
Time: 566.079 ms
test=# select sum(t1.id), avg(t1.id) from test t1 natural join test t2;
sum | avg
---------------+----------------------
2000001000000 | 1000000.500000000000
(1 row)
Time: 5891.536 ms
The above is PostgreSQL 9.1-beta running on a 5-year-old MacBook (and not even a Pro one, at that). As you can see, it'll swallow 2M rows in half a second, and join 2M x 2M rows in a bit under 6s.
Point is, Postgres will happily handle this kind of data in under 10s, even on a single machine.
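If you want a test that is closer in shape to ga_content_tracking, a rough sketch along the same lines would be the following (column names are copied from the posted DDL, but the value ranges and row distribution are entirely made up, so treat any timings you get with a grain of salt):
-- Synthetic table roughly shaped like ga_content_tracking; value ranges are invented.
CREATE TABLE ga_test AS
SELECT i AS id,
       (random() * 8000)::int AS yacht_id,
       (random() * 100)::int AS organisation_id,
       (random() * 50)::int AS pageviews,
       (random() * 10)::int AS visits,
       random() * 300 AS avgtime_on_site,
       date '2011-01-01' + (i % 365) AS date
FROM generate_series(1, 2000000) i;
CREATE INDEX ga_test_yacht_id ON ga_test (yacht_id);
CREATE INDEX ga_test_date ON ga_test (date);
VACUUM ANALYZE ga_test;
\timing
SELECT SUM(pageviews) AS cnt, SUM(visits) AS cnt1, AVG(avgtime_on_site) AS avg1
FROM ga_test
WHERE yacht_id IN (7727, 7600, 2352) OR organisation_id = 1;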
I agree with Denis that PostgreSQL shouldn't have much trouble with a couple million rows. But you can also add users to Google Analytics, and let them run against Google's servers.
The pretty colors might keep them out of your hair for a long time.