开发者

Optimal way to create a histogram/frequency distribution in Oracle?

I have an events table with two columns eventkey (unique, primary-key) and 开发者_JAVA技巧createtime, which stores the creation time of the event as the number of milliseconds since Jan 1 1970 in a NUMBER column.

I would like to create a "histogram" or frequency distribution that shows me how many events were created in each hour of the past week.

Is this the best way to write such a query in Oracle, using the width_bucket() function? Is it possible to derive the number of rows that fall into each bucket using one of the other Oracle analytic functions rather than using width_bucket to determine what bucket number each row belongs to and doing a count(*) over that?

-- 1305504000000 = 5/16/2011 12:00am GMT
-- 1306108800000 = 5/23/2011 12:00am GMT
select 
timestamp '1970-01-01 00:00:00' + numtodsinterval((1305504000000/1000 + (bucket * 60 * 60)), 'second') period_start,
numevents
from (
  select bucket, count(*) as events from (
    select eventkey, createtime, 
    width_bucket(createtime, 1305504000000, 1306108800000, 24 * 7) bucket
    from events 
    where createtime between 1305504000000 and 1306108800000
  ) group by bucket
) 
order by period_start


If your createtime were a date column, this would be trivial:

SELECT TO_CHAR(CREATE_TIME, 'DAY:HH24'), COUNT(*) 
  FROM EVENTS
 GROUP BY TO_CHAR(CREATE_TIME, 'DAY:HH24');

As it is, casting the createtime column isn't too hard:

select TO_CHAR( 
         TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000), 
         'DAY:HH24') AS BUCKET, COUNT(*)
   FROM EVENTS
  WHERE createtime between 1305504000000 and 1306108800000
 group by TO_CHAR( 
         TO_DATE('19700101', 'YYYYMMDD') + createtime / 86400000), 
         'DAY:HH24') 
 order by 1

If, alternatively, you're looking for the fencepost values (for example, where do I go from the first decile (0-10%) to the next (11-20%), you'd do something like:

select min(createtime) over (partition by decile) as decile_start,
       max(createtime) over (partition by decile) as decile_end,
       decile
  from (select createtime, 
               ntile (10) over (order by createtime asc) as decile
          from events
         where createtime between 1305504000000 and 1306108800000
       )


I'm unfamiliar with Oracle's date functions, but I'm pretty certain there's an equivalent way of writing this Postgres statement:

select date_trunc('hour', stamp), count(*)
from your_data
group by date_trunc('hour', stamp)
order by date_trunc('hour', stamp)


Pretty much the same response as Adam, but I would prefer to keep the period_start as a time field so it is easier to filter further if needed:

with
events as
(
    select rownum eventkey, round(dbms_random.value(1305504000000, 1306108800000)) createtime
    from dual
    connect by level <= 1000 
)
select
    trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH') period_start,
    count(*) numevents
from
    events
where
    createtime between 1305504000000 and 1306108800000
group by
    trunc(timestamp '1970-01-01 00:00:00' + numtodsinterval(createtime/1000, 'second'), 'HH')
order by
    period_start


Using oracle provided function "WIDTH_BUCKET" to accumulate continuous or fine-discrete data. The following example shows a way to create a histogram with 5 buckets and gather "COLUMN_VALUE" from 510 to 520 (so each bucket gets values of range 2). WIDTH_BUCKET will create additional id=0 and num_buckets+1 buckets for values below min and above max.

SELECT "BUCKET_ID", count(*), 
CASE
    WHEN "BUCKET_ID"=0 THEN -1/0F
    ELSE 510+(520-510)/5*("BUCKET_ID"-1)
END "BUCKET_MIN",
CASE
    WHEN "BUCKET_ID"=5+1 THEN 1/0F
    ELSE 510+(520-510)/5*("BUCKET_ID")
END "BUCKET_MAX"
FROM 
(
    SELECT  "COLUMN_VALUE", 
            WIDTH_BUCKET("COLUMN_VALUE", 510, 520, 5) "BUCKET_ID"
    FROM "MY_TABLE"
)
group by "BUCKET_ID"
ORDER BY "BUCKET_ID";

Sample output

 BUCKET_ID   COUNT(*) BUCKET_MIN BUCKET_MAX
---------- ---------- ---------- ----------
         0         45       -Inf   5.1E+002
         1        220   5.1E+002  5.12E+002
         2        189  5.12E+002  5.14E+002
         3         43  5.14E+002  5.16E+002
         4          3  5.16E+002  5.18E+002

In my table, there's no 518-520, so bucket with id=5 is not shown. On the other hand, there's values below min (510), so there's a bucket with id=0, gathering -inf to 510 values.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜