ANOVA over time in Python, what am I doing?

Posted 2023-02-03 20:02

I really like statistics, but haven't taken a course in over 6 years. I'm having trouble figuring out what kind of test I need here, and the best numpy/scipy/R function to use for these kinds of issues.

I've got a table of visitors and their corresponding properties (e.g. "Browser = Mozilla, Referrer = Google"), as well as a variable value per visitor (e.g. $5), grouped into data points over time.

My goal is to:

A) Find the most significant property families, with a score for "how significant" the family is

Example of a conclusion I want to draw:

Referrer has 10x larger effect size upon value-per-visitor than Browser
=> PropertyFamily('browser').significance = 1
=> PropertyFamily('referrer').significance = 10
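
One conventional way to score a whole family is eta squared: the share of the total variance in value that is explained by which group a visitor falls into. The following is only a minimal sketch, assuming per-visitor values are available for each property; the function name `eta_squared` and the sample numbers are invented for illustration.

```python
from statistics import fmean

def eta_squared(groups):
    """Share of total variance in value explained by group membership.

    `groups` maps a property name (e.g. "google") to a list of
    per-visitor values.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = fmean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0

# Hypothetical per-visitor values, invented for illustration only.
by_referrer = {"google": [9.0, 10.0, 11.0], "direct": [1.0, 2.0, 3.0]}
by_browser = {"firefox": [9.0, 2.0, 11.0], "ie": [1.0, 10.0, 3.0]}

referrer_score = eta_squared(by_referrer)
browser_score = eta_squared(by_browser)
```

Eta squared always lies between 0 and 1, so scores for different families are comparable regardless of dataset size. The ratio of two scores is a rough ranking aid rather than a literal "10x larger effect".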

AND

B) Find the most significant properties within families, with significance scores.

Sample of a conclusion I'd like to draw:

GIVEN THAT Value:Baseline => $5/hit
5 hits from IE @ $5/hit (equal to baseline) => no significance
1 hit from Netscape @ $0 => little significance (not enough data)
10 hits from FF @ $10/hit => HIGH significance (hits and delta_value both high)
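
The behaviour in this example (a delta counts for more when there are more hits behind it) is what a t statistic gives: the delta from baseline divided by its standard error. Here is a rough sketch, assuming the baseline is treated as a fixed known value and that per-hit values are available; `t_score` and the sample data are hypothetical. `scipy.stats.ttest_1samp` computes the same statistic and also returns a p-value.

```python
import math
from statistics import fmean, stdev

def t_score(values, baseline):
    """One-sample t statistic of `values` against a fixed baseline mean."""
    n = len(values)
    if n < 2:
        return 0.0  # a single hit gives no estimate of spread
    delta = fmean(values) - baseline
    spread = stdev(values)
    if spread == 0:
        # No variation at all: a nonzero delta is unboundedly "significant".
        return 0.0 if delta == 0 else math.copysign(math.inf, delta)
    return delta / (spread / math.sqrt(n))

# Hypothetical per-hit values matching the example above.
ie = [4.0, 6.0, 5.0, 5.0, 5.0]           # 5 hits averaging $5
netscape = [0.0]                         # 1 hit at $0
firefox = [8.0, 12.0, 9.0, 11.0, 10.0,
           10.0, 9.0, 11.0, 10.0, 10.0]  # 10 hits averaging $10

ie_score = t_score(ie, 5.0)
netscape_score = t_score(netscape, 5.0)
firefox_score = t_score(firefox, 5.0)
```

With these made-up numbers IE and Netscape both score 0 while Firefox scores well above 10, which matches the ordering asked for.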

My questions are:

1) Are there numpy/scipy/R functions to make my life easy here?

2) Can anyone who knows a bit more about ANOVA (analysis of variance) and ANOVA-over-time please provide feedback? I'm not positive that I'm even doing this right, and could be missing something simple. Confirmation and correction are both appreciated.

Note that these are ARRAYS of (hits, values, days) over the last 30 days. For example, if there's a large peak (relative to baseline) in Value-Of-Mozilla on Monday, and a drop (below baseline) in Value-Of-Mozilla on Tuesday, I want Mozilla to show up as a "significant" property (rather than the peak and the drop canceling each other out).
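
To keep a peak and a drop from canceling, the daily gaps can be squared before they are summed, which is the same idea as a between-groups sum of squares. A small sketch, assuming the `(hits, value, day)` tuples hold the day's total value (so value per hit is `value / hits`); the function name and sample data are hypothetical, and the result is a score, not a p-value.

```python
def daily_deviation_score(points, baseline_points):
    """Hits-weighted sum of squared daily gaps between a property's
    value-per-hit and the baseline's value-per-hit on the same day."""
    baseline_epc = {
        day: value / hits for hits, value, day in baseline_points if hits
    }
    score = 0.0
    for hits, value, day in points:
        if hits and day in baseline_epc:
            gap = value / hits - baseline_epc[day]
            score += hits * gap ** 2
    return score

# Hypothetical (hits, value, day) tuples, invented for illustration.
baseline = [(100, 500.0, "mon"), (100, 500.0, "tue")]
mozilla = [(10, 100.0, "mon"), (10, 0.0, "tue")]  # +$5 peak, then -$5 drop
ie = [(10, 50.0, "mon"), (10, 50.0, "tue")]       # at baseline both days

mozilla_score = daily_deviation_score(mozilla, baseline)
ie_score = daily_deviation_score(ie, baseline)
```

Over the two days Mozilla averages exactly the baseline $5 per hit, so a pooled comparison would see no difference, yet the per-day score is clearly nonzero.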

Example of my input data, before map/reducing:

data = {
'baseline': [(hits, value, day) for hits, value, day in last_thirty_days('baseline')],
'browser': {
  'mozilla': [(hits, value, day) for hits, value, day in last_thirty_days('browser', 'mozilla')],
  ... etc ...
  }
}
... etc ...

Here's my current code -- It runs on Dumbo/Hadoop, and provides a number for "significance" that I basically invented the formula for. While my formula works, and gives meaningful data, my values for "significance" aren't well defined (a "significant" property will usually have a score >= 100, but this changes with the size of the dataset) and I know that there's probably a "real formula" for this.

import math

# Runs after each (hits, value, date) tuple has been grouped
# into corresponding "plot points", as they would appear on a graph
pp = PlotPoint(property, date, hits, value)
# Value per hit; converting before dividing avoids integer truncation
pp.epc = float(pp.value) / pp.hits if pp.hits else 0.0

# Finds PlotPoint('baseline', date)
# if pp = PlotPoint('firefox', '1-1-10')
#  then pp.baseline() == PlotPoint('baseline', '1-1-10')
baseline = pp.baseline()
if baseline.hits == 0:
    volume_ratio = 0
else:
    # Share of that day's baseline hits, as a whole-number percentage
    volume_ratio = round(100.0 * pp.hits / baseline.hits)
value_ratio = baseline.epc - pp.epc

# Make up a significance value --
# e.g. sqrt(10% of visitors * ($1 delta from baseline)^2)
pp.significance = math.sqrt(volume_ratio * value_ratio ** 2)

# OK, we have values for each plotpoint, now sum them up
# to get values for the whole property (over a 30day period)
pps = property.plotpoint_set.all()
property.hits = sum(p.hits for p in pps)
property.value = sum(p.value for p in pps)
# Guard against properties with no hits in the period
property.epc = float(property.value) / property.hits if property.hits else 0.0
value_delta = baseline.epc - property.epc

# Make up a significance for the Property, based on each point's significance
property.significance = math.log(
    sum(p.significance ** 2 for p in pps) * abs(value_delta) + 1
)

Thanks in advance!

AFAIK, the statistical tests available in numpy/scipy are fairly basic. You might want to look into R, a language more or less dedicated to statistics, and with a lot of advanced functions available.

Also, I don't think a MANOVA is really what you want to do. MANOVA is for when you have several dependent variables analysed jointly. You have a single dependent variable (value per visitor), so this is really just an ANOVA.

Examples of what you could do in R:

bybrowser = lm(value ~ browser, data=visitors)
anova(bybrowser)
byreferrer = lm(value ~ referrer, data=visitors)
anova(byreferrer)
byreferrerandbrowser = lm(value ~ browser * referrer, data=visitors)
anova(byreferrerandbrowser)
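
For the single-factor models above, the same F statistic can be worked out in plain Python. This is only a sketch of the textbook one-way formula, with a hypothetical function name and made-up data; it does not cover the `browser * referrer` interaction model. In practice `scipy.stats.f_oneway(*groups)` returns the same F along with a p-value, and the `ols` / `anova_lm` pair in statsmodels handles formulas like the R ones.

```python
from statistics import fmean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of samples.

    Assumes at least two groups and some variation within groups.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fmean([v for g in groups for v in g])
    group_means = [fmean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical per-visitor values for two browsers.
firefox_values = [9.0, 10.0, 11.0]
ie_values = [1.0, 2.0, 3.0]

f_stat, df_between, df_within = one_way_f([firefox_values, ie_values])
```

A large F relative to its degrees of freedom means the group means differ by more than the within-group noise would explain.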

Note that this all assumes that your values are normally distributed within each group (strictly speaking, that the model residuals are). You should check this assumption; hist(visitors$value) is a good start. If they're not, either find a way to normalise them (try taking the log), or use an appropriate non-parametric test such as Kruskal-Wallis (kruskal.test in R).
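
As an illustration of the non-parametric route, the Kruskal-Wallis test replaces the values with their ranks, so skewed values matter less. The sketch below computes only the H statistic, without the usual correction for ties; the function name and data are hypothetical. `scipy.stats.kruskal` does the same job with the tie correction and a p-value.

```python
def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Average 1-based rank for each distinct value, so ties share a rank.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

# Hypothetical per-visitor values for two browsers.
h_stat = kruskal_h([[9.0, 10.0, 11.0], [1.0, 2.0, 3.0]])
```

Under the null hypothesis H is approximately chi-squared distributed with (number of groups - 1) degrees of freedom, which is how a p-value is usually obtained.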

Oh, and finally, if you want advice on stats, there's a sister site dedicated to just that: https://stats.stackexchange.com/
