
How can I correlate pageviews with memory spikes?

I'm having some memory problems with an application, but it's a bit difficult to figure out exactly where the problem is. I have two sets of data:

Pageviews

  • The page that was requested
  • The time said page was requested

Memory use

  • The amount of memory being used
  • The time this memory use was recorded

I'd like to see exactly which pageviews are correlated with high memory usage. My guess is that I'll be doing a t-test of some kind to determine which pageviews are correlated with increased memory usage. However, I'm a bit uncertain as to what kind of t-test to go with. Can someone at least point me in the right direction?


I would suggest constructing a dataset with two columns. The first would be the proportion of each page's appearances in the highest-memory-usage part of the distribution, and the second the proportion of those same pages in the rest of the memory distribution.

Then you would perform a paired test to check whether the median of the differences (high - rest) is less than or equal to zero (H0), against the alternative hypothesis that the median of the differences is greater than zero (H1). I would suggest the non-parametric Wilcoxon Signed Ranks Test, which is the paired-sample counterpart of the Mann-Whitney test. It also takes into account the magnitude of the differences in each pair, something that other tests (e.g. the sign test) ignore.

Keep in mind that ties (zero differences) cause numerous problems in the derivations of non-parametric methods and should be avoided. A preferable way to deal with ties is to add a slight bit of "noise" to the data: that is, run the test after modifying tied values by adding a random variable small enough that it does not affect the ranking of the differences.

I hope the test's results, together with a plot of the distribution of the differences, will give you insight into where the problem is.

The Wilcoxon Signed Ranks Test has a ready-made implementation in the R language.
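As a rough sketch of that approach (the vector names, the example numbers, and the size of the tie-breaking noise are assumptions made for illustration, not part of the original answer), wilcox.test in base R performs the paired test directly:

    # Sketch: paired Wilcoxon Signed Ranks Test on per-page proportions.
    # prop_high - proportion of each page's appearances in the high-memory times
    # prop_rest - proportion of the same pages in the rest of the distribution
    # (names and numbers are made up for illustration)
    prop_high <- c(0.30, 0.25, 0.20, 0.15, 0.10)
    prop_rest <- c(0.20, 0.25, 0.25, 0.18, 0.12)

    # Break ties (zero differences) with noise far smaller than any real difference
    tied <- (prop_high - prop_rest) == 0
    prop_high[tied] <- prop_high[tied] + runif(sum(tied), -1e-9, 1e-9)

    # H0: median(high - rest) <= 0  vs  H1: median(high - rest) > 0
    wilcox.test(prop_high, prop_rest, paired = TRUE, alternative = "greater")

    # Plot the distribution of the differences for a visual check
    hist(prop_high - prop_rest, main = "high - rest differences", xlab = "difference")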


Jason,

You ask good statistical questions. Think about the amount of memory being used as a random variable. The first step is to look at the distribution of this r.v. It may not fit any known distribution, but don't let that stop you. One simple approach would be to take the highest memory usage (top 5-10%) and see if those pageviews (or the times when they were requested) are any different from the pageviews for the rest. I think you'll need some non-parametric test that compares the proportion of each pageview in the low-memory sample to its proportion in the high-memory sample. Hope this helps.
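A rough sketch of that comparison in R (the data frame, its column names, and the fake data are assumptions made for illustration; in practice you would first join each pageview to the nearest memory reading by time):

    # Sketch: compare page frequencies between the top-10% memory times and the rest.
    # Assumes a data frame `df` with one row per pageview and columns `page`, `memory`.
    set.seed(1)
    df <- data.frame(
      page   = sample(c("A", "B", "C", "D"), 500, replace = TRUE),
      memory = rexp(500, rate = 1 / 200)   # fake memory readings in MB
    )

    cutoff <- quantile(df$memory, 0.90)                   # top 10% of memory usage
    group  <- ifelse(df$memory >= cutoff, "high", "rest")

    tab <- table(df$page, group)                          # page counts per group
    prop.table(tab, margin = 2)                           # page proportions per group

    chisq.test(tab)                                       # are the proportions different?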


What you pose is certainly an interesting statistical problem, but might I suggest a graphical approach with a good ol' spreadsheet instead?

Assign each of your pages a unique number, and make a scatter plot of page # vs memory usage. You should get a bunch of vertical lines of markers. Hopefully the culprit will be obvious.

If there are so many data points that the lines turn solid, then you can add a small amount of noise to the page numbers to broaden the lines. If the requests are overlapping then you may have to try tricks like dividing the memory by the number of concurrent requests, but your eyes should be able to pick out the offender even with a lot of noise.
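A rough sketch of that plot in R (the data frame, its column names, and the fake data are assumptions made for illustration; jitter supplies the small amount of noise mentioned above):

    # Sketch: scatter plot of page number vs. memory usage at request time.
    set.seed(1)
    df <- data.frame(
      page   = sample(c("home", "search", "report", "export"), 1000, replace = TRUE),
      memory = rexp(1000, rate = 1 / 200)
    )

    pages      <- sort(unique(df$page))
    df$page_id <- match(df$page, pages)      # assign each page a unique number

    plot(jitter(df$page_id, amount = 0.2), df$memory,
         xaxt = "n", pch = 16, cex = 0.5,
         xlab = "page", ylab = "memory usage (MB)")
    axis(1, at = seq_along(pages), labels = pages)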


Here is another idea: if you are able to join Pageviews and Memory use by their timestamp values, you could form a table like this:

Page A | Page B | Page C | Page D | Page E |....| Memory_use

The value in each of the page columns might be a binary indicator (0 or 1), showing whether or not the page was requested, or a count of requests, depending on your data. In the Memory_use column you could have the relevant memory load as a proportion, or a count in MB. In this way, Memory_use can be thought of as a dependent variable and the pages as explanatory ones. So you could fit an appropriate (depending on the form of the dependent variable) generalized linear model to this dataset. The results of this analysis will give you insight into the following:

  • Which pages significantly affect the value of memory use
  • The extent to which each page contributes to the load (by its coefficient in the model)
  • The possibility that other factors, not measured, play a significant role in memory load (overdispersion), with the worst case being that all the predictor variables turn out to be unimportant.
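A rough sketch of fitting such a model in R (the data frame, column names, fake data, and the choice of a Gamma family are all assumptions made for illustration; the appropriate family depends on how Memory_use is actually measured):

    # Sketch: memory use as the response, 0/1 page indicators as predictors.
    # Each row is one memory reading joined to the pageviews in the same time window.
    set.seed(1)
    obs <- data.frame(
      memory = rexp(300, rate = 1 / 200),   # fake memory readings in MB
      page_a = rbinom(300, 1, 0.3),         # 1 if page A was requested in that window
      page_b = rbinom(300, 1, 0.2),
      page_c = rbinom(300, 1, 0.1)
    )

    # A Gamma GLM with a log link is one reasonable choice for positive, skewed memory values
    fit <- glm(memory ~ page_a + page_b + page_c,
               data = obs, family = Gamma(link = "log"))

    summary(fit)   # which pages significantly affect memory use, and by how much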
