Power law curve fitting for social network queries

Twitter recently announced that you can approximate the rank of any given Twitter user with high accuracy by plugging their follower count into the following formula:

exp($a + $b * log(follower_count))

where $a=21 and $b=-1.1

This is obviously a lot more efficient than sorting the entire list of users by follower count just to find the rank of a single user.
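For illustration, the formula can be evaluated directly. This is a minimal sketch; the function name `estimate_rank` is hypothetical, and the defaults are simply the $a and $b values quoted above.

```python
import math

def estimate_rank(follower_count, a=21.0, b=-1.1):
    """Approximate rank as exp(a + b * log(follower_count)), using the natural log."""
    if follower_count <= 0:
        raise ValueError("follower_count must be positive")
    return math.exp(a + b * math.log(follower_count))

# More followers should give a smaller (better) estimated rank.
for followers in (100, 10_000, 1_000_000):
    print(followers, round(estimate_rank(followers)))
```

The cost is one logarithm and one exponential per lookup, independent of how many users exist.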

If you have a similar data set from a different social site, how could you derive the values for $a and $b to fit that data set? Basically, some list of frequencies whose distribution is assumed to follow a power law.


You have the following model:

y = exp(a + b * log(x))

which is equivalent to:

log(y) = a + b * log(x)

Therefore, if you take logs of your data set, you end up with a linear model, so you can then use linear regression to determine the best-fit values of a and b.
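As a rough sketch of that approach, the following fits a straight line to the logged data with ordinary least squares, using only the standard library. The function name `fit_power_law` and the synthetic data are assumptions for illustration; real data will not lie exactly on the line.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = a + b * log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope and intercept of the simple linear regression of ly on lx.
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic example: ranks generated from the values quoted in the question.
followers = [10, 100, 1_000, 10_000, 100_000]
ranks = [math.exp(21 - 1.1 * math.log(f)) for f in followers]
a, b = fit_power_law(followers, ranks)
print(a, b)  # close to 21 and -1.1, up to floating-point error
```

Here x is the follower count and y is the rank, so every x and y must be strictly positive for the logs to be defined.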

However, this all sounds pretty meaningless to me. Who's to say that a given networking site determines user rank using this sort of relationship?


You could use the Microsoft Excel add-in named "Solver". It is included with Excel, but not always loaded by default. Search for "add-in" and "Solver" in the help for your Excel version and load it.

After installing the add-in, do the following:

  1. Create a new worksheet. In column A, starting at row 2, put the ID of each individual (optional).

  2. In column B, put the number of followers.

  3. If the data is not already sorted, sort it by column B in descending order, so that the user with the most followers comes first.

  4. In column C, put the ranking (1, 2, 3, etc.).

  5. Put the value 21 in cell D1 and -1.1 in cell E1. Those are the Twitter values for $a and $b. They are our starting values and will probably change.

  6. In cell D2, put a formula like this: =EXP($D$1+$E$1*LN(B2)). Use LN rather than LOG, because LOG defaults to base 10 in Excel.

  7. Copy the formula in D2 down to the end of the data.

  8. In cell E2, put a formula that compares the actual ranking with the result of the formula, e.g., the squared error =(C2-D2)^2. The closer the actual and predicted values are, the closer this is to 0.

  9. Copy cell E2 down to the end of the data.

  10. Below the data, in column E, sum the squared errors. For example, if your data has 10,000 values in rows 2 to 10001, enter =SUM(E2:E10001) in cell E10002.

  11. Go to the Data menu and look for "Solver". Its location may vary depending on your version of Excel. Use the "Help" facility to search for Solver if you cannot find it.

  12. Follow the instructions in Help to use the Solver add-in. The changing cells are D1 and E1, and the goal is to minimize E10002 (the sum of the squared errors), making it as close to zero as possible.
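The quantity that Solver is asked to minimize can also be written outside Excel. The following is a minimal sketch, assuming the per-row error is the squared difference between actual and predicted rank and that every follower count is positive. The function name `sum_squared_error` and the synthetic counts are hypothetical.

```python
import math

def sum_squared_error(follower_counts, a, b):
    """Rank users by followers (descending), predict each rank, sum the squared errors."""
    ordered = sorted(follower_counts, reverse=True)
    total = 0.0
    for rank, followers in enumerate(ordered, start=1):
        predicted = math.exp(a + b * math.log(followers))
        total += (rank - predicted) ** 2
    return total

# Synthetic counts constructed so that rank = exp(a_true + b_true * log(followers)).
a_true, b_true = 2.0, -0.5
counts = [math.exp((math.log(r) - a_true) / b_true) for r in range(1, 6)]

print(sum_squared_error(counts, a_true, b_true))  # essentially zero
print(sum_squared_error(counts, 2.0, -0.6))       # larger, since b is off
```

Any general-purpose minimizer could then search over a and b to reduce this value, which is the role Solver plays in the worksheet.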
