R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five-year averages of my other variables. When I sat down to do this, I realized the only way I could think of involved a for loop, so I decided it was time to come to Stack Overflow for help.

So, is there an easy way to turn data that looks like this:

country   country.isocode year      POP           ci      grgdpch
Argentina             ARG 1951 17517.34 18.445022145 3.4602044759
Argentina             ARG 1952 17876.96  17.76066507 -7.887407586
Argentina             ARG 1953 18230.82 18.365255769 2.3118720688
Argentina             ARG 1954 18580.56 16.982113434 1.5693778844
Argentina             ARG 1955 18927.82 17.488907008 5.3690276523
Argentina             ARG 1956 19271.51 15.907756547 0.3125559183
Argentina             ARG 1957 19610.54 17.028450999 2.4896639667
Argentina             ARG 1958 19946.54 17.541597134 5.0025894968
Argentina             ARG 1959 20281.15 16.137310492 -6.763501447
Argentina             ARG 1960 20616.01 20.519539628  8.481742144
...
Venezuela             VEN 1997 22361.80 21.923577413  5.603872759
Venezuela             VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela             VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela             VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela             VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela             VEN 2002 24191.77 16.290691319  -8.02535946
Venezuela             VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela             VEN 2004 24904.62 17.147693312 14.644028806
Venezuela             VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela             VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela             VEN 2007 26023.53 26.518210052 4.1367897561

into something like this:

country   country.isocode period   AvPOP     Avci Avgrgdpch
Argentina             ARG      1   18230 17.38474  1.423454
...
Venezuela             VEN     12   25274 21.45343  5.454334

Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?


This is exactly what aggregate is made for:

Df <- data.frame(
    year=rep(1951:1970,2),
    country=rep(c("Arg","Ven"),each=20),
    var1 = c(1:20,51:70),
    var2 = c(20:1,70:51)
)

Level <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE)
id <- c("var1","var2")

> aggregate(Df[id],list(Df$country,Level),mean)
  Group.1     Group.2 var1 var2
1     Arg [1951,1956)    3   18
2     Ven [1951,1956)   53   68
3     Arg [1956,1961)    8   13
4     Ven [1956,1961)   58   63
5     Arg [1961,1966)   13    8
6     Ven [1961,1966)   63   58
7     Arg [1966,1971)   18    3
8     Ven [1966,1971)   68   53

The only thing you might still want to do is rename the period categories and the variable names.
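One way that renaming could look, as a minimal sketch on the same toy data: `labels = FALSE` makes `cut()` return integer period codes instead of interval labels, and a named list in `by` sets the grouping column names. The `Av` prefix and the `period` column name are assumptions chosen to mirror the layout asked for in the question.

```r
# Same toy data as above
Df <- data.frame(
    year = rep(1951:1970, 2),
    country = rep(c("Arg", "Ven"), each = 20),
    var1 = c(1:20, 51:70),
    var2 = c(20:1, 70:51)
)

# labels = FALSE returns integer codes 1, 2, ... rather than "[1951,1956)"
Df$period <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE, labels = FALSE)

vars <- c("var1", "var2")

# A named list in `by` gives the grouping columns readable names
res <- aggregate(Df[vars],
                 by = list(country = Df$country, period = Df$period),
                 FUN = mean)

# Prefix the averaged columns with "Av" (hypothetical naming scheme)
names(res)[names(res) %in% vars] <- paste0("Av", vars)

# Sort by country, then period, to match the layout in the question
res <- res[order(res$country, res$period), ]
res
```

With real data, `vars` would hold the actual column names (for example POP, ci and grgdpch) and the breaks would need to cover the full range of years.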


For this type of problem, the plyr package is truly phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.

library(plyr)
library(zoo)
library(pwt)

# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
        pwt6.3$country %in% c("Argentina", "Venezuela"), 
        c("country", "isocode", "year", "pop", "ci", "rgdpch")
]

# Use rollmean() from zoo to define a 5-period rolling mean
rollmean5 <- function(x){
    rollmean(x, 5)
}

# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))

Here is the output from this:

> head(pwt, 10)
           country isocode year      pop       ci   rgdpch
ARG-1950 Argentina     ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina     ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina     ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina     ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina     ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina     ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina     ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina     ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina     ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina     ARG 1959 20281.15 16.13731 8125.515

> head(pwt.ma)
    country year      pop       ci   rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250

Note that rollmean(), by default, calculates the centred moving mean. To get a left- or right-aligned moving mean instead, give the helper function an align argument and pass it on to rollmean().
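A minimal sketch of such a helper, shown on a plain vector rather than the pwt data. The `align` default and the `fill = NA` padding are assumptions added here for illustration. The padding keeps the result the same length as the input, so the effect of the alignment is visible.

```r
library(zoo)

# Helper with the alignment exposed as an argument (illustrative variant
# of rollmean5 above); fill = NA pads positions without a full window
rollmean5 <- function(x, align = "center") {
    rollmean(x, 5, fill = NA, align = align)
}

x <- 1:10

centred <- rollmean5(x)                   # window centred on each position
right   <- rollmean5(x, align = "right")  # window ends at each position
left    <- rollmean5(x, align = "left")   # window starts at each position

rbind(centred, right, left)
```

The right-aligned version is the usual "trailing" average: each value is the mean of the current observation and the four before it.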

EDIT:

@Joris Meys gently pointed out that you might in fact be after the average for five-year periods.

Here is the modified code to do this:

pwt$period <- cut(pwt$year, seq(1900, 2100, 5)) 
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma

And the output:

> pwt.ma
     country      period   year       pop       ci    rgdpch
1  Argentina (1945,1950] 1950.0 17150.336 13.29214  7736.338
2  Argentina (1950,1955] 1953.0 18226.699 17.80839  7730.874
3  Argentina (1955,1960] 1958.0 19945.149 17.42693  8410.610
4  Argentina (1960,1965] 1963.0 21616.623 19.09067  9000.918
5  Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6  Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7  Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8  Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9  Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0  5009.006 41.07972  7067.947
15 Venezuela (1950,1955] 1953.0  5684.009 44.60849  8132.041
16 Venezuela (1955,1960] 1958.0  6988.078 37.87946  9468.001
17 Venezuela (1960,1965] 1963.0  8451.073 26.93877  9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
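In the output above, the first and last periods have mean years of 1950.0 and 2006.5, which means they cover fewer than five years. A quick check, sketched here on a stand-in vector of years (an assumption: one country observed from 1950 to 2007, as in the pwt extract), counts the observations per period and flags the incomplete ones:

```r
# Stand-in for the year column of one country
years <- 1950:2007

# Same breaks as above; intervals are right-closed, e.g. (1950,1955]
period <- cut(years, seq(1900, 2100, 5))

# Observations per period, ignoring empty periods
n_obs <- table(droplevels(period))

# Periods averaged over fewer than five years
incomplete <- n_obs[n_obs < 5]
incomplete
```

Whether to keep, drop or relabel those partial periods depends on the analysis.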


Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. Plenty of other answers show how; see https://stackoverflow.com/questions/tagged/r+reshape
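As a rough sketch of that approach, assuming the reshape package is installed and reusing the toy data from the aggregate answer (the object names `molten` and `averages` are made up for illustration):

```r
library(reshape)

# Toy data from the aggregate answer
Df <- data.frame(
    year = rep(1951:1970, 2),
    country = rep(c("Arg", "Ven"), each = 20),
    var1 = c(1:20, 51:70),
    var2 = c(20:1, 70:51)
)

# Five-year periods, left-closed: [1951,1956), [1956,1961), ...
Df$period <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE)

# Long format: one row per country, period and variable
molten <- melt(Df, id.vars = c("country", "period"),
               measure.vars = c("var1", "var2"))

# Back to wide format, averaging within each country and period
averages <- cast(molten, country + period ~ variable, fun.aggregate = mean)
averages
```

Listing `measure.vars` explicitly keeps `year` out of the molten data, so it is not averaged along with the other variables.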


There are already base R and plyr answers, so for completeness, here is a dplyr-based answer. Using the toy data given by Joris, we have:

Df <- data.frame(
    year=rep(1951:1970,2),
    country=rep(c("Arg","Ven"),each=20),
    var1 = c(1:20,51:70),
    var2 = c(20:1,70:51)
)

Now, using cut to create the periods, we can then group on them and get the means:

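library(dplyr)

# Refer to the year column directly inside mutate() rather than via Df$year
Df %>% mutate(period = cut(year, seq(1951, 1971, by = 5), right = FALSE)) %>%
  group_by(country, period) %>% summarise(V1 = mean(var1), V2 = mean(var2))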

Source: local data frame [8 x 4]
Groups: country

  country      period V1 V2
1     Arg [1951,1956)  3 18
2     Arg [1956,1961)  8 13
3     Arg [1961,1966) 13  8
4     Arg [1966,1971) 18  3
5     Ven [1951,1956) 53 68
6     Ven [1956,1961) 58 63
7     Ven [1961,1966) 63 58
8     Ven [1966,1971) 68 53