xtpcse from Stata - how to rewrite in R
I am currently learning R. I have no previous knowledge of STATA.
I want to reanalyze a study which was done in Stata (xtpcse linear regression with panel-corrected standard errors). I could not find the model or more detailed code in Stata or any other hint how to rewrite this in R. I have the plm package for econometrics installed for R. That's as far as I got.
The first lines of the .do file from STATA are copied below (I just saw that it's pretty unreadable. Here is a link to the txt file in which I copied the .do content: http://dl.dropbox.com/u/4004629/This%20was%20in%20the%20.do%20file.txt). I have no idea of how to go about this in a better way. I tried google-ing STATA and R comparison and the like but it did not work.
All data for the study I want to replicate are here:
https://umdrive.memphis.edu/rblanton/public/ISQ_data
---STATA---
Group variable: c_code Number of obs = 265
Time variable: year Number of groups = 27
Panels: correlated (unbalanced) Obs per group: min = 3
Autocorrelation: common AR(1) avg = 9.814815
Sigma computed by pairwise selection max = 14
Estimated covariances = 378 R-squared = 0.8604
Estimated autocorrelations = 1 Wald chi2(11) = 8321.15
Estimated coefficients = 15 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Panel-corrected
food | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lag_food | .8449038 .062589 13.50 0.000 .7222316 .967576
ciri | -.010843 .0222419 -0.49 0.626 -.0544364 .0327504
human_cap | .0398406 .0142954 2.79 0.005 .0118222 .0678591
worker_rts | -.1132705 .0917999 -1.23 0.217 -.2931951 .066654
polity_4 | .0113995 .014002 0.81 0.416 -.0160439 .0388429
market_size | .0322474 .0696538 0.46 0.643 -.1042716 .1687665
income | .0382918 .0979499 0.39 0.696 -.1536865 .2302701
econ_growth | .0145589 .0105009 1.39 0.166 -.0060224 .0351402
log_trade | -.3062828 .1039597 -2.95 0.003 -.5100401 -.1025256
fix_dollar | -.0351874 .1129316 -0.31 0.755 -.2565293 .1861545
fixed_xr | -.4941214 .2059608 -2.40 0.016 -.897797 -.0904457
xr_fluct | .0019044 .0106668 0.18 0.858 -.0190021 .0228109
lab_growth | .0396278 .0277936 1.43 0.154 -.0148466 .0941022
english | -.1594438 .1963916 -0.81 0.417 -.5443641 .2254766
_cons | .4179213 1.656229 0.25 0.801 -2.828227 3.66407
-------------+----------------------------------------------------------------
rho | .0819359
------------------------------------------------------------------------------
. xtpcse fab_metal lag_fab_metal ciri human_cap worker_rts polity_4 market
> income econ_growth log_trade fix_dollar fixed_xr xr_fluct lab_growth
> english, pairwise corr(ar1)
Update:
I just tried Vincent's code. I tried the pcse2 and vcovBK code, and they both worked (even though I'm not sure what to do with the correlation matrix that comes out of vcocBK).
However, I still have troubles reproducing the estimates of the regression coefficients in the paper I'm reanalyzing. I'm following their recipe as good as I can, the only step I'm missing is, I think, the part where in Stata "Autocorrelation: common AR(1)" is done. The paper I'm analyzing says: "OLS regression using panel corrected standard errors (Beck/Katz '95), control for first order correlation within each panel (corr AR1 option in Stata)."
How do I control for first order correlation within each panel in R?
Here is what I did so far on my data:
## run lm
res.lm <- lm(total_FDI ~ ciri + human_cap + worker_rts + polity开发者_JAVA百科_4 + lag_total + market_size + income + econ_growth + log_trade + fixed_xr + fix_dollar + xr_fluct + english + lab_growth, data=D)
## run pcse
res.pcse <- pcse2(res.lm,groupN="c_code",groupT="year",pairwise=TRUE)
As Ramnath mentioned, the pcse package will do what Stata's xtpcse
does. Alternatively, you could use the vcovBK()
function from the plm package. If you opt for the latter option, make sure you use the cluster='time'
option, which is what the Beck & Katz (1995) article suggests and what the Stata command implements.
The pcse
package works well, but there are some issues that makes a lot of intuitive user inputs unacceptable, especially if your dataset is unbalanced. You might want to try this re-write of the function that I coded a while ago. Just load the pcse
package, load the pcse2
function, and use it by following the instructions in the pcse documentation. IMHO, the function pasted below is cleaner, more flexible and more robust than the one provided by the pcse
folks. Simple benchmarks also suggest that my version may be 5 to 10 times faster than theirs, which may matter for big datasets.
Good luck!
library(Matrix)
pcse2 <- function(object, groupN, groupT, pairwise=TRUE){
## Extract basic model info
groupT <- tail(as.character((match.call()$groupT)), 1)
groupN <- tail(as.character((match.call()$groupN)), 1)
dat <- eval(parse(text=object$call$data))
## Sanity checks
if(!"lm" %in% class(object)){stop("Formula object must be of class 'lm'.")}
if(!groupT %in% colnames(dat)){stop(paste(groupT, 'was not found in data', object$call$data))}
if(!groupN %in% colnames(dat)){stop(paste(groupN, 'was not found in data', object$call$data))}
if(anyDuplicated(paste(dat[,groupN], dat[,groupT]))>0){stop(paste('There are duplicate groupN-groupT observations in', object$call$data))}
if(length(dat[is.na(dat[,groupT]),groupT])>0){stop('There are missing unit indices in the data.')}
if(length(dat[is.na(dat[,groupN]),groupN])>0){stop('There are missing time indices in the data.')}
## Expand model frame to include groupT, groupN, resid columns.
f <- as.formula(object$call$formula)
f.expanded <- update.formula(f, paste(". ~ .", groupN, groupT, sep=" + "))
dat.pcse <- model.frame(f.expanded, dat)
dat.pcse$e <- resid(object)
## Extract basic model info (part II)
N <- length(unique(dat.pcse[,groupN]))
T <- length(unique(dat.pcse[,groupT]))
nobs <- nrow(dat.pcse)
is.balanced <- length(resid(object)) == N * T
## If balanced dataset, calculate as in Beck & Katz (1995)
if(is.balanced){
dat.pcse <- dat.pcse[order(dat.pcse[,groupN], dat.pcse[,groupT]),]
X <- model.matrix(f, dat.pcse)
E <- t(matrix(dat.pcse$e, N, T, byrow=TRUE))
Omega <- kronecker((crossprod(E) / T), Matrix(diag(1, T)) )
## If unbalanced and pairwise, calculate as in Franzese (1996)
}else if(pairwise==TRUE){
## Rectangularize
rectangle <- expand.grid(unique(dat.pcse[,groupN]), unique(dat.pcse[,groupT]))
names(rectangle) <- c(groupN, groupT)
rectangle <- merge(rectangle, dat.pcse, all.x=TRUE)
rectangle <- rectangle[order(rectangle[,groupN], rectangle[,groupT]),]
valid <- ifelse(is.na(rectangle$e),0,1)
rectangle[is.na(rectangle)] <- 0
X <- model.matrix(f, rectangle)
X[valid==0,1] <- 0
## Calculate pcse
E <- crossprod(t(matrix(rectangle$e, N, T, byrow=TRUE)))
V <- crossprod(t(matrix(valid, N, T, byrow=TRUE)))
if (length(V[V==0]) > 0){stop("Error! A CS-unit exists without any obs or without any obs in a common period with another CS-unit. You must remove that unit from the data passed to pcse().")}
Omega <- kronecker(E/V, Matrix(diag(1, T)))
## If unbalanced and casewise, caluate based on largest rectangular subset of data
}else{
## Rectangularize
rectangle <- expand.grid(unique(dat.pcse[,groupN]), unique(dat.pcse[,groupT]))
names(rectangle) <- c(groupN, groupT)
rectangle <- merge(rectangle, dat.pcse, all.x=TRUE)
rectangle <- rectangle[order(rectangle[,groupN], rectangle[,groupT]),]
valid <- ifelse(is.na(rectangle$e),0,1)
rectangle[is.na(rectangle)] <- 0
X <- model.matrix(f, rectangle)
X[valid==0,1] <- 0
## Keep only years for which we have the max number of observations
large.panels <- by(dat.pcse, dat.pcse[,groupT], nrow) # How many valid observations per year?
if(max(large.panels) < N){warning('There is no time period during which all units are observed. Consider using pairwise estimation.')}
T.balanced <- names(large.panels[large.panels==max(large.panels)]) # Which years have max(valid observations)?
T.casewise <- length(T.balanced)
dat.balanced <- dat.pcse[dat.pcse[,groupT] %in% T.balanced,] # Extract biggest rectangular subset
dat.balanced <- dat.balanced[order(dat.balanced[,groupN], dat.balanced[,groupT]),]
e <- dat.balanced$e
## Calculate pcse as in Beck & Katz (1995)
E <- t(matrix(dat.balanced$e, N, T.casewise, byrow=TRUE))
Omega <- kronecker((crossprod(E) / T.casewise), Matrix(diag(1, T)))
}
## Finish evaluation, clean and output
salami <- t(X) %*% Omega %*% X
bread <- solve(crossprod(X))
sandwich <- bread %*% salami %*% bread
colnames(sandwich) <- names(coef(object))
row.names(sandwich) <- names(coef(object))
pcse <- sqrt(diag(sandwich))
b <- coef(object)
tstats <- b/pcse
df <- nobs - ncol(X)
pval <- 2*pt(abs(tstats), df, lower.tail=FALSE)
res <- list(vcov=sandwich, pcse=pcse, b=b, tstats=tstats, df=df, pval=pval, pairwise=pairwise,
nobs=nobs, nmiss=(N*T)-nobs, call=match.call())
class(res) <- "pcse"
return(res)
}
Look at the pcse package, which considers panel corrected standard errors. You certainly have to look at the documentation in STATA to figure out the assumptions made and cross check that with pcse.
精彩评论