Creating "GDP in 1960" variable from GDP variables for different years
I'm pretty new to Stata.
I have a set of observations of the form "Country GDP Year". I want to create a new variable GDP1960, which gives the GDP in 1960 of each country for each year:
USA     $100m  1960         USA     $100m  1960  $100m
USA     $200m  1965   -->   USA     $200m  1965  $100m
Canada  $60m   1960         Canada  $60m   1960  $60m
What's the right syntax to make this happen? (I assume egen is involved in some mysterious way.)
You've found a solution with cond(), but here are a couple of suggestions that might make modeling your data easier and help you avoid the sorting problems that can arise from creating your rank variable (the egen solution you asked about is included too):
Paste the code below into your do-file editor and run it:
*---------------------------------BEGIN EXAMPLE
clear
inp str20 country str10 gdp year
"USA" "$100m" 1960
"USA" "$200m" 1965
"Canada" "$60m" 1960
"Canada" "$120m" 1965
"USA" "$250m" 1970
"Mexico" "$90m" 1970
"Canada" "$800m" 1970
"Mexico" "$160m" 1960
"Mexico" "$220m" 1965
"Mexico" "$350m" 1975
end
//1. destring gdp so that we can work with it
destring gdp, ignore("m$") replace
//2. Create GDP for 1960 var:
bys country: g x = gdp if year==1960
bys country: egen gdp60 = max(x)
drop x
**you could also create balanced panels to see gaps in your data**
preserve
ssc install panels
panels country year
fillin country year
li //take a look at the results win. to see how filled panel data would look
restore
//3. create a gdp variable for each year (reshape the dataset)
drop gdp60
reshape wide gdp, i(country) j(year)
**much easier to use this format for modeling
su gdp1970
**here's a fake "outcome" or response variable to work with**
g outcome = 500+int((1000-500+1)*runiform())
anova outcome gdp1960-gdp1970 //or whatever makes sense for your situation
*---------------------------------END EXAMPLE
A one-line solution is
egen gdp60 = mean(gdp / (year == 1960)), by(country)
The trick here is the division by the expression year == 1960. This is true for 1960, in which case we divide by 1, which leaves the gdp for that year unchanged. It is false for all other years, in which case we divide by 0. That sounds crazy, but the consequence whenever we divide by zero is just missing values, which will be ignored by egen's mean() function.
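You could use other egen functions, as in this case there should be at most one value for 1960 for each country, so e.g. max() and min() should work too. total() also works, but only with its missing option; without it, total() returns 0 rather than missing when every value is missing. (With mean(), max() or min(), if a country has no value for 1960, or a missing value, we will end up with missing, which is precisely as it should be.)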
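To see why the division trick works, it can help to look at the intermediate values. The sketch below uses a small made-up dataset with gdp already numeric; the variable name ratio is only for illustration and is not part of the one-line solution.

```stata
clear
input str10 country gdp year
"USA"    100 1960
"USA"    200 1965
"Canada"  60 1960
"Canada" 120 1965
end

* (year == 1960) is 1 or 0, so ratio is gdp in 1960 and missing otherwise
gen ratio = gdp / (year == 1960)

* mean() skips the missing values, leaving each country's 1960 value
egen gdp60 = mean(gdp / (year == 1960)), by(country)

list country year gdp ratio gdp60, sepby(country)
```

Listing ratio next to gdp60 makes it clear that each country contributes at most one non-missing value to the mean.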
Discussion at http://www.stata-journal.com/article.html?article=dm0055
Well, I found a solution in the end. It relies on the fact that generate and replace work on the data in its sorted order, and that you can refer to the current observation with _n.
gen rank = 100
replace rank = 50 if year == 1960
gen gdp60 = .
sort country rank
replace gdp60 = cond(country == country[_n-1], gdp60[_n-1], gdp[_n])
drop rank
sort country year
EDIT: A more direct solution with the same flavour:
gen wanted = year == 1960
bysort country (wanted) : gen gdp60 = gdp[_N]
drop wanted
sort country year
Here wanted will be 1 for 1960 and 0 otherwise, so sorting on it within country puts the 1960 observation last, where gdp[_N] picks it up.
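One caveat with the sorting approach: if a country has no 1960 observation at all, gdp[_N] still returns a value, just from some other year. A minimal sketch of a guard, assuming a made-up dataset in which Mexico lacks 1960 (gdp60_raw is a hypothetical name used only for comparison):

```stata
clear
input str10 country gdp year
"USA"    100 1960
"USA"    200 1965
"Mexico" 220 1965
"Mexico" 350 1975
end

gen wanted = year == 1960

* Unguarded: Mexico gets the GDP of whichever row happens to sort last
bysort country (wanted) : gen gdp60_raw = gdp[_N]

* Guarded: keep the value only when the last row really is 1960
bysort country (wanted) : gen gdp60 = gdp[_N] if wanted[_N]

drop wanted
sort country year
list, sepby(country)
```

With the guard, countries without a 1960 observation end up with a missing gdp60 rather than a misleading number.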
I can't think of anything shorter than these two lines:
gen temp = gdp if year == 1960
bysort country : egen gdp60 = max(temp)
If you want a variable for each year (e.g., gdp60, gdp61, gdp62, ...) then you probably should use reshape.