How to create multiline graphs from a collection of rdta files

I have over 100 survey data files with the following filename structure in a common directory:

BD-1994.rdta
BD-1996.rdta
BD-1999.rdta
BD-2004.rdta
BF-1992.rdta
...
UG-1988.rdta
UG-1995.rdta
UG-2001.rdta
VN-1992.rdta
VN-1997.rdta

The leading two letters (e.g. "BD") represent a specific country (by its ISO code) and the four digits represent the year of a given survey.

I would like to process these data so I can create one multi-line, time-series graph of fertility rates per country where each line represents a year of the survey. For example, the first graph will be for "BD" (Bangladesh) and will display four time-series for years 1994, 1996, 1999, and 2004.

The structure of the individual files is as follows:

time     fertility
1        3.2
2        2.6
...      ...
7        2.4 

My idea at the moment is to use rbind within a for loop and create one massive dataset with all the data in it. Then I need to split the data neatly by country code, perhaps using a function like "subset" (though subset doesn't look like the right tool for the job).

Any suggestions on how to perform this data management so I can then call the plot function in R on a dataframe that contains the survey data for all years within a given country?

Thank you


Here is one approach using ggplot2 and plyr. The basic idea is to create two helper functions to (a) extract the data from each .rdta file into a data frame and (b) plot the time series for each country. Once these functions are defined, it is relatively straightforward to use plyr functions to loop through the files and produce the required graphs. I would suggest that you run this code on your data and report back with any errors that you get, since I am unable to test my code in the absence of any data.

library(plyr)
library(ggplot2)

# function to extract data frame from each .rdta file
# (assumes each file was written with save() and holds exactly one object)
get_data_frame = function(file_name){
    temp_env = new.env()
    load(file_name, temp_env)
    mydata  = get(ls(envir = temp_env)[1], temp_env)
    base    = basename(file_name)
    country = substr(base, 1, 2)   # e.g. "BD"
    year    = substr(base, 4, 7)   # e.g. "1994", kept as text so it groups discretely
    df = data.frame(mydata, country, year, stringsAsFactors = FALSE)
    return(df)
}

# function to save time series plot of fertility, one line per survey year
plot_country_data = function(country_df){
    country = country_df$country[1]
    p1 = ggplot(country_df, aes(x = time, y = fertility)) +
         geom_line(aes(group = year, colour = year)) +
         ggtitle(country)
    # name the pdf after the country code, e.g. "BD.pdf"
    ggsave(filename = paste(country, ".pdf", sep = ""), plot = p1)
}

# list all .rdta files in working directory
rdata_files = list.files(pattern = '\\.rdta$')

# consolidate data into one big data frame
big_data   = ldply(rdata_files, get_data_frame)

# plot data for each country and save as pdf
d_ply(big_data, .(country), plot_country_data)
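
If you would rather stay in base R, as the question suggests with rbind, the same steps can be sketched without plyr or ggplot2. This is only a rough sketch under some assumptions: the files are taken to have been written with save() and to hold a single data frame with columns time and fertility, and the demo data, the helper names read_survey and plot_country, and the output file name are all made up here so the example runs on its own.

```r
# Synthetic stand-ins for the survey files (assumption: the real .rdta files
# were written with save() and each holds one data frame).
data_dir <- file.path(tempdir(), "rdta_demo")
dir.create(data_dir, showWarnings = FALSE)
for (f in c("BD-1994", "BD-1996", "UG-1988")) {
    survey <- data.frame(time = 1:7, fertility = seq(3.2, 2.0, length.out = 7))
    save(survey, file = file.path(data_dir, paste0(f, ".rdta")))
}

# Hypothetical helper: read one file and tag its rows with the country code
# and survey year taken from the file name.
read_survey <- function(path) {
    env <- new.env()
    obj_name <- load(path, envir = env)[1]   # load() returns the object names
    base <- basename(path)
    data.frame(get(obj_name, envir = env),
               country = substr(base, 1, 2),
               year    = substr(base, 4, 7),
               stringsAsFactors = FALSE)
}

files <- list.files(data_dir, pattern = "\\.rdta$", full.names = TRUE)

# lapply + do.call(rbind, ...) replaces rbind inside a for loop
big_data <- do.call(rbind, lapply(files, read_survey))

# split() gives one data frame per country, named by ISO code
by_country <- split(big_data, big_data$country)

# Hypothetical helper: one line per survey year, using base graphics
plot_country <- function(df) {
    years <- unique(df$year)
    plot(range(df$time), range(df$fertility), type = "n",
         xlab = "time", ylab = "fertility", main = df$country[1])
    for (i in seq_along(years)) {
        one <- df[df$year == years[i], ]
        lines(one$time, one$fertility, col = i)
    }
    legend("topright", legend = years, lty = 1, col = seq_along(years))
}

# One page per country in a single pdf
out_file <- file.path(data_dir, "fertility_by_country.pdf")
pdf(out_file)
invisible(lapply(by_country, plot_country))
invisible(dev.off())
```

split() is probably the tool you were looking for instead of subset: it returns a named list, so by_country$BD is the data frame holding every Bangladesh survey year.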