Efficient alternative to merge() when building dataframe from json files with R?
I have written the following code which works, but is painfully slow once I start executing it over thousands of records:
require("RJSONIO")
people_data <- data.frame(person_id=numeric(0))
json_data <- fromJSON(json_file)
n_people <- length(json_data)
for(person in 1:n_people) {
person_dataframe <- as.data.frame(t(unlist(json_data[[person]])))
people_data <- merge(people_data, person_dataframe, all=TRUE)
}
output_file <- paste("people_data",".csv")
write.csv(people_data, file=output_file)
I am attempting to build a unified data table from a series of json-formated files. The fromJSON()
function reads in the data as lists of lists. Each element of the list is a person, which then contains a list of the attributes for that person.
For example:
[[1]]
person_id
name
gender
hair_color
[[2]]
person_id
name
location
gender
height
[[...]]
structure(list(person_id = "Amy123", name = "Amy", gender = "F",
hair_color = "开发者_JAVA技巧brown"),
.Names = c("person_id", "name", "gender", "hair_color"))
structure(list(person_id = "matt53", name = "Matt",
location = structure(c(47231, "IN"),
.Names = c("zip_code", "state")),
gender = "M", height = 172),
.Names = c("person_id", "name", "location", "gender", "height"))
The end result of the code above is matrix where the columns are every person-attribute that appears in the structure above, and the rows are the relevant values for each person. As you can see though, some data is missing for some of the people, so I need to ensure those show up as NA
and make sure things end up in the right columns. Further, location
itself is a vector with two components: state
and zip_code
, meaning it needs to be flattened to location.state
and location.zip_code
before it can be merged with another person record; this is what I use unlist()
for. I then keep the running master table in people_data
.
The above code works, but do you know of a more efficient way to accomplish what I'm trying to do? It appears the merge()
is slowing this to a crawl... I have hundreds of files with hundreds of people in each file.
Thanks! Bryan
UPDATE: Based on the feedback below, I tried to build a list of all the people, and then convert it all at one time into a dataframe. I let it run overnight and still didn't finish making the dataframe. There are around 1/2 million people in the list. That codes looks like this:
require("RJSONIO")
require("plyr")
people_data <- data.frame(person_id=numeric(0))
people_list <- list()
json_data <- fromJSON(json_file)
n_people <- length(json_data)
for(person in 1:n_people) {
people_list[[person]] <- t(unlist(json_data[[person]]))
}
#PROBLEM CODE, SLOW, 1/2 million records in people_list
people_data <- rbind.fill(lapply(people_list, as.data.frame))
output_file <- paste("people_data",".csv")
write.csv(people_data, file=output_file)
If you don't expect duplicate records to exist, you can use rbind.fill
from the plyr
package.
I assume that you know all names for fields. Here is solution.
Example data
data_list <- list(list(person_id = "Amy123", name = "Amy", gender = "F",hair_color = "brown"), list(person_id = "matt53", name = "Matt",location = list(c("zip_code"=47231, "state"="IN")),gender = "M", height = 172))
Names for example records
nm1 <- names(unlist(data_list[[1]])) nm2 <- names(unlist(data_list[[2]])) nm <- c(nm2,nm1[!nm1 %in% nm2])
Code of program
record_template <- vector(mode="character",length=length(nm))
names(record_template) <- nm
record_template <- "NA"
rec <- function(x) {
tmp <- record_template
tmp[names(unlist(x))] <- unlist(x)
return(tmp)
}
do.call(rbind,lapply(data_list,rec))
EDIT
You can get names from list using:
nm <- unique(unlist(lapply(data_list,function(x) names(unlist(x))),use.names=F))
This should be a comment but ldply might be more efficient once you get your unlisted person objects back in a list.
For example, once you do this:
for(person in 1:n_people) {
people_list[[person]] <- t(unlist(json_data[[person]]))
}
you can just use ldply like so:
people_data <- ldply(people_list,function(x) as.data.frame)
Now this might not be that much faster, but you can parallelize this efficiently like so:
people_data <- ldply(people_list,function(x) as.data.frame,.parallel=TRUE)
Which might give you more speed gains
精彩评论