Creating new data frames from a larger data frame using a list

I have a data frame that contains multiple data points for a large number of samples. Here is a shortened example with 3 samples each with 3 data points:

Assay       Genotype      Sample 
CCT6-002        G         sam1   
CCT6-007        G         sam1
CCT6-013        C         sam1 
CCT6-002        T         sam2   
CCT6-007        A         sam2
CCT6-013        T         sam2 
CCT6-002        T         sam3   
CCT6-007        A         sam3
CCT6-013        T         sam3 

To do my downstream analysis I would like to subset the data for each sample into an individual data frame. Since this is something that I will be doing with many data sets with changing sample names, I'd like to figure out an automated way of doing this so I don't need to edit my script each time with the list of new samples.

I would like my output to be a data frame for each sample with the same name as the sample. So with the example data above, the result should be 3 data frames with the names sam1, sam2, sam3. Each data frame would have 3 lines with the Assay and genotype data.

I am sorry if this is a very basic question but I'm a newbie and have been working on this for quite a while. Thanks!


The split command is the easiest way to turn this into a list of data.frame objects split on sample.

myList <- split(mydf, mydf$Sample)

The items can be accessed in the list by numeric indexing (e.g. myList[[1]]) or by the name of the unique item in the variable Sample (e.g. myList$sam1).
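As a minimal sketch of both access styles, here is the question's example rebuilt by hand. The name mydf is the one used in this answer; constructing it with data.frame() is only an assumption for illustration, since in practice it would be read from a file.

```r
# Rebuild the question's example table by hand (assumption: in real use
# mydf would come from read.table() or similar).
mydf <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-007", "CCT6-013"), times = 3),
  Genotype = c("G", "G", "C", "T", "A", "T", "T", "A", "T"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# One data frame per unique value of Sample, named after that value
myList <- split(mydf, mydf$Sample)

names(myList)      # the sample names
myList[[1]]        # first element, by position
myList$sam1        # the same element, by name
myList[["sam2"]]   # by a name held in a string, handy inside loops
```

Because the list names come from the Sample column itself, nothing in the script has to change when a new data set arrives with different sample names.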

The numeric indexing is obviously handy when you're going through a sequence, but you can still use the names for that as well.

 #get names of the unique items in sample
 nam <- unique(mydf$Sample)
 #as a test look at the first few rows of each of my data.frames
 for( i in nam) print( head(myList[[i]]) )
 #another way to access the columns of each data.frame is the with() statement
 for( i in nam) with(myList[[i]], print( Assay[1:2] ))

That's not necessarily the most efficient R syntax but hopefully it gets you farther along in actually using your list of data.frame objects.
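One more idiomatic alternative to the for loops, offered only as a sketch, is to apply a function over the list with lapply or sapply. The names firstAssays and rowCounts are hypothetical, and mydf is rebuilt by hand here so the snippet runs on its own.

```r
# Rebuild the question's example table by hand (assumption for illustration)
mydf <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-007", "CCT6-013"), times = 3),
  Genotype = c("G", "G", "C", "T", "A", "T", "T", "A", "T"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)
myList <- split(mydf, mydf$Sample)

# First two Assay values from each per-sample data frame, as a named list
firstAssays <- lapply(myList, function(d) head(d$Assay, 2))

# Number of rows in each per-sample data frame, as a named vector
rowCounts <- sapply(myList, nrow)

firstAssays
rowCounts
```

Both calls keep the sample names on their results, so the output can still be looked up by name.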

Now, that gives you what you asked for, but here's some advice on what you asked for. Don't do it. Just learn to properly access your data.frame object. You could just as easily skip making the list and go through all of the unique instances of Sample in your code, including saving them out as separate files. The advantage of that is that you can run lots of nifty vectorized commands on your intact data.frame across Sample that are much harder on the list. Just stick with your nice big data.frame.

Here are a couple of simple examples. Look at what I did above for just getting the first few lines of each of the separate data.frame objects in the list. Here's something similar just run on the big data.frame.

lapply( unique(mydf$Sample), function(x) print(head( mydf[ mydf$Sample == x,] )) )

How about something more meaningful? Let's say I want a count of each individual Genotype separated by Sample.

table( mydf$Genotype, mydf$Sample)

That's much easier than what you'd have to do with the big list. There are lots of functions like that you'll want to use on your intact data.frame, like tapply and aggregate. Even if you wanted to do something that seems like it might be easier with the data.frame broken up, like sorting within each Sample level, it's easier with the data.frame.

mydf[ order(mydf$Sample, mydf$Assay), ]

That will order by Sample and then by Assay nested within Sample.
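For the tapply and aggregate functions mentioned above, a minimal sketch on the intact data.frame might look like the following. The summary chosen (number of distinct genotypes per sample) and the names nDistinct and agg are assumptions for illustration, and mydf is rebuilt by hand so the snippet runs on its own.

```r
# Rebuild the question's example table by hand (assumption for illustration)
mydf <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-007", "CCT6-013"), times = 3),
  Genotype = c("G", "G", "C", "T", "A", "T", "T", "A", "T"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# Number of distinct genotypes in each sample, as a named vector
nDistinct <- tapply(mydf$Genotype, mydf$Sample,
                    function(g) length(unique(g)))

# The same summary returned as a data frame, using the formula interface
agg <- aggregate(Genotype ~ Sample, data = mydf,
                 FUN = function(g) length(unique(g)))

nDistinct
agg
```

Neither call needs the data split up first; the grouping variable does that work.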

When I started with R I thought that splitting up data.frame objects was the way to go and used it a lot. Since I've learned R better I never do that. I don't have a single bit of R code written after my first few weeks with R that splits a data.frame into a list. I'm not saying you should never do it. I'm just saying that it's relatively rare that you need it or that it's the best idea. You might want to post a query on here about your end goal and get some advice on that.
