开发者

Combining tab delim files into a single file using R

I have several txt files with 3 columns in each files like this: file 1:

ProbeID X_Signal_intensity X_P-Value   
xxx         2.34          .89
xxx         6.45          .04 
xxx         1.09          .91  
xxx         5.87          .70
.            .            . 
.            .            .
.            .            .     

file 2:

ProbeID Y_Signal_intensity Y_P-Value   
xxx         1.4             .92
xxx         2.55            .14 
xxx         4.19            .16  
xxx         3.47            .80
.            .               . 
.            .               .
.            .               . 

file 3:

ProbeID Z_Signal_intensity Z_P-Value   
xxx         9.40             .82
xxx         1.55            .04 
xxx         3.19            .56  
xxx         2.47            .90
.            . 开发者_如何学Python              . 
.            .               .
.            .               . 

In all the above files the values of ProbeID column are identical but not the other columns.Now I want to combine the all the above files using a for-loop into a single file like this:

ProbeID X_intensity X_P-Value   Y_intensity Y_P-Value   Z_intensity Z_P-Value     
xxx      2.34          .89       1.4             .92     9.40            .82
xxx      6.45          .04       2.55            .14     1.55            .04
xxx      1.09          .91       4.19            .16     3.19            .56
xxx      5.87          .70       3.47            .80     2.47            .90

Please do help me.


Read in the files as given by Richie Cotton, but make sure you add the appropriate extra arguments in the apply call. For one, header=TRUE should probably be added.

file.names <- c("file X.txt", "file Y.txt", "file Z.txt")
file.list <- lapply(file.names, read.table, header=TRUE)

Then you'll probably need a merge_recurse from the reshape package :

require(reshape)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")

This will work for any given number of dataframes, provided it's not a billion of them. For more information on the arguments used, see the help page of ?merge.

CORRECTION : in merge_recurse, you have to use all.x and all.y as shown in the correction above. You can't just use the shortcut all or you'll get errors.

Small demonstration :

X2 <- data.frame(ProbeID=(2:4),Z2=4:6)
X1 <- data.frame(ProbeID=1:3,Z1=1:3)
X3 <- data.frame(ProbeID=1:3,Z3=7:9)
file.list <- list(X1,X2,X3)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")
> mynewframe
  ProbeID Z1 Z2 Z3
1       1  1 NA  7
2       2  2  4  8
3       3  3  5  9
4       4 NA  6 NA


Read in your files

filenames <- c("file X.txt", "file Y.txt", "file Z.txt")
data_list <- lapply(filenames, read.table)

Combine them into one big data frame

all_data <- do.call(cbind, data_list)

all_data <- do.call(merge, data_list, by = "ProbeID")

This gives a good lesson to "always concentrate when providing an answer". cbind isn't smart enough to do ID matching, and merge isn't smart enough to handle more than two data frames. Take a look at Joris's answer and use merge_recurse instead. Or forget what you thought you wanted and use my other answer below.


Actually, a better idea, rather than having many columns would be to have just 4 columns: ProbeID, Signal_intensity, P_value and Source_file.

data_list <- lapply(data_list, function(x) {
  colnames(x) <- c("ProbeID", "Signal_intensity", "P_value")
  x
})

all_data <- do.call(rbind, data_list)
all_data$Source_file <- rep(filenames, times = sapply(data_list, nrow))


My approach is to read the files into data.frames

see help(read.delim) for reading modes.

After you have your three data.frames you can use

total <- merge(dataframeA,dataframeB,by="ProbeID")

look here http://www.statmethods.net/management/merging.html for documentation.


I am going to throw another approach into the mix which uses Reduce

Reduce(function(...) merge(..., all = T), file.list)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜