Adding a new column based on the column header
I am working with R to convert several million lines of data in a single column like this.....
fixedStep chrom=chr7 start=10239 step=1
0.064
0.064
0.064
0.055
0.055
0.089
0.076
fixedStep chrom=chr7 start=10262 step=1
0.076
0.076
0.089
0.076
0.076
0.076
0.076
0.089
0.089
0.076
0.089
0.076
0.089
0.089
fixedStep chrom=chr7 start=10398 step=1
0.076
0.089
0.089
0.089
0.089
0.076
to this......
10239 0.064
10240 0.064
10241 0.064
10242 0.055
10243 0.055
10244 0.089
10245 0.076
10262 0.076
10263 0.076
10264 0.089
10265 0.076
10266 0.076
10267 0.076
10268 0.076
10269 0.089
10270 0.089
10271 0.076
10272 0.089
10273 0.076
10274 0.089
10275 0.089
10398 0.076
10399 0.089
10400 0.089
10401 0.089
10402 0.089
10403 0.076
i.e., I want to add a new column of numbers (either before or after the data, in the example above, it is before the data). The numbers of the new column start from start=value
, and increments by 1 (step=1
), until a new column header (fixedStep chrom=chr7 start=10262 step=1
) is reached. When this happens the numbers start from start=new value
, and again i开发者_C百科ncrements by 1 (step=1
), until a new column header..and so on and so forth.
Since it is a large file, I cannot load the it into R workspace. It would be good to combine it with UNIX/linux tools to perform this operation.
Since you said unix...
#!/usr/bin/awk -f
/^fixedStep/ {
i=int(substr($0,match($0,"start=")+6))
d=int(substr($0,match($0,"step=")+5))
}
!/^f/ { print i, $0; i+=d }
What it does: on lines that start with "fixedStep" it finds the position of "start=", adds 6 (the length of "start="), takes the substring starting at that position and truncates it to an integer value and assigns it to i
(in awk integers and strings are pretty interchangeable, and "12345 step=1" works fine as an integer with value 12345, but we want to print only the integer part later so it makes sense to truncate it here). Similarly for "step=".
On lines not starting with "f", it prints i
and the line, and adds d
to i
.
You could use the function readLines
to scan the file one by one. This way you don't have to load everything into the memory. One possibility would be to use following function.
Not concise, not fast (as you read line by line), but it gives you the dataframe without reading in the whole file. If you only want to generate a new file, by all means use something else but R. The awk solution given here seems the best shot at it.
ReadFile <- function(file){
DF <- data.frame(ID=numeric(0),value=numeric(0))
while(1){
z <- readLines(file,1)
if(length(z)==0 | z=="") {break}
Start <- if(grepl("start",z))
as.numeric(gsub(".+start=(\\d+).+","\\1",z))
if(is.null(Start)){
DF <- rbind(DF,
data.frame(ID=ID,value=as.numeric(z))
)
ID <- ID + 1
} else {
ID <- Start
}
}
return(DF)
}
Tested on your data :
ZZ <- textConnection("fixedStep chrom=chr7 start=10239 step=1
0.064
0.076
fixedStep chrom=chr7 start=10262 step=1
0.076
0.089
fixedStep chrom=chr7 start=10398 step=1
0.045
0.089
")
> ReadFile(ZZ)
ID value
1 10239 0.064
2 10240 0.076
3 10262 0.076
4 10263 0.089
5 10398 0.045
6 10399 0.089
精彩评论