Adding a new column based on the column header

2023-02-16 01:36 问答作者：

I am working with R to convert several million lines of data in a single column like this.....

fixedStep chrom=chr7 start=10239 step=1
0.064
0.064
0.064
0.055
0.055
0.089
0.076
fixedStep chrom=chr7 start=10262 step=1
0.076
0.076
0.089
0.076
0.076
0.076
0.076
0.089
0.089
0.076
0.089
0.076
0.089
0.089
fixedStep chrom=chr7 start=10398 step=1
0.076
0.089
0.089
0.089
0.089
0.076

to this......

10239 0.064
10240 0.064
10241 0.064
10242 0.055
10243 0.055
10244 0.089
10245 0.076
10262 0.076
10263 0.076
10264 0.089
10265 0.076
10266 0.076
10267 0.076
10268 0.076
10269 0.089
10270 0.089
10271 0.076
10272 0.089
10273 0.076
10274 0.089
10275 0.089
10398 0.076
10399 0.089
10400 0.089
10401 0.089
10402 0.089
10403 0.076

i.e., I want to add a new column of numbers (either before or after the data, in the example above, it is before the data). The numbers of the new column start from start=value, and increments by 1 (step=1), until a new column header (fixedStep chrom=chr7 start=10262 step=1) is reached. When this happens the numbers start from start=new value, and again i开发者_C百科ncrements by 1 (step=1), until a new column header..and so on and so forth.

Since it is a large file, I cannot load the it into R workspace. It would be good to combine it with UNIX/linux tools to perform this operation.

Since you said unix...

#!/usr/bin/awk -f
/^fixedStep/ {
  i=int(substr($0,match($0,"start=")+6))
  d=int(substr($0,match($0,"step=")+5))
}
!/^f/ { print i, $0; i+=d }

What it does: on lines that start with "fixedStep" it finds the position of "start=", adds 6 (the length of "start="), takes the substring starting at that position and truncates it to an integer value and assigns it to i (in awk integers and strings are pretty interchangeable, and "12345 step=1" works fine as an integer with value 12345, but we want to print only the integer part later so it makes sense to truncate it here). Similarly for "step=".

On lines not starting with "f", it prints i and the line, and adds d to i.

You could use the function readLines to scan the file one by one. This way you don't have to load everything into the memory. One possibility would be to use following function.

Not concise, not fast (as you read line by line), but it gives you the dataframe without reading in the whole file. If you only want to generate a new file, by all means use something else but R. The awk solution given here seems the best shot at it.

ReadFile <- function(file){

  DF <- data.frame(ID=numeric(0),value=numeric(0))

   while(1){

    z <- readLines(file,1)
    if(length(z)==0 | z=="") {break}

    Start <- if(grepl("start",z))
      as.numeric(gsub(".+start=(\\d+).+","\\1",z))

    if(is.null(Start)){
        DF <- rbind(DF,
            data.frame(ID=ID,value=as.numeric(z))
        )
        ID <- ID + 1
    } else {
      ID <- Start
    }
  }
  return(DF)
}

Tested on your data :

ZZ <- textConnection("fixedStep chrom=chr7 start=10239 step=1
0.064
0.076
fixedStep chrom=chr7 start=10262 step=1
0.076
0.089
fixedStep chrom=chr7 start=10398 step=1
0.045
0.089
")

> ReadFile(ZZ)
     ID value
1 10239 0.064
2 10240 0.076
3 10262 0.076
4 10263 0.089
5 10398 0.045
6 10399 0.089

继续阅读：r

Adding a new column based on the column header

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？