Breaking up a character string into multiple character strings on different lines

2023-04-12 04:42 问答作者：

I have a data frame that contains a long character string each associated with a 'Sample':

Sample  Data
  1     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
  2     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N

I would like to code an easy way to break this string into 5 pieces in the following format:

Sample X
CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168

Giving an output that looks like this for each sample:

Sample 1
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N

I've been a开发者_StackOverflow中文版ble to use the substr function to break the long string into individual pieces but id like to able to automate it so I can get all 5 pieces in one output. Ideally this output would also be a data frame.

This is what ?read.fwf is for.

First some data which looks like your question:

x <- data.frame(Sample = c(1, 2), Data = c("000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N", 
"000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"), 
stringsAsFactors = FALSE)

Now use read.fwf, specify the widths of each field and their names, and that all should be of mode character. We wrap the text column of the example data in textConnection so that we can treat it like a connection understood generally by the read.* and other functions.

(strs <- read.fwf(textConnection(x$Data), widths = c(33, 35, 31, 31, 38), colClasses = "character", col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")))


                               CCT6                                GAT1                            IMD3                            PDR3                                  RIM15
1 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
2 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N

Now loop over the rows and print out each one as per your example:

for (i in 1:nrow(strs)) {
  writeLines(paste("Sample", i))
  writeLines(paste(names(strs), strs[i, ], sep = " - "))
}

Giving, for example:

Sample 2
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N

SampX <- textConnection("CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168")
dfSampX <-read.table(SampX, sep="-")
dfSampX$V4 <- as.numeric(sub("Characters ", "", dfSampX$V2))

sampdat <- read.table(textConnection("Sample  Data
  1     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
  2     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
"), header=TRUE,stringsAsFactors=FALSE)

This code will break into segments:

 apply(dfSampX[,c(3,4)], 1, function(x) substr(sampdat[,2], x["V4"], x["V3"]) )
     [,1]                                [,2]                                 
[1,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
[2,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
     [,3]                              [,4]                             
[1,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
[2,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
     [,5]                                    
[1,] "0000000000000000000N000000N0000000000N"
[2,] "0000000000000000000N000000N0000000000N"

This code would deliver the fragments in list format:

res <- lapply(sampdat$Data, function(x) 
           apply(dfSampX[,c(3,4)], 1, function(y) substr(x, y["V4"], y["V3"]) ))

res2 <- lapply(res, function(x){ names(x) <- dfSampX$V1 ; return(x)} )
res2

[[1]]
                                   CCT6                                     GAT1  
     "000000000000000000000000000N01000"    "000000000N0N000000000N00N0000NN00N0" 
                                   IMD3                                     PDR3  
       "N000000100000N00N0N0000000NNNN0"        "1111111111111111111111111111111" 
                                  RIM15  
"0000000000000000000N000000N0000000000N" 

[[2]]
                                   CCT6                                     GAT1  
     "000000000000000000000000000N01000"    "000000000N0N000000000N00N0000NN00N0" 
                                   IMD3                                     PDR3  
       "N000000100000N00N0N0000000NNNN0"        "1111111111111111111111111111111" 
                                  RIM15  
"0000000000000000000N000000N0000000000N"

And to get the specified output format:

 for (samp in seq_along(res2) ) { cat("Sample ", samp, "\n")
         invisible( sapply(1:5, function(y) 
            cat(as.character(dfSampX$V1[y]), " - ", res2[[samp]][y], "\n") ) ) }
Sample  1 
CCT6   -  000000000000000000000000000N01000 
GAT1   -  000000000N0N000000000N00N0000NN00N0 
IMD3   -  N000000100000N00N0N0000000NNNN0 
PDR3   -  1111111111111111111111111111111 
RIM15   -  0000000000000000000N000000N0000000000N 
Sample  2 
CCT6   -  000000000000000000000000000N01000 
GAT1   -  000000000N0N000000000N00N0000NN00N0 
IMD3   -  N000000100000N00N0N0000000NNNN0 
PDR3   -  1111111111111111111111111111111 
RIM15   -  0000000000000000000N000000N0000000000N

The invisible was needed to suppress the NULL returns from the list structure.

继续阅读：character dataframe

Breaking up a character string into multiple character strings on different lines

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？