Using grep/gsub To Find First Colon Only
I have a long file that consists of only one column.
This column contains gene names followed by a colon (:) and then the name of a microRNA fragment. Unfortunately, the microRNA name may also contain a colon (:). I want to replace only the first colon with a tab (\t) and then use write.table to produce two columns in R.
Here is a representative sample of one gene name with multiple microRNAs:
CHD5:miR-329/362-3p:2
CHD5:miR-329/362-3p:1
CHD5:miR-30a/30a-5p/30b/30b-5p/30cde/384-5p
CHD5:miR-15/16/195/424/497
CHD5:miR-26ab/1297
CHD5:miR-17-5p/20/93.mr/106/519.d
CHD5:miR-130/301
CHD5:miR-19
CHD5:miR-204/211
Any suggestions?
Maybe use sub instead of gsub?
Here's a slightly more complete example if you have your 'inFile' and want 'outFile'...
lines <- readLines('inFile')
# sub() replaces only the first match on each line
lines <- sub(':', '\t', lines)
writeLines(lines, 'outFile')
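Since the question mentions write.table, here is a minimal sketch of an alternative that splits each line at the first colon and writes two tab-separated columns. The sample vector, the column names gene and mirna, and the temporary output file are placeholders for illustration, and the sketch assumes every line contains at least one colon.

```r
# Sample input; in practice this would come from readLines('inFile')
lines <- c("CHD5:miR-329/362-3p:2", "CHD5:miR-19")

# Position of the first colon in each line (assumes one is always present)
pos <- as.integer(regexpr(":", lines, fixed = TRUE))

# Everything before the first colon, and everything after it
gene  <- substr(lines, 1, pos - 1)
mirna <- substr(lines, pos + 1, nchar(lines))

df <- data.frame(gene = gene, mirna = mirna, stringsAsFactors = FALSE)

# Hypothetical output path; replace with your own 'outFile'
out_file <- tempfile(fileext = ".tsv")
write.table(df, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Any later colons stay inside the second column, so a name such as miR-329/362-3p:2 is kept intact.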
If x is your column or vector:
sub(":", "\t", x)
See ?sub, which says:
‘sub’ and ‘gsub’ perform replacement of the first and all matches respectively.
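To make the difference concrete, here is a small sketch that runs both functions on two of the sample lines from the question. The variable names are only for illustration, and fixed = TRUE is an optional choice that treats the colon as a literal string rather than a regular expression.

```r
x <- c("CHD5:miR-329/362-3p:2", "CHD5:miR-19")

# sub() replaces only the first colon in each element
first_only <- sub(":", "\t", x, fixed = TRUE)

# gsub() replaces every colon in each element
all_matches <- gsub(":", "\t", x, fixed = TRUE)

print(first_only)
print(all_matches)
```

With sub, the trailing :2 in the first element is left alone; with gsub it would also become a tab, which is the behavior the question wants to avoid.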
If you are all right with using sed, you can do the following (assuming your data is in a file named data.txt).
sed 's/\([^:]\):/\1 /' data.txt
That space after the \1 is really a tab. To insert it in my shell, I needed to press Ctrl-v, then <tab>.
Here's my result after running the command:
CHD5 miR-329/362-3p:2
CHD5 miR-329/362-3p:1
CHD5 miR-30a/30a-5p/30b/30b-5p/30cde/384-5p
CHD5 miR-15/16/195/424/497
CHD5 miR-26ab/1297
CHD5 miR-17-5p/20/93.mr/106/519.d
CHD5 miR-130/301
CHD5 miR-19
CHD5 miR-204/211