Using grep/gsub To Find First Colon Only

I have a long file which is written as ONLY one column.

This column contains gene names followed by a colon (:) then by the name of a microRNA fragment. Unfortunately, the microRNA name MAY ALSO contain a colon (:).

I want to replace ONLY the first colon with a tab (\t) and then use write.table to produce two columns in R.

Here is a representative sample of one gene name with multiple microRNAs:

CHD5:miR-329/362-3p:2
CHD5:miR-329/362-3p:1
CHD5:miR-30a/30a-5p/30b/30b-5p/30cde/384-5p
CHD5:miR-15/16/195/424/497
CHD5:miR-26ab/1297
CHD5:miR-17-5p/20/93.mr/106/519.d
CHD5:miR-130/301
CHD5:miR-19
CHD5:miR-204/211

Any suggestions?


Maybe use sub instead of gsub?


Here's a slightly more complete example if you have your 'inFile' and want 'outFile'...

lines <- readLines('inFile')
# sub() replaces only the first ':' on each line
lines <- sub(':', '\t', lines)
writeLines(lines, 'outFile')
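
If the goal is the two-column table from the question, another option is to split each line at the first colon and hand the pieces to write.table. This is only a minimal sketch: it assumes every line contains at least one colon, the column names `gene` and `mirna` are made up for illustration, and a temporary file stands in for the real output path.

```r
# Two sample lines copied from the question; in practice use readLines('inFile').
lines <- c("CHD5:miR-329/362-3p:2",
           "CHD5:miR-19")

# regexpr() reports the position of the first match only.
pos <- regexpr(":", lines, fixed = TRUE)

# Text before the first colon is the gene; text after it is the microRNA.
# (Assumption: every line contains at least one colon.)
tab <- data.frame(gene  = substr(lines, 1, pos - 1),
                  mirna = substr(lines, pos + 1, nchar(lines)),
                  stringsAsFactors = FALSE)

# Hypothetical output path; a temp file is used so the sketch runs anywhere.
out_file <- tempfile(fileext = ".txt")
write.table(tab, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

Any later colons stay inside the second column, because only the first match position is used for the split.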


If x is your column or vector:

sub(":", "\t", x)

See ?sub, which says

‘sub’ and ‘gsub’ perform replacement of the first and all matches respectively.
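
To see the difference on one of the lines from the question, here is a small sketch. The variable names are only illustrative, and `fixed = TRUE` is an optional addition that treats the colon as a literal string rather than a regular expression.

```r
x <- "CHD5:miR-329/362-3p:2"

# sub() replaces only the first match, so the second colon survives.
first_only <- sub(":", "\t", x, fixed = TRUE)

# gsub() replaces every match, which would also split the microRNA name.
all_matches <- gsub(":", "\t", x, fixed = TRUE)

print(first_only)
print(all_matches)
```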


If you are all right with using sed, you can do the following (assuming your data is in a file named data.txt).

sed 's/\([^:]\):/\1 /' data.txt

The whitespace after the \1 is really a literal tab character. To type it in my shell, I had to press Ctrl-v and then Tab.

Here's my result after running the command:

CHD5    miR-329/362-3p:2
CHD5    miR-329/362-3p:1
CHD5    miR-30a/30a-5p/30b/30b-5p/30cde/384-5p
CHD5    miR-15/16/195/424/497
CHD5    miR-26ab/1297
CHD5    miR-17-5p/20/93.mr/106/519.d
CHD5    miR-130/301
CHD5    miR-19
CHD5    miR-204/211