Removing Parts of String With Sed

I have lines of data that look like this:

sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A342_ATPB_COFAR_9_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta
sp_A0A373_RK16_COFAR_8_+_contigs_full.fasta
sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta

How can I use sed to delete everything after the 4th column (_-separated) on each line, finally yielding:

sp_A0A342_ATPB_COFAR
sp_A0A342_ATPB_COFAR
sp_A0A373_RK16_COFAR
sp_A0A373_RK16_COFAR
sp_A0A4W3_SPEA_GEOSL


cut is a better fit.

cut -d_ -f 1-4 old_file

This simply means use _ as delimiter, and keep fields 1-4.
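A quick way to check this against lines from the question is sketched below. The here-document stands in for old_file, and the function name first_four is only an illustrative wrapper, not part of cut.

```shell
# Hypothetical helper: keep the first four underscore-separated fields
# of every line read from standard input.
first_four() {
    cut -d_ -f1-4
}

# Two sample lines from the question, fed in place of old_file.
first_four <<'EOF'
sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta
sp_A0A4W3_SPEA_GEOSL_15_-_contigs_full.fasta
EOF
# prints:
# sp_A0A342_ATPB_COFAR
# sp_A0A4W3_SPEA_GEOSL
```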

If you insist on sed:

sed 's/\(_[^_]*\)\{4\}$//'

The left-hand side matches exactly four repetitions of a group consisting of an underscore followed by zero or more non-underscores, after which we must be at the end of the line. All of this is replaced by nothing.
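Because the pattern is anchored at the end of the line, it removes the last four fields rather than keeping the first four. The two are the same only when every line has exactly four trailing fields, as in the sample data. A small sketch of the difference follows; strip_last_four is a made-up name, and the second input line is invented to show the effect.

```shell
# Hypothetical helper: drop the last four underscore-prefixed fields.
strip_last_four() {
    sed 's/\(_[^_]*\)\{4\}$//'
}

# Line from the question: four trailing fields, so four leading fields remain.
echo 'sp_A0A342_ATPB_COFAR_6_+_contigs_full.fasta' | strip_last_four
# prints: sp_A0A342_ATPB_COFAR

# Invented nine-field line: five leading fields remain, not four.
echo 'a_b_c_d_e_f_g_h_i' | strip_last_four
# prints: a_b_c_d_e
```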


sed -e 's/\([^_]*\)_\([^_]*\)_\([^_]*\)_\([^_]*\)_.*/\1_\2_\3_\4/' infile > outfile

Match "any number of not '_'", saving what was matched between \( and \), followed by '_'. Do this 4 times, then match anything for the rest of the line (to be ignored). Substitute with each of the matches separated by '_'.


Here's another possibility:

sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|'

where -E, like -r in GNU sed, turns on extended regular expressions for readability.
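As a rough check of how this version behaves, the sketch below wraps it in a function (keep_four is an illustrative name) and runs it on one line from the question and on an invented short line. Since the pattern requires four non-empty fields, a line with fewer fields does not match and is passed through unchanged.

```shell
# Hypothetical helper: keep the first four non-empty underscore-separated fields.
keep_four() {
    sed -E -e 's|^([^_]+(_[^_]+){3}).*$|\1|'
}

echo 'sp_A0A373_RK16_COFAR_10_-_contigs_full.fasta' | keep_four
# prints: sp_A0A373_RK16_COFAR

# Invented two-field line: no match, so sed prints it as-is.
echo 'a_b' | keep_four
# prints: a_b
```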

Just because you can do it in sed, though, doesn't mean you should. I like cut much much better for this.


AWK likes to play in the fields:

awk 'BEGIN{FS=OFS="_"}{print $1,$2,$3,$4}' inputfile

or, more generally:

awk -v count=4 'BEGIN{FS="_"}{sep="";for(i=1;i<=count;i++){printf "%s%s",sep,$i;sep=FS};printf "\n"}' inputfile


sed -e 's/_[0-9][0-9]*_[+-]_contigs_full\.fasta$//'

Still, the cut answer is probably faster and just generally better.


Yes, cut is way better, and yes, matching the back of each line is easier.

I finally got a match using the beginning of each line:

 sed -r 's/(([^_]*_){3}([^_]*)).*/\1/' oldFile > newFile