sed: hold pattern and rearrange line

2023-03-10 19:58 Q&A author:

I am not sure if I can do this purely with sed:

I am trying to rearrange lines like this:

GF:001,GF:00012,GF:01223<TAB>XXR
GF:001,GF:00012,GF:01223,GF:0666<TAB>XXXR3

into this:

GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3

Does anyone have any hints? The number of GF:XXXX entries per line varies, and so does the length of each GF:XXXX entry.

I am stuck with sed -n ' '/$XX.*$$/' { s/,/\t\1\n/ }' input but I cannot refer back to the originally matched pattern. Any ideas? Cheers!

Update: I think it is not possible to do this with sed alone, so I used Perl instead:

perl -e 'open(IN, "< file");
while (<IN>) {
    # $a[1] keeps its trailing newline, so print needs no explicit "\n"
    @a = split(/\t/);
    @gos = split(/,/, $a[0]);
    foreach (@gos) {
      print $_."\t".$a[1];
    }
}
close( IN );' > output


But if anyone knows a way to solve this just with sed please post it here...

It can be done in sed, though I probably would use Perl (or Awk or Python) to do it.

I claim no elegance for this solution, but brute force and ignorance sometimes pay off. I created a file called, unoriginally, sed.script containing:

/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/\1<TAB>\3@@@@@\2<TAB>\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}

I ran it as:

sed -f sed.script input

where input contained the two lines shown in the question. It produced the output:

GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3

(I took the liberty of deliberately misinterpreting <TAB> to be a 5-character sequence instead of a single tab character; you can easily fix the answer to handle an actual tab character instead.)

Explanation of the sed script:

1. Find lines with more than one occurrence of GF:nnn separated by commas (lines that contain a single such occurrence need no processing). Run the rest of the script only on such lines; anything else is passed through (printed) unchanged.
2. Create a label so we can branch back to it.
3. Split the line into 3 remembered parts: the first is the initial GF entry, the second is all the other GF entries, and the third is the field after the <TAB>. Replace the line with the first part, <TAB>, third part, an implausible marker pattern (@@@@@), second part, <TAB>, third part.
4. Copy the modified line to the hold space.
5. Delete everything from the marker pattern to the end of the line.
6. Print.
7. Swap the hold space into the pattern space.
8. Remove everything up to and including the marker pattern.
9. If a substitution has succeeded since the last input line was read or the last t branch was taken, go back to the redo label.
10. Delete what's left (it was printed already).
11. End of script block.

This is a simple loop that peels one GF entry off the line on each iteration.
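As a rough sketch of the "actual tab character" fix mentioned above, the same script can be wrapped in a shell function that substitutes a real tab for the literal text <TAB>. The function name split_gf_marker is made up for illustration, and the sketch assumes each line holds exactly one tab and never contains the marker @@@@@.

```shell
# Hypothetical wrapper (the name is an assumption, not from the answer).
# Same logic as sed.script, but with a real tab instead of the text <TAB>.
split_gf_marker() {
  tab=$(printf '\t')   # portable way to get a literal tab into the script
  sed "
/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/\1${tab}\3@@@@@\2${tab}\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}
" "$@"
}

# Example: feed one sample line on standard input.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | split_gf_marker
```

With no file argument the function reads standard input; with one, it behaves like sed -f sed.script input.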

You can do it straightforwardly with awk:

$ awk '{gsub(/,/, "\t" $NF "\n");print}' input

Here, every comma is replaced by a tab, followed by the last field, followed by a newline (NF stores the number of fields of a record, so $NF is the NFth, i.e. last, field). Then the result is printed.
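One caveat: with the default field separator, $NF is the last whitespace-separated word, so a last column containing spaces would be cut short. A minimal sketch that splits on tabs only, assuming the last column contains no commas and no & (which gsub treats specially in the replacement text); the function name split_gf_gsub is made up for illustration:

```shell
# Hypothetical wrapper (the name is an assumption).
# -F'\t' makes $NF the whole last column, even if it contains spaces.
split_gf_gsub() {
  awk -F'\t' '{ gsub(/,/, "\t" $NF "\n"); print }' "$@"
}

# Example: the last column "XX R" contains a space.
printf 'GF:001,GF:00012\tXX R\n' | split_gf_gsub
```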

It can be solved with sed, too, in a way that is similar to Jonathan's solution but IMHO a bit simpler (his is pretty sophisticated, I should remark).

sed -n '
:BEGIN
 h
 s/,.*<TAB>/<TAB>/
 p
 x
 s/^[^,]*,//
t BEGIN' input

Here, we define a label at the beginning of the script:

:BEGIN

Then we copy the content of the pattern space to the hold space:

 h

Now, we replace everything from the first comma until the tab with only a tab:

 s/,.*<TAB>/<TAB>/

We print the result...

 p

...and retrieve the content of the hold space:

 x

Since we printed the first line - which contains the first GF:XXX pattern followed by the final XXR pattern - we remove the first GF:XXX pattern from the line:

 s/^[^,]*,//

If a replacement is made, we branch back to the beginning of the script:

t BEGIN

And everything is applied again to the same line, except that now this line does not have the first GF:XXX pattern anymore. OTOH, if no replacement is made, then the processing of the current line is done and we do not jump to the beginning anymore.
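For input that uses a real tab rather than the literal text <TAB>, the loop above might be adapted as follows. This is only a sketch: the function name split_gf_loop is made up for illustration, and it assumes commas appear only before the tab.

```shell
# Hypothetical wrapper (the name is an assumption).
# Same loop as above, with a real tab in place of the text <TAB>.
split_gf_loop() {
  tab=$(printf '\t')   # literal tab, interpolated into the sed script
  sed -n "
:BEGIN
h
s/,.*${tab}/${tab}/
p
x
s/^[^,]*,//
t BEGIN
" "$@"
}

# Example: the two sample lines from the question, with real tabs.
printf 'GF:001,GF:00012,GF:01223\tXXR\nGF:001,GF:00012,GF:01223,GF:0666\tXXXR3\n' | split_gf_loop
```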

If you don't strictly want sed, awk is good at doing this:

awk -F'\t|,' '{ i=1; do { printf("%s\t%s\n",$i,$NF); i++;}  while ( i<NF ); }' inputfile

Well, it took me 3 hours to do it:

sed -re ':a; s/(GF:[0-9]*[^,]*),([^<]*)(<TAB>[A-Z0-9]*)/\1\3\n\2\3/g;ta; ' file.txt

awk -F'[,\t]' '{for (i=1;i<NF;i++) print $i"\t"$NF}' file

Awk reads one line at a time (by default) and splits the line up into fields. I'm using -F to tell awk to separate the line into fields at each comma or tab. NF is the number of fields in the line, $i is the contents of field number i.
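To see the splitting described above, a small sketch can print every field next to its index and NF. The function name show_fields is made up for illustration.

```shell
# Hypothetical helper (the name is an assumption).
# Prints "index/NF: field" for each field after splitting on commas and tabs.
show_fields() {
  awk -F'[,\t]' '{ for (i = 1; i <= NF; i++) printf "%d/%d: %s\n", i, NF, $i }' "$@"
}

# Example: two GF entries plus the last column give NF = 3.
printf 'GF:001,GF:00012\tXXR\n' | show_fields
```

The last field ($NF) is the column after the tab, which is why the loop in the answer stops at i<NF and appends $NF to every other field.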
