
sed: hold pattern and rearrange line

I am not sure if I can do this purely with sed:

I am trying to rearrange lines like this

GF:001,GF:00012,GF:01223<TAB>XXR
GF:001,GF:00012,GF:01223,GF:0666<TAB>XXXR3

to

GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3

Does anyone have any hints? Both the number of GF:XXXX entries per line and the length of each entry vary.

I am stuck with sed -n '/\(XX.*\)$/ { s/,/\t\1\n/ }' input, but I cannot refer back to the pattern matched in the address from inside the substitution. Any ideas? Cheers!

Update: I don't think this is possible using sed alone, so I used Perl instead:

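perl -e 'open(IN, "< file") or die "cannot open file: $!";
while (<IN>) {
    @a = split(/\t/);            # $a[0] = GF list, $a[1] = label (newline kept)
    @gos = split(/,/, $a[0]);    # one element per GF entry
    foreach (@gos) {
      print $_."\t".$a[1];
    }
}
close( IN );' > output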

But if anyone knows a way to solve this just with sed please post it here...


It can be done in sed, though I probably would use Perl (or Awk or Python) to do it.

I claim no elegance for this solution, but brute force and ignorance sometimes pay off. I created a file called, unoriginally, sed.script containing:

/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/\1<TAB>\3@@@@@\2<TAB>\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}

I ran it as:

sed -f sed.script input

where input contained the two lines shown in the question. It produced the output:

GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3

(I took the liberty of deliberately misinterpreting <TAB> to be a 5-character sequence instead of a single tab character; you can easily fix the answer to handle an actual tab character instead.)

Explanation of the sed script:

  • Find lines with more than one occurrence of GF:nnn separated by commas (we do not need to process lines that contain a single such occurrence). Do the rest of the script only on such lines. Anything else is passed through (printed) unchanged.
  • Create a label so we can branch back to it
  • Split the line into 3 remembered parts. The first part is the initial GF information; the second part is any other GF information; the third part is the field after the <TAB>. Replace this with the first field, <TAB>, third field, implausible marker pattern (@@@@@), second field, <TAB>, third field.
  • Copy the modified line to the hold space.
  • Delete the marker pattern to the end.
  • Print.
  • Swap the hold space into the pattern space.
  • Remove everything up to and including the marker pattern.
  • If any substitution has succeeded since the last branch (or since the line was read), go back to the redo label.
  • Delete what's left (it was printed already).
  • End of script block.

This is a simple loop that reduces the number of the patterns by one on each iteration.
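For input that uses a real tab character rather than the literal text <TAB>, the same script can be sketched as below. This is only an adaptation under assumptions: the name split_gf_sed is a hypothetical helper, the tab is injected through a shell variable so the script does not depend on sed understanding \t, and the label after the tab is assumed not to contain the marker @@@@@.

```shell
#!/bin/sh
# Assumption: the GF list and the label are separated by ONE real tab.
tab=$(printf '\t')

# split_gf_sed: hypothetical helper wrapping the script from this answer.
# The script is double-quoted so that ${tab} expands to a tab character;
# \( \) \1 and friends pass through the shell unchanged.
split_gf_sed() {
    sed "/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/\1${tab}\3@@@@@\2${tab}\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}"
}

# Usage: feed tab-separated lines on standard input.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | split_gf_sed
```

Lines that contain a single GF entry do not match the address and are printed unchanged by sed's default output.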


You can do it straightforwardly with awk:

$ awk '{gsub(/,/, "\t" $NF "\n");print}' input 

In this case, we just replace the comma by a tab concatenated with the last field (NF stores the number of fields of a record; $NF gets the NFth field) concatenated with a newline. Then, print the result.
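A quick way to try this on a sample line with real tabs is sketched below. The function name split_gf_awk is a hypothetical helper, and -F'\t' is an addition (not part of the one-liner above) so that a label containing spaces still counts as a single last field. It assumes the label contains no comma and no & character, since gsub would replace the comma and treats & in the replacement as the matched text.

```shell
#!/bin/sh
# split_gf_awk: hypothetical wrapper around the gsub approach.
# Every comma becomes: tab + last field + newline.
# Assumption: the label (last field) contains neither ',' nor '&'.
split_gf_awk() {
    awk -F'\t' '{ gsub(/,/, "\t" $NF "\n"); print }'
}

# Usage: one sample line from the question, with a real tab.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | split_gf_awk
```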

It can be solved with sed, too, in a way that is similar to, but IMHO a bit simpler than, Jonathan's solution (which is pretty sophisticated, I should remark).

sed -n '
:BEGIN
 h
 s/,.*<TAB>/<TAB>/
 p
 x
 s/^[^,]*,//
t BEGIN' input

Here, we define a label in the beginning of the script:

:BEGIN

Then we copy the content of the pattern space to the hold space:

h

Now, we replace everything from the first comma until the tab with only a tab:

 s/,.*<TAB>/<TAB>/

We print the result...

p

...and retrieve the content of the hold space:

x

We have just printed the first GF:XXX entry together with the final XXR field, so we remove that first entry and its comma from the restored line:

 s/^[^,]*,//

If a replacement is executed, we branch to the beginning of script:

t BEGIN

And everything is applied again to the same line, except that now this line does not have the first GF:XXX pattern anymore. OTOH, if no replacement is made, then the processing of the current line is done and we do not jump to the beginning anymore.
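The walkthrough above uses the literal text <TAB>. A version for real tab characters might look like the following sketch. The helper name split_gf_loop is hypothetical, the tab is passed in through a shell variable, and the label after the tab is assumed to contain no comma (otherwise the final substitution would keep looping on the label).

```shell
#!/bin/sh
# Assumption: one real tab separates the GF list from the label,
# and the label itself contains no comma.
tab=$(printf '\t')

# split_gf_loop: hypothetical wrapper around the loop described above.
split_gf_loop() {
    sed -n ":BEGIN
h
s/,.*${tab}/${tab}/
p
x
s/^[^,]*,//
t BEGIN"
}

# Usage: both sample lines from the question, with real tabs.
printf 'GF:001,GF:00012,GF:01223\tXXR\nGF:001,GF:00012,GF:01223,GF:0666\tXXXR3\n' | split_gf_loop
```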


If you don't strictly want sed, awk is good at doing this:

awk -F'\t|,' '{ i=1; do { printf("%s\t%s\n",$i,$NF); i++;}  while ( i<NF ); }' inputfile


Well, it took me 3 hours to do it:

sed -re ':a; s/(GF:[0-9]*[^,]*),([^<]*)(<TAB>[A-Z0-9]*)/\1\3\n\2\3/g;ta; ' file.txt


awk -F'[,\t]' '{for (i=1;i<NF;i++) print $i"\t"$NF}' file

Awk reads one line at a time (by default) and splits the line up into fields. I'm using -F to tell awk to separate the line into fields at each comma or tab. NF is the number of fields in the line, $i is the contents of field number i.
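To see how awk splits a line with this separator, a small sketch like the one below can help. The function name show_fields is hypothetical and exists only for illustration; it prints NF and then each field with its number.

```shell
#!/bin/sh
# show_fields: hypothetical helper that reveals awk's field splitting
# when fields are separated by a comma or a tab.
show_fields() {
    awk -F'[,\t]' '{
        printf "NF=%d\n", NF
        for (i = 1; i <= NF; i++)
            printf "$%d=%s\n", i, $i
    }'
}

# Usage: two GF entries plus the label give three fields.
printf 'GF:001,GF:00012\tXXR\n' | show_fields
```

For this sample line the last field ($NF) is the label XXR, which is why the loop in the one-liner stops at i<NF and pairs every earlier field with $NF.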
