sed: hold pattern and rearrange line
I am not sure if I can do this purely with sed:
I am trying to rearrange lines like this
GF:001,GF:00012,GF:01223<TAB>XXR
GF:001,GF:00012,GF:01223,GF:0666<TAB>XXXR3
to
GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3
Does anyone have any hints? The number of GF:XXXX entries varies from line to line, as does the length of each entry.
I am stuck with sed -n '
'/\(XX.*\)$/' {
s/,/\t\1\n/
}' input
but I cannot refer back to the pattern matched in the address from inside the substitution. Any ideas? Cheers!
Update: I think it is not possible to do this with just using sed. So I used perl to do this:
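perl -e 'open(IN, "<", "file") or die "cannot open file: $!";
while (<IN>) {
    @a = split(/\t/);            # $a[0]: comma-separated GF list; $a[1]: label, newline included
    @gos = split(/,/, $a[0]);
    foreach (@gos) {
        print $_."\t".$a[1];     # one output line per GF entry
    }
}
close(IN);' > output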
But if anyone knows a way to solve this with sed alone, please post it here...
It can be done in sed, though I probably would use Perl (or Awk or Python) to do it.
I claim no elegance for this solution, but brute force and ignorance sometimes pay off. I created a file called, unoriginally, sed.script containing:
/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/\1<TAB>\3@@@@@\2<TAB>\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}
I ran it as:
sed -f sed.script input
where input contained the two lines shown in the question. It produced the output:
GF:001<TAB>XXR
GF:00012<TAB>XXR
GF:01223<TAB>XXR
GF:001<TAB>XXXR3
GF:00012<TAB>XXXR3
GF:01223<TAB>XXXR3
GF:0666<TAB>XXXR3
(I took the liberty of deliberately misinterpreting <TAB> to be a 5-character sequence instead of a single tab character; you can easily fix the answer to handle an actual tab character instead.)
Explanation of the sed script:
- Find lines with more than one occurrence of GF:nnn separated by commas (we do not need to process lines that contain a single such occurrence). Do the rest of the script only on such lines. Anything else is passed through (printed) unchanged.
- Create a label so we can branch back to it.
- Split the line into 3 remembered parts. The first part is the initial GF information; the second part is any other GF information; the third part is the field after the <TAB>. Replace this with the first field, <TAB>, third field, implausible marker pattern (@@@@@), second field, <TAB>, third field.
- Copy the modified line to the hold space.
- Delete the marker pattern to the end.
- Print.
- Swap the hold space into the pattern space.
- Remove everything up to and including the marker pattern.
- If we've done any work, go back to the redo label.
- Delete what's left (it was printed already).
- End of script block.
This is a simple loop that reduces the number of the patterns by one on each iteration.
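For anyone who wants the same script to work on a real tab character, here is a minimal sketch. It assumes a POSIX shell; the function name split_gf_marker and the use of printf to produce the tab are my own choices, not part of the original answer. The sed commands themselves are unchanged apart from the tab.

```shell
#!/bin/sh
# Hypothetical wrapper (the name is an assumption, not from the answer above).
# Reads the named files, or standard input when no file is given.
split_gf_marker() {
    tab=$(printf '\t')   # a literal tab, so we do not rely on sed understanding \t
    sed -e "/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/{" \
        -e ':redo' \
        -e "s/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/\1${tab}\3@@@@@\2${tab}\3/" \
        -e 'h' \
        -e 's/@@@@@.*//' \
        -e 'p' \
        -e 'x' \
        -e 's/.*@@@@@//' \
        -e 't redo' \
        -e 'd' \
        -e '}' "$@"
}

# Demo on the first sample line from the question.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | split_gf_marker
```

Passing each command with its own -e keeps the label and the closing brace on separate script lines, which is what sed expects for them.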
You can do it straightforwardly with awk:
$ awk '{gsub(/,/, "\t" $NF "\n");print}' input
In this case, we just replace the comma by a tab concatenated with the last field (NF stores the number of fields of a record; $NF gets the NFth field) concatenated with a newline. Then, print the result.
It can be solved with sed, too, in a way that is similar to but IMHO a bit better than Jonathan's solution (which is pretty sophisticated, I should remark).
sed -n '
:BEGIN
h
s/,.*<TAB>/<TAB>/
p
x
s/^[^,]*,//
t BEGIN' input
Here, we define a label in the beginning of the script:
:BEGIN
Then we copy the content of the pattern space to the hold space:
h
Now, we replace everything from the first comma until the tab with only a tab:
s/,.*<TAB>/<TAB>/
We print the result...
p
...and retrieve the content of the hold space:
x
Since we printed the first line - which contains the first GF:XXX pattern followed by the final XXR pattern - we remove the first GF:XXX pattern from the line:
s/^[^,]*,//
If a replacement is executed, we branch to the beginning of script:
t BEGIN
And everything is applied again to the same line, except that now this line does not have the first GF:XXX pattern anymore. OTOH, if no replacement is made, then the processing of the current line is done and we do not jump to the beginning anymore.
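The script above matches the five characters <TAB> literally. If the input holds a real tab, something along these lines should work. This is only a sketch: the function name split_gf_loop and the printf trick for getting a tab into the script are assumptions on my part, chosen so it does not depend on sed interpreting \t.

```shell
#!/bin/sh
# Hypothetical wrapper (the name is an assumption); same loop as above, real tab.
# Reads the named files, or standard input when no file is given.
split_gf_loop() {
    tab=$(printf '\t')   # literal tab character
    sed -n \
        -e ':begin' \
        -e 'h' \
        -e "s/,.*${tab}/${tab}/" \
        -e 'p' \
        -e 'x' \
        -e 's/^[^,]*,//' \
        -e 't begin' "$@"
}

# Demo on the second sample line from the question.
printf 'GF:001,GF:00012,GF:01223,GF:0666\tXXXR3\n' | split_gf_loop
```

Note that the last field must not contain commas itself, since the loop keeps stripping everything up to the first comma for as long as one is left on the line.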
If you don't strictly want sed, awk is good at doing this:
awk -F'\t|,' '{ i=1; do { printf("%s\t%s\n",$i,$NF); i++;} while ( i<NF ); }' inputfile
Well, it took me 3 hours to do it:
sed -re ':a; s/(GF:[0-9]*[^,]*),([^<]*)(<TAB>[A-Z]*)/\1\3\n\2\3/g;ta; ' file.txt
awk -F'[,\t]' '{for (i=1;i<NF;i++) print $i"\t"$NF}' file
Awk reads one line at a time (by default) and splits the line up into fields. I'm using -F to tell awk to separate the line into fields at each comma or tab. NF is the number of fields in the line, $i is the contents of field number i.