Extract substructure from a text file using bash or python
I have a huge text file, which follows the structure:
SET
TAG1
...
...
SET
...
SET
TAG2
...
...
SET
...
...
I would like to extract, for a specific tag (e.g. TAG54), its individual "substructure", which would be
SET
TAG54
...
...
SET
Each substructure for a given TAG_i always contains:
first line: SET
second line: TAG_i (in this case TAG54)
an arbitrary number of lines
last line: SET
I wonder what would be the best way to do this, whether in bash or python, so for a given TAG, one can "extract" this substructure.
Thanks
Here's a Python approach: you pass in the open file handle as the first argument and the tag number as the second, and get back a list of the relevant lines (including newline characters), or an empty list if the tag is not found in the file:
def lookfor(f, tagnum):
    tag = 'TAG%s\n' % tagnum
    for line in f:
        if line == tag:
            break
    else:  # file finished, tag not found
        return []
    result = ['SET\n', tag]
    for line in f:
        result.append(line)
        if line == 'SET\n':
            break
    return result
This should be reasonably well-performing. If you want other forms of arguments and/or results, it shouldn't be hard to tweak accordingly, of course.
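As one example of such a tweak, here is a minimal sketch of a variant that takes the full tag string instead of a number and also checks that the line before the tag really is SET. The name extract_block and the sample data are made up for illustration; they are not part of the answer above.

```python
import io

def extract_block(lines, tag):
    """Return the lines of the first 'SET / tag / ... / SET' block, or []."""
    it = iter(lines)          # works for an open file or any iterable of lines
    prev = None
    for line in it:
        current = line.rstrip("\n")
        if current == tag and prev == "SET":
            break             # found the tag directly after a SET line
        prev = current
    else:                     # input exhausted, tag not found
        return []
    result = ["SET\n", tag + "\n"]
    for line in it:           # continue from the line after the tag
        result.append(line)
        if line.rstrip("\n") == "SET":
            break             # closing SET reached
    return result

# Hypothetical sample input standing in for the real file
sample = io.StringIO("SET\nTAG1\na\nSET\nb\nSET\nTAG54\nx\ny\nSET\nz\n")
block = extract_block(sample, "TAG54")
print("".join(block), end="")
```

Because it reads line by line and stops at the closing SET, it never holds more than the requested block in memory, which matters for a huge file.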
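If your system's grep supports -P (Perl-compatible regular expressions), note that grep matches one line at a time by default, so \n in a pattern never matches. With GNU grep, adding -z makes it treat the whole file as a single record (which means reading it all into memory), and -o prints only the matched block:
grep -Pzo '(?sm)^SET\nTAG54\n.*?^SET$' file.txt
With -z the output ends with a NUL byte instead of a newline; pipe it through tr -d '\0' if that matters.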
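A multi-line regular expression of this kind can also be tried out in Python with the re module. The sketch below is only illustrative: the function name extract_with_regex and the sample text are assumptions, and it reads the whole input as one string, so it suits files that fit in memory.

```python
import re

def extract_with_regex(text, tag):
    """Return the first 'SET / tag / ... / SET' block as a string, or None."""
    pattern = re.compile(
        r"^SET\n" + re.escape(tag) + r"\n.*?^SET$",
        re.MULTILINE | re.DOTALL,  # ^/$ match at line boundaries, . matches newlines
    )
    m = pattern.search(text)
    return m.group(0) if m else None

# Hypothetical sample input standing in for the real file contents
text = "SET\nTAG1\na\nSET\nb\nSET\nTAG54\nx\ny\nSET\nz\n"
match = extract_with_regex(text, "TAG54")
print(match)
```

The lazy .*? stops at the first line that consists only of SET, so the match does not run on into later blocks.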
gawk:
BEGIN {
    state=0
}
state==0 && $0=="TAG54" {
    print "SET"
    state=1
}
state==1 {
    print
}
state==1 && $0=="SET" {
    exit
}
csplit -f tags input.txt '%^TAG54$%-1' '/^SET$/+1' '%.*%' '{*}'
$ awk -vRS="SET" '/TAG54/{print RT$0RT}' file
SET
TAG54
...
...
SET
If you are doing this from a shell script, pass your shell variable to awk using -v, e.g.:
#!/bin/bash
read -r -p "what's your tag? " tag
# RT is gawk-specific: it holds the text that matched RS
awk -v RS="SET" -v tag="$tag" '$0 ~ tag {print RT $0 RT}' file