Bash or Python for extracting blocks from text files [closed]
I have a huge text file, which is structured as:
SEPARATOR
STRING1
(arbitrary number of lines)
SEPARATOR
...
SEPARATOR
STRING2
(arbitrary number of lines)
SEPARATOR
SEPARATOR
STRING3
(arbitrary number of lines)
SEPARATOR
....
The only things that change between the different "blocks" of the file are the STRING and the content between the separators. I need a script in bash or Python which, given a STRING_i as input, outputs a file containing
SEPARATOR
STRING_i
(number of lines for this string)
SEPARATOR
What is the best approach here: bash or Python? Another option? It must also be fast.
Thanks
In Python 2.6 or better:
def doit(inf, ouf, thestring, separator='SEPARATOR\n'):
    thestring += '\n'
    for line in inf:
        # here we're always at the start-of-block separator
        assert line == separator
        blockid = next(inf)
        if blockid == thestring:
            # found the block of interest; use enumerate to count its lines
            for c, line in enumerate(inf):
                if line == separator: break
            assert line == separator
            # emit results and terminate the function
            ouf.writelines((separator, thestring, '(%d)\n' % c, separator))
            inf.close()
            ouf.close()
            return
        # uninteresting block, just skip it
        for line in inf:
            if line == separator: break
In older Python versions you can do almost the same, but change the line blockid = next(inf) to blockid = inf.next().
The assumptions here are that the input and output files are opened by the caller (which also passes in the interesting value of thestring, and optionally separator), but it's this function's job to close them (e.g. for maximum ease of use as a pipeline filter, with inf of sys.stdin and ouf of sys.stdout); easy to tweak if needed, of course.
Removing the asserts will speed it up microscopically, but I like their "sanity checking" role (and they may also help in understanding the logic of the code flow).
Key to this approach is that a file is an iterator (of lines), and iterators can be advanced in multiple places (so we can have multiple for statements, or specific "advance the iterator" calls such as next(inf), and they cooperate properly).
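The shared-iterator idea is easy to see in miniature. Here is a minimal sketch of the same technique, using an in-memory list of lines as a stand-in for the open file; count_block and the sample data are illustrative names, not part of the answer above:

```python
def count_block(lines_iter, wanted, separator='SEPARATOR\n'):
    # Advance one shared iterator from several loops, exactly as in doit():
    # the outer loop lands on each start-of-block separator, next() grabs
    # the block id, and the inner loops consume the rest of the block.
    wanted += '\n'
    for line in lines_iter:
        assert line == separator          # start-of-block separator
        blockid = next(lines_iter)
        if blockid == wanted:
            # count content lines until the closing separator
            for c, line in enumerate(lines_iter):
                if line == separator:
                    return c              # c == number of content lines
        else:
            # skip the rest of this uninteresting block
            for line in lines_iter:
                if line == separator:
                    break
    return None                           # wanted block not found

sample = ['SEPARATOR\n', 'STRING1\n', 'a\n', 'b\n', 'SEPARATOR\n',
          'SEPARATOR\n', 'STRING2\n', 'x\n', 'y\n', 'z\n', 'SEPARATOR\n']
print(count_block(iter(sample), 'STRING2'))  # -> 3
```

Note that each inner loop picks up where the previous one left off, because all of them pull from the same iterator; no line is read twice and nothing is buffered.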
I would use Python and write something similar to this:
import sys

SEPARATOR = "SEPARATOR"

counter = 0
counting = False
f = open("file", "r")
for line in f:
    line = line.rstrip("\n")
    if counting and line == SEPARATOR:
        break
    if counting:
        counter += 1
    if not counting and line == sys.argv[1]:
        counting = True
f.close()
print SEPARATOR
print sys.argv[1]
print counter
print SEPARATOR
If you want this to be fast, you need to avoid reading the entire file to find the block of data you need.
- Read over the file once and store an index of (a) the byte offset of the start of each STRING_i's block and (b) the block's length in bytes, i.e. the distance to the next SEPARATOR. You can store this index in a separate file or in a "header" of the current file.
- For each STRING_i query, read in the index. If STRING_i is in the index, file.seek( start_byte_location ), file.read( length ), and parse the result with any of the procedures above # like @gruszczy's doit() but w/o loop
Don't go overboard with the index: use a dict of STRING_I -> (location, length), and just simplejson / pickle it out to a file.
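The two steps above can be sketched as follows; this is a rough illustration under the block layout shown in the question, and build_index / read_block are hypothetical names, not an existing API:

```python
def build_index(path, separator=b'SEPARATOR\n'):
    # One pass over the file: map each STRING to
    # (byte offset of its opening SEPARATOR, block length in bytes).
    index = {}
    with open(path, 'rb') as f:
        offset = f.tell()
        line = f.readline()
        while line:
            if line == separator:
                start = offset
                name = f.readline().rstrip(b'\n').decode()
                # scan forward to the closing separator
                line = f.readline()
                while line and line != separator:
                    line = f.readline()
                index[name] = (start, f.tell() - start)
            offset = f.tell()
            line = f.readline()
    return index

def read_block(path, index, name):
    # Seek straight to the block instead of rescanning the whole file.
    start, length = index[name]
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(length).decode()
```

After the one-time indexing pass, each query costs a single seek plus one read of exactly the block's length, so lookup time no longer grows with the size of the file.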
You can use (g)awk, which is a relatively fast tool for processing files (RT is a gawk extension):
read -p "Enter input: " input
awk -vinput="$input" -vRS="SEPARATOR" '$0~input{ printf RT; print $0; printf RT }' file