Bash or Python for extracting blocks from text files [closed]
I have a huge text file, which is structured as:
SEPARATOR
STRING1
(arbitrary number of lines)
SEPARATOR
...
SEPARATOR
STRING2
(arbitrary number of lines)
SEPARATOR
SEPARATOR
STRING3
(arbitrary number of lines)
SEPARATOR
....
The only things that change between the different "blocks" of the file are the STRING and the content between the separators. I need a script in bash or Python which, given a STRING_i as input, outputs a file containing
SEPARATOR
STRING_i
(number of lines for this string)
SEPARATOR
What is the best approach here: bash or Python? Another option? It must also be fast.
Thanks
In Python 2.6 or better:
def doit(inf, ouf, thestring, separator='SEPARATOR\n'):
    thestring += '\n'
    for line in inf:
        # here we're always at the start-of-block separator
        assert line == separator
        blockid = next(inf)
        if blockid == thestring:
            # found the block of interest; use enumerate to count its lines
            for c, line in enumerate(inf):
                if line == separator: break
            assert line == separator
            # emit results and terminate the function
            ouf.writelines((separator, thestring, '(%d)\n' % c, separator))
            inf.close()
            ouf.close()
            return
        # uninteresting block, just skip it
        for line in inf:
            if line == separator: break
In older Python versions you can do almost the same, but change the line blockid = next(inf) to blockid = inf.next().
The assumptions here are that the input and output files are opened by the caller (which also passes in the interesting value of thestring, and optionally separator), but it's this function's job to close them (e.g. for maximum ease of use as a pipeline filter, with inf of sys.stdin and ouf of sys.stdout); easy to tweak if needed, of course.
Removing the asserts will speed it up microscopically, but I like their "sanity checking" role (and they may also help in understanding the logic of the code flow).
Key to this approach is that a file is an iterator (of lines), and iterators can be advanced in multiple places (so we can have multiple for statements, or specific "advance the iterator" calls such as next(inf), and they cooperate properly).
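The shared-iterator idea is easy to see in miniature. Here is a minimal sketch of the same technique, using an in-memory list of lines as a stand-in for the open file; count_block and the sample data are illustrative names, not part of the answer above:

```python
def count_block(lines_iter, wanted, separator='SEPARATOR\n'):
    # Advance one shared iterator from several loops, exactly as in doit():
    # the outer loop lands on each start-of-block separator, next() grabs
    # the block id, and the inner loops consume the rest of the block.
    wanted += '\n'
    for line in lines_iter:
        assert line == separator          # start-of-block separator
        blockid = next(lines_iter)
        if blockid == wanted:
            # count content lines until the closing separator
            for c, line in enumerate(lines_iter):
                if line == separator:
                    return c              # c == number of content lines
        else:
            # skip the rest of this uninteresting block
            for line in lines_iter:
                if line == separator:
                    break
    return None                           # wanted block not found

sample = ['SEPARATOR\n', 'STRING1\n', 'a\n', 'b\n', 'SEPARATOR\n',
          'SEPARATOR\n', 'STRING2\n', 'x\n', 'y\n', 'z\n', 'SEPARATOR\n']
print(count_block(iter(sample), 'STRING2'))  # -> 3
```

Note that each inner loop picks up where the previous one left off, because all of them pull from the same iterator; no line is read twice and nothing is buffered.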
I would use Python and write something similar to this:
import sys

SEPARATOR = "SEPARATOR"

counter = 0
counting = False
f = open("file", "r")
for line in f:
    line = line.rstrip("\n")
    if counting and line == SEPARATOR:
        break
    if counting:
        counter += 1
    if not counting and line == sys.argv[1]:
        counting = True
f.close()
print SEPARATOR
print sys.argv[1]
print counter
print SEPARATOR
If you want this to be fast, you need to avoid reading the entire file to find the block of data you need.
- Read over the file once and store an index of (a) the byte offset of the start of each STRING_i's block and (b) the block's length in bytes, i.e. the distance to the next SEPARATOR. You can store this index in a separate file or in a "header" of the current file.
- For each STRING_i query, read in the index. If STRING_i is in the index, file.seek( start_byte_location ), file.read( length ), and parse the result with any of the procedures above # like @gruszczy's doit() but w/o loop
Don't go overboard with the index: use a dict of STRING_I -> (location, length), and just simplejson / pickle it out to a file.
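The two steps above can be sketched as follows; this is a rough illustration under the block layout shown in the question, and build_index / read_block are hypothetical names, not an existing API:

```python
def build_index(path, separator=b'SEPARATOR\n'):
    # One pass over the file: map each STRING to
    # (byte offset of its opening SEPARATOR, block length in bytes).
    index = {}
    with open(path, 'rb') as f:
        offset = f.tell()
        line = f.readline()
        while line:
            if line == separator:
                start = offset
                name = f.readline().rstrip(b'\n').decode()
                # scan forward to the closing separator
                line = f.readline()
                while line and line != separator:
                    line = f.readline()
                index[name] = (start, f.tell() - start)
            offset = f.tell()
            line = f.readline()
    return index

def read_block(path, index, name):
    # Seek straight to the block instead of rescanning the whole file.
    start, length = index[name]
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(length).decode()
```

After the one-time indexing pass, each query costs a single seek plus one read of exactly the block's length, so lookup time no longer grows with the size of the file.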
You can use (g)awk, which is a relatively fast tool for processing files (RT is a gawk extension):
read -p "Enter input: " input
awk -vinput="$input" -vRS="SEPARATOR" '$0~input{ printf RT; print $0; printf RT }' file