
Any faster alternative to reading the nth line of a file?

I have to read a file from a particular line number, and I know the line number in advance, say n. I have been thinking of two ways:

1.  for i in range(n):
        fname.readline()
    k = fname.readline()
    print k

2.  i = 0
    for line in fname:
        dictionary[i] = line
        i = i + 1

But I want a faster alternative, as I might have to perform this on different files 20,000 times. Are there any better alternatives?

Also, are there other performance enhancements for simple looping? My code has nested loops.


If the files aren't too huge, the linecache module of the standard library is pretty good -- it lets you very directly ask for the Nth line of such-and-such file.
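For example (the filename and line number here are just placeholders; note that linecache.getline counts lines from 1 and returns an empty string, rather than raising, if the line does not exist):

import linecache

# linecache reads and caches the whole file in memory on first use,
# so repeated lookups into the same file are cheap.
print linecache.getline('somefile.txt', 5000).rstrip('\n')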

If the files are huge, I recommend something like (warning, untested code):

import cStringIO

def readlinenum(filepath, n, BUFSIZ=65536):
  # Read in large binary blocks, counting newlines with str.count, and only
  # split out individual lines once the block containing line n is reached.
  bufs = ['', '']   # bufs[1] holds the previous block ('' before the first read)
  previous_lines = lines_so_far = 0
  with open(filepath, 'rb') as f:
    while True:
      bufs[0] = f.read(BUFSIZ)
      if not bufs[0]:
        raise ValueError('File %s has only %d lines, not %d' %
                         (filepath, lines_so_far, n))
      lines_this_block = bufs[0].count('\n')
      updated_lines_count = lines_so_far + lines_this_block
      if n < updated_lines_count:
        break
      previous_lines = lines_so_far
      lines_so_far = updated_lines_count
      bufs[1] = bufs[0]
    if n == lines_so_far:
      # line may be split between the previous block and this one
      buf = bufs[1] + bufs[0]
      delta = n - previous_lines
    else:  # normal case: line n lies entirely within the current block
      buf = bufs[0]
      delta = n - lines_so_far
    f = cStringIO.StringIO(buf)
    for i, line in enumerate(f):
      if i == delta: break
    return line.rstrip()
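A hypothetical call, just for illustration (the filename is a placeholder; as written, n is counted from 0, like enumerate):

# untested, same caveats as the function above
print readlinenum('huge_log.txt', 123456)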

The general idea is to read in the file as binary, in large blocks (at least as large as the longest possible line) -- the processing (on Windows) from binary to "text" is costly on huge files -- and use the fast .count method of strings on most blocks. At the end we can do the line parsing on a single block (two at most in the anomalous case where the line being sought spans block boundaries).

This kind of code requires careful testing and checking (which I haven't performed in this case), being prone to off-by-one and other boundary errors, so I'd recommend it only for truly huge files -- ones that would essentially bust memory if using linecache (which just sucks up the whole file into memory instead of working by blocks). On a typical modern machine with 4 GB of RAM, for example, I'd start thinking about such techniques for text files that are over a GB or two.

Edit: a commenter does not believe that reading a file in binary mode is much faster than reading it with the processing required by text mode (which matters only on Windows). To show how wrong this is, let's use the 'U' ("universal newlines") option, which forces the line-end processing to happen on Unix machines too (as I don't have a Windows machine to run this on ;-). Using the usual kjv.txt file:

$ wc kjv.txt
  114150  821108 4834378 kjv.txt

(4.8 MB, 114 Klines) -- about 1/1000th of the kind of file sizes I was mentioning earlier:

$ python -mtimeit 'f=open("kjv.txt", "rb")' 'f.seek(0); f.read()'
100 loops, best of 3: 13.9 msec per loop
$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0); f.read()'
10 loops, best of 3: 39.2 msec per loop

i.e., just about exactly a factor of 3 cost for the line-end processing (this is on an old-ish laptop, but the ratio should be pretty repeatable elsewhere, too).

Reading by a loop on lines, of course, is even slower:

$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0)' 'for x in f: pass'
10 loops, best of 3: 54.6 msec per loop

and using readline, as the commenter mentioned (with less efficient buffering than directly looping on the file), is the slowest:

$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0); x=1' 'while x: x=f.readline()'
10 loops, best of 3: 81.1 msec per loop

If, as the question mentions, there are 20,000 files to read (say they're all small-ish, on the order of this kjv.txt), the fastest approach (reading each file in binary mode in a single gulp) should take about 260 seconds, 4-5 minutes, while the slowest one (based on readline) should take about 1600 seconds, almost half an hour -- a pretty significant difference for many, I'd say most, actual applications.


Unless you know, or can figure out, the offset of line n in your file (for example, if every line were of a fixed width), you will have to read lines until you get to the nth one.

Regarding your examples:

  • xrange is faster than range, since range has to build an entire list in memory, whereas xrange produces its values lazily
  • if you only need line n, why are you storing all of the lines in a dictionary?
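If you do end up scanning forward line by line, a minimal sketch using itertools.islice (nth_line is my own helper name, not a library function; n is counted from 0) avoids both the range list and the dictionary of all lines:

from itertools import islice

def nth_line(filepath, n):
    # Skip the first n lines lazily, then return line n (counted from 0).
    with open(filepath, 'rU') as f:
        for line in islice(f, n, None):
            return line.rstrip('\n')
    raise ValueError('%s has fewer than %d lines' % (filepath, n + 1))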


Caching a list of offsets of every end-of-line character in the file would cost a lot of memory, but caching roughly one per memory page (generally 4KB) gives mostly the same reduction in I/O, and the cost of scanning a couple KB from a known offset is negligible. So, if your average line length is 40 characters, you only have to cache a list of every 100th end-of-line in the file. Exactly where you draw the line depends on how much memory you have and how fast your I/O is. You might even be able to get away with caching a list of the offsets of every 1000th end-of-line character without a noticeable difference in performance from indexing every single one.
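A rough sketch of that idea (untested, with names of my own choosing; it caches the byte offset of every 1000th line start, then seeks to the nearest cached offset and scans at most 1000 lines per lookup):

STRIDE = 1000  # cache the byte offset of every 1000th line start

def build_line_index(filepath, stride=STRIDE):
    # Accumulate byte offsets by summing line lengths; in binary mode
    # len(line) is the exact number of bytes consumed, which avoids
    # mixing f.tell() with iteration.
    offsets, pos, count = [], 0, 0
    with open(filepath, 'rb') as f:
        for line in f:
            if count % stride == 0:
                offsets.append(pos)   # line `count` starts here
            pos += len(line)
            count += 1
    return offsets

def line_via_index(filepath, n, offsets, stride=STRIDE):
    # Seek to the nearest indexed line at or before n, then scan forward
    # at most `stride` lines.
    if n // stride >= len(offsets):
        raise ValueError('%s has fewer than %d lines' % (filepath, n + 1))
    with open(filepath, 'rb') as f:
        f.seek(offsets[n // stride])
        for i, line in enumerate(f):
            if i == n % stride:
                return line.rstrip('\r\n')
    raise ValueError('%s has fewer than %d lines' % (filepath, n + 1))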
