Getting one line in a huge file with bash

How can I get a particular line in a 3 GB text file? All the lines:

  • have the same length, and
  • are delimited by \n.

And I need to be able to get any line on demand.

How can this be done? Only one line need be returned.


If all the lines have the same length, the best way by far will be to use dd(1) and give it a skip parameter.

Let the block size be the length of each line (including the newline), then you can do:

$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null

The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.

That should be much more efficient than streaming every preceding line through a program only to throw it away: dd seeks directly to the requested position in the file and reads just one line of data.
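If you don't know the line length offhand, the two placeholders in the dd command can be filled in programmatically. The following is a minimal Python sketch, not part of the original answer; the function name `dd_args` is made up for illustration, and it assumes every line really does have the same length as the first one.

```python
import os

def dd_args(path, line_no):
    """Build the dd argument list that extracts 1-based line `line_no`.

    Hypothetical helper: assumes all lines are as long as the first line.
    """
    with open(path, "rb") as f:
        bs = len(f.readline())  # line length in bytes, newline included
    if bs == 0:
        raise ValueError("file is empty")
    # Cheap sanity check: a fixed-length file is a whole number of records.
    if os.path.getsize(path) % bs != 0:
        raise ValueError("file size is not a multiple of the line length")
    return ["dd", f"if={path}", f"bs={bs}", f"skip={line_no - 1}", "count=1"]
```

The resulting list could be handed to something like `subprocess.run(args, stderr=subprocess.DEVNULL)`, which plays the same role as the `2>/dev/null` redirection above.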


head -10 file | tail -1 returns line 10, though it is probably slow on a large file.

Alternatively, with sed:

# print line number 52 
sed -n '52p' # method 1 
sed '52!d' # method 2 
sed '52q;d' # method 3, efficient on large files


An awk alternative, where 3 is the line number.

awk 'NR == 3 {print; exit}' file.txt
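What makes `sed '52q;d'` and the awk version with `exit` tolerable on large files is that they stop reading as soon as the wanted line has been seen. A rough Python equivalent of that early-exit streaming approach is sketched below; `nth_line` is a name chosen here for illustration. It still has to read every preceding line, so it works for variable-length lines but is slower than seeking.

```python
from itertools import islice

def nth_line(path, n):
    """Return 1-based line n as bytes (newline included), or None if the
    file has fewer than n lines. Reading stops once line n is reached."""
    with open(path, "rb") as f:
        # islice consumes lines lazily and stops after yielding line n.
        return next(islice(f, n - 1, n), None)
```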


If it's not a fixed-record-length file and you don't do some sort of indexing on the line starts, your best bet is to just use:

head -n N filespec | tail -1

where N is the line number you want.

Unfortunately, this isn't going to be the best-performing approach for a 3 GB file, but there are ways to make it better.

If the file doesn't change too often, you may want to consider indexing it. By that I mean having another file with the line offsets in it as fixed length records.

So the file:

0000000000
0000000017
0000000092
0000001023

would give you a fast way to locate each line. Just multiply the desired line number minus one by the index record size and seek to that position in the index file.

Then use the value at that location to seek in the main file so you can read until the next newline character.

So for line 3, you would seek to 22 in the index file: (3 - 1) times the index record length of 11, which is 10 digits plus one more for the newline. Reading the value there, 0000000092, would give you the offset to use into the main file.

Of course, that's not so useful if the file changes frequently although, if you can control what happens when things get appended, you can still add offsets to the index efficiently. If you don't control that, you'll have to re-index whenever the last-modified date of the index is earlier than that of the main file.
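The indexing scheme described above might look roughly like this in Python. It is only a sketch under stated assumptions: the function names and the `RECORD` constant are invented for illustration, line numbers are 1-based, and the 10-digit records match the example index (enough for offsets below 10^10 bytes).

```python
import os

RECORD = 11  # 10 digits plus a newline, as in the example index

def build_index(data_path, index_path):
    """Write one fixed-length record per line holding that line's byte offset."""
    offset = 0
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        for line in data:
            index.write(b"%010d\n" % offset)
            offset += len(line)

def read_line(data_path, index_path, line_no):
    """Return 1-based line `line_no` of the data file, using the index."""
    with open(index_path, "rb") as index:
        index.seek((line_no - 1) * RECORD)
        record = index.read(RECORD)
    if len(record) < RECORD:
        raise IndexError("line number out of range")
    with open(data_path, "rb") as data:
        data.seek(int(record))   # int() ignores the trailing newline
        return data.readline()   # read up to the next newline

def index_is_stale(data_path, index_path):
    """True if the index is missing or older than the data file."""
    return (not os.path.exists(index_path)
            or os.path.getmtime(index_path) < os.path.getmtime(data_path))
```

A caller would typically check `index_is_stale` first and call `build_index` only when it returns true, so the full scan of the data file happens once per change rather than once per lookup.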


And, based on your update:

Update: If it matters, all the lines have the same length.

With that extra piece of information, you don't need the index - you can just seek immediately to the right location in the main file by multiplying the record length by the zero-based record number, that is, the line number minus one (assuming the values fit into your data types).

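So something like this Python sketch (recnum counts from zero, reclen includes the newline, and the file is opened in binary mode):

def getline(fhandle, reclen, recnum):
    # seek to the start of the requested record
    fhandle.seek(reclen * recnum)
    # read exactly one record and return it
    return fhandle.read(reclen)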


Use q with sed to make the search stop after the line has been printed.

sed -n '11723{p;q}' filename

Python (minimal error checking):

#!/usr/bin/env python3
import sys

# by Dennis Williamson - 2010-05-08
# for http://stackoverflow.com/questions/2794049/getting-one-line-in-a-huge-file-with-bash

# seeks the requested line in a file with a fixed line length

# Usage: ./lineseek.py LINE FILE

# Example: ./lineseek.py 11723 data.txt

EXIT_SUCCESS      = 0
EXIT_NOT_FOUND    = 1
EXIT_OPT_ERR      = 2
EXIT_FILE_ERR     = 3
EXIT_DATA_ERR     = 4

# could use a try block here
seekline = int(sys.argv[1])

file = sys.argv[2]

try:
    if file == '-':
        # stdin must be redirected from a regular file for seek() to work
        handle = sys.stdin.buffer
    else:
        # binary mode, so that tell() and seek() use plain byte offsets
        handle = open(file, 'rb')
except IOError as e:
    print("File Open Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

try:
    # measure the line length from the first line
    line = handle.readline()
    lineend = handle.tell()
    linelen = len(line)
except IOError as e:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

# it would be really weird if this happened
if lineend != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

# jump straight to the start of the requested (1-based) line
handle.seek(linelen * (seekline - 1))

try:
    line = handle.readline()
except IOError as e:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

# also triggers when the requested line is past the end of the file
if len(line) != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

# the line already ends with a newline, so write the bytes unchanged
sys.stdout.buffer.write(line)

Argument validation should be a lot better and there is room for many other improvements.
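As one possible direction for that validation, here is a small sketch using the standard argparse module. It is not part of the original script; `positive_int` and `parse_args` are names made up for this example, and it only covers rejecting non-numeric or non-positive line numbers and missing arguments.

```python
import argparse

def positive_int(text):
    """argparse type: accept only integers that are 1 or greater."""
    value = int(text)  # a ValueError here is reported by argparse as a usage error
    if value < 1:
        raise argparse.ArgumentTypeError("line number must be 1 or greater")
    return value

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Print one line of a file with a fixed line length.")
    parser.add_argument("line", type=positive_int, help="1-based line number")
    parser.add_argument("file", help="path to the file, or - for stdin")
    return parser.parse_args(argv)
```

On bad input argparse prints a usage message to stderr and exits with status 2, which happens to match the script's EXIT_OPT_ERR value.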


A quick Perl one-liner would work well for this too:

$ perl -ne 'if (YOURLINENUMBER..YOURLINENUMBER) {print $_; last;}' /path/to/your/file