Getting one line in a huge file with bash
How can I get a particular line in a 3 GB text file? All the lines have:
- the same length, and
- are delimited by \n.
And I need to be able to get any line on demand.
How can this be done? Only one line needs to be returned.
If all the lines have the same length, the best way by far will be to use dd(1)
and give it a skip parameter.
Let the block size be the length of each line (including the newline), then you can do:
$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null
The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.
That should be much more efficient than streaming every line before the one you want through a program that reads them all and throws them away, as dd will seek directly to the position you want in the file and read only one line of data.
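A quick way to sanity-check the approach (the file name and contents here are made up for the demo):

```shell
# Build a demo file of five fixed-length lines; "line-0001\n" is 10 bytes.
printf 'line-%04d\n' 1 2 3 4 5 > records.txt
# Fetch line 3: skip the first two 10-byte blocks, read exactly one block.
dd if=records.txt bs=10 skip=2 count=1 2>/dev/null
# -> line-0003
```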
head -10 file | tail -1
returns line 10, though it is probably slow.
# print line number 52
sed -n '52p' file    # method 1
sed '52!d' file      # method 2
sed '52q;d' file     # method 3, efficient on large files
An awk alternative, where 3 is the line number.
awk 'NR == 3 {print; exit}' file.txt
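Both one-liners are easy to verify on a throwaway file (seq, the file name, and the line number 52 are just for the demo):

```shell
seq 100 > nums.txt                      # 100 numbered lines
sed '52q;d' nums.txt                    # prints 52
awk 'NR == 52 {print; exit}' nums.txt   # prints 52 as well
```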
If it's not a fixed-record-length file and you don't do some sort of indexing on the line starts, your best bet is to just use:
head -n N filespec | tail -1
where N is the line number you want.
This isn't going to be the best-performing piece of code for a 3 GB file, unfortunately, but there are ways to make it better.
If the file doesn't change too often, you may want to consider indexing it. By that I mean having another file with the line offsets in it as fixed length records.
So the file:
0000000000
0000000017
0000000092
0000001023
would give you a fast way to locate each line. Just multiply the desired line number, minus one, by the index record size and seek to that position in the index file.
Then use the value at that location to seek in the main file so you can read until the next newline character.
So for line 3, you would seek to 22 in the index file (the index record length is 10 characters plus one more for the newline). Reading the value there, 0000000092, would give you the offset to use into the main file.
Of course, that's not so useful if the file changes frequently although, if you can control what happens when things get appended, you can still add offsets to the index efficiently. If you don't control that, you'll have to re-index whenever the last-modified date of the index is earlier than that of the main file.
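A minimal sketch of that indexing scheme in Python (the function names and the 10-digit record format are illustrative assumptions, not part of the answer above):

```python
# build_index: write each line's byte offset as a zero-padded 10-digit
# record plus a newline, so every index record is exactly 11 bytes.
def build_index(datafile, indexfile):
    with open(datafile, 'rb') as data, open(indexfile, 'w') as idx:
        offset = 0
        for line in data:
            idx.write('%010d\n' % offset)
            offset += len(line)

# getline_indexed: seek straight to the index record for line `lineno`,
# read the stored offset, then seek to that offset in the main file.
def getline_indexed(datafile, indexfile, lineno):
    with open(indexfile, 'r') as idx:
        idx.seek((lineno - 1) * 11)    # 10 digits + 1 newline per record
        offset = int(idx.read(10))
    with open(datafile, 'rb') as data:
        data.seek(offset)
        return data.readline().decode()
```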
And, based on your update:
Update: If it matters, all the lines have the same length.
With that extra piece of information, you don't need the index - you can just seek immediately to the right location in the main file by multiplying the record length by the record number, minus one (assuming the values fit into your data types).
So something like this Python:
def getline(fhandle, reclen, recnum):
    fhandle.seek(reclen * (recnum - 1))   # record numbers are 1-based
    return fhandle.read(reclen)
Use q with sed to make the search stop after the line has been printed.
sed -n '11723{p;q}' filename
Python (minimal error checking):
#!/usr/bin/env python3
import sys

# by Dennis Williamson - 2010-05-08
# for http://stackoverflow.com/questions/2794049/getting-one-line-in-a-huge-file-with-bash
# seeks the requested line in a file with a fixed line length
# Usage: ./lineseek.py LINE FILE
# Example: ./lineseek.py 11723 data.txt

EXIT_SUCCESS = 0
EXIT_NOT_FOUND = 1
EXIT_OPT_ERR = 2
EXIT_FILE_ERR = 3
EXIT_DATA_ERR = 4

# could use a try block here
seekline = int(sys.argv[1])
filename = sys.argv[2]

# open in binary mode so seek() to a computed offset is well defined
try:
    if filename == '-':
        handle = sys.stdin.buffer   # seeking only works if stdin is a real file
    else:
        handle = open(filename, 'rb')
except IOError:
    print("File Open Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

# measure the length of the first line
try:
    line = handle.readline()
    lineend = handle.tell()
    linelen = len(line)
except IOError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

# it would be really weird if this happened
if lineend != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

# seek directly to the requested line and read it
handle.seek(linelen * (seekline - 1))
try:
    line = handle.readline()
except IOError:
    print("File I/O Error", file=sys.stderr)
    sys.exit(EXIT_FILE_ERR)

if len(line) != linelen:
    print("Line length inconsistent", file=sys.stderr)
    sys.exit(EXIT_DATA_ERR)

print(line.decode(), end='')
sys.exit(EXIT_SUCCESS)
Argument validation should be a lot better and there is room for many other improvements.
A quick Perl one-liner would work well for this too (substitute the desired line number for YOURLINENUMBER):
$ perl -ne 'if (YOURLINENUMBER..YOURLINENUMBER) {print $_; last;}' /path/to/your/file