
Indexing a huge text file

I have one huge text file (over 100 GB) with 6 columns of data (tab as separator). The first column contains an integer value (2500 distinct values in the set). I need to split this file into multiple smaller files depending on the value in the first column (note that the rows are NOT sorted). Each of these smaller files will be used to prepare a plot in MATLAB.

I have only 8 GB of RAM.

The problem is how to do that efficiently. Any ideas?


Using bash:

# read each line, pull out the first tab-separated field,
# and append the line to that field's chunk file
while IFS= read -r line; do
  intval="$( echo "$line" | cut -f 1 )"
  chunkfile="$( printf '%010u.txt' "$intval" )"
  echo "$line" >> "$chunkfile"
done < 100gigfile

That will split your 100 gig file into (as you say) 2500 individual files, named according to the value of the first field. You may have to adjust the format argument to printf to your taste.


one-liner with bash+awk:

awk '{ print $0 >> ($1 ".dat") }' 100gigfile

This will append every line of your large file to a file named after the first column's value plus a ".dat" extension; e.g. the line "12 aa bb cc dd ee ff" will go into the file 12.dat.


On 64-bit Linux (I am not sure whether it works on Windows), you can mmap the file and copy blocks to new files. I think this would be the most efficient way of doing it.
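
A minimal Python sketch of the mmap idea, assuming the input path 100gigfile from the answers above and one "<value>.dat" output per first-column value (you would still need to raise the open-file limit for 2500 outputs):

import mmap

outputs = {}  # first-column value -> open output file

with open('100gigfile', 'rb') as src:
    mm = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
    start = 0
    while start < len(mm):
        end = mm.find(b'\n', start)
        if end == -1:
            end = len(mm)
        line = mm[start:end + 1]          # slicing copies only this line into memory
        key = line.split(b'\t', 1)[0]     # first tab-separated field
        out = outputs.get(key)
        if out is None:
            out = open(key.decode() + '.dat', 'ab')
            outputs[key] = out
        out.write(line)
        start = end + 1
    mm.close()

for out in outputs.values():
    out.close()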


The most efficient way will be block by block, opening all the files at once and re-using the read buffer for writing. Given only the information provided, there is no other pattern in the data that could be used to speed things up.

Open each output file on its own file descriptor to avoid opening and closing it for every line. Open them all at the beginning or lazily as you go, and close them all before finishing up. Most Linux distributions allow only 1024 open files by default, so you will have to raise the limit, say with ulimit -n 2600, given you have permission to do so (see also /etc/security/limits.conf).

Allocate a buffer, say a couple of KB, and do raw reads from the source file into it. Iterate through it, keeping control variables. Whenever you reach a newline or the end of the buffer, write from the buffer to the correct file descriptor. There are a couple of edge cases you'll have to take into account, such as when a read picks up a newline but not enough of the next line to figure out which file it should go into.

You could iterate in reverse to avoid processing the first few bytes of the buffer if you figure out the minimum line size. This will prove a bit trickier, but it is a speed-up nonetheless.
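
A rough Python sketch of that buffered, single-pass approach, again assuming the input path 100gigfile and that the open-file limit has already been raised for 2500 outputs; the carry variable handles the edge case of a read ending in the middle of a line:

BLOCK = 1 << 20               # 1 MiB read buffer
outputs = {}                  # first-column value -> output file, opened lazily
carry = b''                   # partial line left over from the previous block

with open('100gigfile', 'rb') as src:
    while True:
        block = src.read(BLOCK)
        if not block:
            break
        block = carry + block
        last_nl = block.rfind(b'\n')
        if last_nl == -1:             # no complete line yet, keep reading
            carry = block
            continue
        carry = block[last_nl + 1:]   # keep the trailing partial line
        for line in block[:last_nl].split(b'\n'):
            key = line.split(b'\t', 1)[0]
            out = outputs.get(key)
            if out is None:
                out = open(key.decode() + '.dat', 'ab')
                outputs[key] = out
            out.write(line + b'\n')

if carry:                             # final line without a trailing newline
    key = carry.split(b'\t', 1)[0]
    out = outputs.get(key)
    if out is None:
        out = open(key.decode() + '.dat', 'ab')
        outputs[key] = out
    out.write(carry + b'\n')

for out in outputs.values():
    out.close()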

I wonder if non-blocking I/O takes care of problems such as this one.


The obvious solution is to open a new file every time you encounter a new value, and keep it open until the end. But your OS might not allow you to open 2500 files at once. So if you only have to do this once, you might do it this way:

  1. Go through the file, building a list of all the values. Sort this list. (You don't need this step if you know in advance what the values will be.)
  2. Set StartIndex to 0.
  3. Open, say, 100 files (whatever your OS is comfortable with). These correspond to the next 100 values in the list, from list[StartIndex] to list[StartIndex+99].
  4. Go through the file, outputting those records with list[StartIndex] <= value <= list[StartIndex+99].
  5. Close all the files.
  6. Add 100 to StartIndex, and go to step 3 if you haven't finished.

So with 2500 values and 100 files per batch, you need 26 passes through the file: one to build the list of values, then 25 to write the output.
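
A sketch of this multi-pass approach in Python, reusing the 100gigfile name from the answers above and batches of 100 output files:

BATCH = 100

# Pass 1: collect the distinct first-column values (skip if known in advance).
values = set()
with open('100gigfile', 'rb') as src:
    for line in src:
        values.add(line.split(b'\t', 1)[0])
values = sorted(values)

# Remaining passes: handle up to BATCH values per pass.
for start in range(0, len(values), BATCH):
    batch = values[start:start + BATCH]
    outputs = {v: open(v.decode() + '.dat', 'wb') for v in batch}
    with open('100gigfile', 'rb') as src:
        for line in src:
            key = line.split(b'\t', 1)[0]
            out = outputs.get(key)
            if out is not None:       # ignore values outside this batch
                out.write(line)
    for out in outputs.values():
        out.close()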


In your shell...

$ split -d -l <some number of lines> Foo Foo

That will split a large file Foo into pieces named Foo00, Foo01, and so on, where the number of pieces is determined by the number of lines in the original divided by the value you supply to -l. Iterate over the pieces in a loop...

EDIT... good point in the comment... this script (below) will read line by line, classify each line by its first field, and append it to the corresponding file...

#!/usr/bin/env python
import csv

prefix = 'filename'
# the input is tab-separated, so tell the csv reader to split on tabs
reader = csv.reader(open('%s.csv' % prefix, 'r'), delimiter='\t')
suffix = 0
files = {}
# read one row at a time, classify on the first field, and send it to a file
for row in reader:
    key = row[0].strip()
    fh = files.get(key, None)
    if fh is None:
        fh = open('%s%05i.csv' % (prefix, suffix), 'a')
        files[key] = fh
        suffix += 1
    # csv.reader strips the line terminator, so add it back when writing
    fh.write('\t'.join(row) + '\n')

for fh in files.values():
    fh.close()