Indexing a huge text file
I have one huge text file (over 100 gigs) with 6 columns of data (tab as separator). In the first column I have an integer value (2500 distinct values in the set). I need to split this file into multiple smaller files depending on the value in the first column (note that the rows are NOT sorted). Each of these smaller files will be used to prepare a plot in MATLAB.
I have only 8 GB of RAM.
The problem is how to do this efficiently. Any ideas?
Using bash:
# read the file line by line; append each line to a file named after its first field
while IFS= read -r line; do
    intval="$( echo "$line" | cut -f 1 )"
    chunkfile="$( printf '%010u.txt' "$intval" )"
    echo "$line" >> "$chunkfile"
done < 100gigfile
That will split your 100 GB file into (as you say) 2500 individual files named according to the value of the first field. You may have to adjust the format argument to printf to your taste.
one-liner with bash+awk:
awk '{ print $0 >> ($1 ".dat") }' 100gigfile
this will append every line of your large file to a file named after the first column's value plus a .dat extension; e.g. the line 12 aa bb cc dd ee ff will go to the file 12.dat.
On 64-bit Linux (I am not sure whether it works on Windows), you can mmap the file and copy blocks to the new files. I think this would be the most efficient way of doing it.
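A minimal Python sketch of that idea, assuming the source file is called 100gigfile and the output files are named after the first-column value with a .dat extension (both assumptions, carried over from the other answers):

import mmap

out_files = {}  # first-column value -> open output file
with open('100gigfile', 'rb') as src:
    mm = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
    line = mm.readline()
    while line:
        key = line.split(b'\t', 1)[0]
        fh = out_files.get(key)
        if fh is None:
            # lazily open one output file per distinct first-column value
            fh = open(key.decode() + '.dat', 'ab')
            out_files[key] = fh
        fh.write(line)
        line = mm.readline()
    mm.close()
for fh in out_files.values():
    fh.close()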
The most efficient way is to go block by block, opening all the files at once and reusing the read buffer for writing. Given the information provided, there is no other pattern in the data that could be used to speed things up.
Open each file on a different file descriptor to avoid opening and closing it for every line. Open them all at the beginning or lazily as you go, and close them all before finishing up. Most Linux distributions allow only 1024 open files by default, so you will have to raise the limit, say with ulimit -n 2600, given that you have permission to do so (see also /etc/security/limits.conf).
Allocate a buffer, say a couple of KB, and raw-read from the source file into it. Iterate through it, keeping a few control variables. Whenever you reach an end of line or the end of the buffer, write from the buffer to the correct file descriptor. There are a couple of edge cases to take into account, such as when a read ends partway through a line and you do not yet have enough of it to tell which file it should go into.
You could reverse-iterate to avoid processing the first few bytes of the buffer if you work out the minimum line size. This will prove a bit trickier but is a speedup nonetheless.
I wonder if non-blocking I/O takes care of problems such as this one.
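A rough Python sketch of this block-by-block scheme, including the partial line carried over between buffers; the file name 100gigfile, the 1 MiB buffer size and the .dat output naming are assumptions, not part of the original answer:

BUF_SIZE = 1 << 20          # 1 MiB read buffer
out_files = {}              # first-column value -> open output file
remainder = b''             # partial line carried over between reads

with open('100gigfile', 'rb') as src:
    while True:
        buf = src.read(BUF_SIZE)
        if not buf:
            break
        buf = remainder + buf
        last_nl = buf.rfind(b'\n')
        if last_nl == -1:
            # no complete line yet; keep everything for the next read
            remainder = buf
            continue
        # keep the trailing partial line for the next round
        remainder = buf[last_nl + 1:]
        for line in buf[:last_nl].split(b'\n'):
            key = line.split(b'\t', 1)[0]
            fh = out_files.get(key)
            if fh is None:
                fh = open(key.decode() + '.dat', 'ab')
                out_files[key] = fh
            fh.write(line + b'\n')

# flush a final line that lacked a trailing newline
if remainder:
    key = remainder.split(b'\t', 1)[0]
    fh = out_files.get(key)
    if fh is None:
        fh = open(key.decode() + '.dat', 'ab')
        out_files[key] = fh
    fh.write(remainder + b'\n')

for fh in out_files.values():
    fh.close()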
The obvious solution is to open a new file every time you encounter a new value, and keep it open until the end. But your OS might not allow you to open 2500 files at once. So if you only have to do this once, you might do it this way:
1. Go through the file, building a list of all the values. Sort this list. (You don't need this step if you know in advance what the values will be.)
2. Set StartIndex to 0.
3. Open, say, 100 files (whatever your OS is comfortable with). These correspond to the next 100 values in the list, from list[StartIndex] to list[StartIndex+99].
4. Go through the file, outputting those records with list[StartIndex] <= value <= list[StartIndex+99].
5. Close all the files.
6. Add 100 to StartIndex, and go back to step 3 if you haven't finished.
So with 2500 values and 100 files per pass, that is 25 splitting passes plus the initial listing pass: 26 passes through the file in all.
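A rough Python sketch of this multi-pass approach; the file name 100gigfile, the batch size of 100 and the .dat output naming are illustrative assumptions:

BATCH = 100   # however many files your OS is comfortable keeping open

# pass 1: collect and sort the distinct first-column values
values = set()
with open('100gigfile', 'r') as src:
    for line in src:
        values.add(line.split('\t', 1)[0])
values = sorted(values)

# subsequent passes: handle BATCH values per pass
for start in range(0, len(values), BATCH):
    batch = set(values[start:start + BATCH])
    outputs = {v: open(v + '.dat', 'w') for v in batch}
    with open('100gigfile', 'r') as src:
        for line in src:
            key = line.split('\t', 1)[0]
            if key in batch:
                outputs[key].write(line)
    for fh in outputs.values():
        fh.close()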
In your shell...
$ split -d -l <some number of lines> Foo Foo
That will split the large file Foo into pieces named Foo00 through FooNN (numeric suffixes, because of -d), where the number of pieces is determined by the number of lines in the original divided by the value you supply to -l. Iterate over the pieces in a loop...
EDIT... good point in the comment... this script (below) reads the file line by line, classifies each line on its first field, and appends it to the corresponding output file...
#!/usr/bin/env python
prefix = 'filename'
suffix = 0
files = {}  # first-field value -> open output file

# read one line at a time, classify on the first (tab-separated) field,
# and append the line to the file assigned to that value
with open('%s.csv' % prefix, 'r') as reader:
    for line in reader:
        key = line.split('\t', 1)[0].strip()
        fh = files.get(key)
        if fh is None:
            fh = open('%s%05i.csv' % (prefix, suffix), 'a')
            files[key] = fh
            suffix += 1
        fh.write(line)

for fh in files.values():
    fh.close()