
Efficiently splitting one file into several files by the value of a column

I have a tab-delimited text file that is very large. Many lines in the file have the same value for one of the columns in the file (call it column k). I want to separate this file into multiple files, putting entries with the same value of k in the same file. How can I do this? For example:

a foo
1 bar
c foo
2 bar
d foo

should be split into a file "foo" containing the entries "a foo" and "c foo" and "d foo" and a file called "bar" containing the entries "1 bar" and "2 bar".

How can I do this in either a shell script or in Python?

thanks.


I'm not sure how efficient it is, but the quick and easy way is to take advantage of the way file redirection works in awk:

awk '{ print >> $5 }' yourfile

That will append each line (unmodified) to a file named after the value in column 5. Change $5 to the column you want to split on, and add -F'\t' if your fields may contain spaces.
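
Since the question also asks for Python, here is a minimal sketch of the same append-by-key idea. The function name split_by_column and its parameters are made up for illustration. It assumes the key values are safe to use as file names and that the number of distinct keys stays below the OS limit on open files, because it keeps one handle open per key.

```python
import os
from contextlib import ExitStack


def split_by_column(path, key_index, out_dir=".", delimiter="\t"):
    """Append each line of `path` to a file named after its key column.

    key_index is zero-based. Returns the sorted list of keys seen.
    """
    handles = {}  # key value -> open output file
    # ExitStack closes every output file when the block exits.
    with ExitStack() as stack, open(path, encoding="utf-8") as src:
        for line in src:
            key = line.rstrip("\n").split(delimiter)[key_index]
            out = handles.get(key)
            if out is None:
                # Append mode, mirroring awk's ">>" redirection.
                out = stack.enter_context(
                    open(os.path.join(out_dir, key), "a", encoding="utf-8")
                )
                handles[key] = out
            out.write(line)
    return sorted(handles)
```

For the example in the question, split_by_column("yourfile", 1) would write the files "foo" and "bar" into the current directory.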


This should work per your spec:

awk '{outFile=$2; print $0 > outFile}' BigManegyFile

Hope this helps.


After running both of the awk commands above (and having awk error out), and seeing the request for a Python version, I wrote a small utility to split files based on keys.

Github repo: https://github.com/gstaubli/split_file_by_key

Background info: http://garrens.com/blog/2015/04/02/split-file-by-keys/

Awk error:

awk: 14 makes too many open files
 input record number 4555369, file part-r-00000
 source line number 1
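
The error above is what you get when one output file stays open per distinct key and the input has more keys than the system allows open at once. One possible way around it in Python, not necessarily what the linked utility does, is to cap the number of open handles and reopen files in append mode when needed. This is only a sketch: the names split_by_column_bounded and max_open are hypothetical, and it assumes key values are valid file names.

```python
import os
from collections import OrderedDict


def split_by_column_bounded(path, key_index, out_dir=".",
                            delimiter="\t", max_open=100):
    """Split `path` by key column while keeping at most `max_open`
    output files open at any time. Returns the sorted keys seen."""
    handles = OrderedDict()  # key -> open file, least recently used first
    seen = set()
    try:
        with open(path, encoding="utf-8") as src:
            for line in src:
                key = line.rstrip("\n").split(delimiter)[key_index]
                out = handles.pop(key, None)
                if out is None:
                    if len(handles) >= max_open:
                        # Close the least recently used file to free a slot.
                        _, oldest = handles.popitem(last=False)
                        oldest.close()
                    # Append mode, so a reopened file keeps earlier lines.
                    out = open(os.path.join(out_dir, key), "a",
                               encoding="utf-8")
                    seen.add(key)
                handles[key] = out  # re-insert as most recently used
                out.write(line)
    finally:
        for out in handles.values():
            out.close()
    return sorted(seen)
```

A smaller max_open means more reopening and slower runs. If the input can be sorted by the key column first, only one output file needs to be open at a time.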
