Efficiently splitting one file into several files by the value of a column
I have a tab-delimited text file that is very large. Many lines in the file have the same value for one of the columns in the file (call it column k). I want to separate this file into multiple files, putting entries with the same value of k in the same file. How can I do this? For example:
a foo
1 bar
c foo
2 bar
d foo
should be split into a file "foo" containing the entries "a foo" and "c foo" and "d foo" and a file called "bar" containing the entries "1 bar" and "2 bar".
How can I do this in either a shell script or in Python?
thanks.
I'm not sure how efficient it is, but the quick and easy way is to take advantage of the way file redirection works in awk:
awk '{ print >> $5 }' yourfile
That will append each line (unmodified) to a file named after the value in column 5. Adjust as necessary; for the example in the question the key is the second column, so you would use $2.
This should work per your spec:
awk '{outFile=$2; print $0 > outFile}' BigManegyFile
Hope this helps.
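Since the question also asks about Python, here is a minimal sketch of the same idea: keep one open output file per distinct value and append each line to it. The function name `split_by_column`, the `out_dir` parameter and the 0-based `col` index are made up for illustration. It assumes every value in the key column is usable as a file name, and that the number of distinct values stays below the operating system's limit on open files.

```python
import os


def split_by_column(path, col, out_dir=".", sep="\t"):
    """Append each line of `path` to a file named after its value in column `col`.

    `col` is 0-based. Returns the sorted list of key values seen.
    Assumption: key values are valid file names and there are few enough
    of them that all output files can stay open at once.
    """
    handles = {}  # key value -> open output file
    try:
        # newline="" keeps the original line endings untouched
        with open(path, "r", newline="") as src:
            for line in src:
                key = line.rstrip("\r\n").split(sep)[col]
                out = handles.get(key)
                if out is None:
                    # Append mode, like `>>` in the awk one-liner
                    out = open(os.path.join(out_dir, key), "a", newline="")
                    handles[key] = out
                out.write(line)
    finally:
        for out in handles.values():
            out.close()
    return sorted(handles)
```

For the example in the question, `split_by_column("yourfile", 1)` would write the files "foo" and "bar" into the current directory. Because the files are opened in append mode, running it twice doubles their contents.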
After running both versions of the above awk commands (and having awk error out) and seeing the request for a Python version, I embarked on a short and not particularly arduous journey of writing a utility to easily split files based on keys.
Github repo: https://github.com/gstaubli/split_file_by_key
Background info: http://garrens.com/blog/2015/04/02/split-file-by-keys/
Awk error:
awk: 14 makes too many open files
input record number 4555369, file part-r-00000
source line number 1
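The "too many open files" error shows up when the input has more distinct key values than the process is allowed to hold open at once. One way around it, sketched here in Python as an illustration and not necessarily what the linked utility does, is to keep only a bounded number of output files open, close the least recently used one when the limit is reached, and reopen it in append mode if its key turns up again. The names `split_by_column_bounded` and `max_open` are hypothetical, and key values are assumed to be valid file names.

```python
import os
from collections import OrderedDict


def split_by_column_bounded(path, col, out_dir=".", sep="\t", max_open=100):
    """Split `path` by the value in column `col` (0-based), keeping at most
    `max_open` output files open at any time. Returns the sorted keys seen.
    """
    handles = OrderedDict()  # key -> open file, least recently used first
    seen = set()
    try:
        with open(path, "r", newline="") as src:
            for line in src:
                key = line.rstrip("\r\n").split(sep)[col]
                out = handles.get(key)
                if out is None:
                    if len(handles) >= max_open:
                        # Evict the least recently used handle to stay under the limit
                        _, oldest = handles.popitem(last=False)
                        oldest.close()
                    # Append mode, so a reopened file continues where it left off
                    out = open(os.path.join(out_dir, key), "a", newline="")
                    handles[key] = out
                    seen.add(key)
                else:
                    # Mark this key as most recently used
                    handles.move_to_end(key)
                out.write(line)
    finally:
        for out in handles.values():
            out.close()
    return sorted(seen)
```

This trades some speed for safety: input that alternates between many keys causes frequent reopening, so `max_open` is best set as high as the system limit comfortably allows.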