
How do I split a file into n parts?

I have a file containing some number of lines. I want to split it into n files with particular names. It doesn't matter how many lines are in each file; I just want a particular number of files (say 5). The problem is that the number of lines in the original file keeps changing, so I need to count the lines and then split the file into 5 parts. If possible, each part should be sent to a different directory.


In bash, you can use the split command to split a file based on the number of lines desired per part, and the wc command to work out how many lines that is. Here's wc combined with split in one line.

For example, to split onepiece.log into 5 parts:

    split -l$((`wc -l < onepiece.log`/5)) onepiece.log onepiece.split.log -da 4

This will create files like onepiece.split.log0000 ...

Note: bash division rounds down, so if there is a remainder there will be a 6th part file.
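The remainder effect mentioned in the note can be reproduced with a small sketch. The scratch directory and the file name demo.log are hypothetical, and it assumes a split that supports -d as used in the command above:

```shell
# Hypothetical demo file: 23 numbered lines in a scratch directory.
demo_dir=$(mktemp -d)
seq 23 > "$demo_dir/demo.log"

# 23 / 5 rounds down to 4 lines per part.
lines_per_part=$(( $(wc -l < "$demo_dir/demo.log") / 5 ))

# 4 lines per part cannot hold 23 lines in 5 files: 5 full parts plus 3 lines left over.
split -l "$lines_per_part" -d -a 4 "$demo_dir/demo.log" "$demo_dir/demo.split.log"

# Count the part files that were actually written.
part_count=$(ls "$demo_dir"/demo.split.log* | wc -l)
echo "lines per part: $lines_per_part, parts created: $part_count"
```

With these numbers the leftover 3 lines end up in a sixth file, which is the case the note warns about.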


On Linux, there is a split command:

    split --lines=1m /path/to/large/file /path/to/output/file/prefix

Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT is -, read standard input.

...

-l, --lines=NUMBER put NUMBER lines per output file

...

You would have to calculate the actual size of the splits beforehand, though.


split has an option "--number=CHUNKS" that lets you divide a file into a number of chunks. This is from the (trimmed) output of "split --help":

  -n, --number=CHUNKS     generate CHUNKS output files; see explanation below

...

CHUNKS may be:
N       split into N files based on size of input
K/N     output Kth of N to stdout
l/N     split into N files without splitting lines
l/K/N   output Kth of N to stdout without splitting lines
r/N     like 'l' but use round robin distribution
r/K/N   likewise but only output Kth of N to stdout

To split a file into 5 parts, the command would be:

    split --number=l/5 inputfile outputprefix

This might not result in them having the same number of lines, though.

If you want them all to have the same number of lines up until the last one, you can use the following command:

    split -l $(( ($(cat "inputfile" | wc -l) + 5 - 1)/5 )) inputfile outputprefix

Both 5s here can be replaced with any other number (as long as they are the same).

Here's an explanation of this command piece by piece:

$( ) returns the output of whatever command you put into it. cat is used here to make sure wc only returns the number of lines without also outputting the input filename.

$(( )) evaluates whatever you put between the parentheses as a mathematical expression (using only integers) and returns the result.

($(cat "inputfile" | wc -l) + 5 - 1)/5 takes the line count of the input file, adds 5, subtracts 1, and divides the result by 5. Adding 5 - 1 before the integer division rounds the result up, so the lines always fit into at most the number of parts you want (5 in this case). For very short files the rounding can produce fewer parts: 11 lines give 3 lines per file and therefore only 4 files.
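The rounding step can be checked on its own, without running split. This is a minimal sketch; ceil_div is a hypothetical helper name, and 23 and 5 are example values only:

```shell
# Round-up integer division: (total + parts - 1) / parts.
# ceil_div is a hypothetical helper, not part of split or bash.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

floor_lines=$(( 23 / 5 ))      # plain integer division rounds down
ceil_lines=$(ceil_div 23 5)    # adding parts - 1 first rounds up

echo "floor: $floor_lines lines per file, ceil: $ceil_lines lines per file"
```

With 4 lines per file, 23 lines need a sixth file for the remainder; with 5 lines per file they fit into five files (5, 5, 5, 5 and 3 lines).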

You can also use split --number=r/5 to split it into five files, with the lines dealt out to them in turn, as in the following example:

inputfile.txt:
1
2
3
4
5
6
7
8
9

outputfile1:
1
6

outputfile2:
2
7

outputfile3:
3
8

outputfile4:
4
9

outputfile5:
5

This doesn't preserve the file order, but it can be useful in cases where that isn't important.
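The round-robin example above can be reproduced with a short sketch. It assumes GNU coreutils split (the r/N chunk mode is a GNU extension) and uses a scratch directory; note that numeric suffixes start at 0, so the files come out as outputfile0 to outputfile4 rather than 1 to 5:

```shell
# Assumes GNU coreutils split; r/N is not available in BSD split.
rr_dir=$(mktemp -d)
seq 9 > "$rr_dir/inputfile.txt"

# Deal the lines out to 5 files in turn; -d -a 1 gives the suffixes 0..4.
split --number=r/5 -d -a 1 "$rr_dir/inputfile.txt" "$rr_dir/outputfile"

# Print each output file on a single line to make the distribution visible.
for f in "$rr_dir"/outputfile*; do
    echo "$(basename "$f"): $(paste -sd ' ' "$f")"
done
```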


Assuming you are processing a text file, use wc -l to determine the total number of lines and split -l to split it into files of a specified number of lines (total / 5 in your case). This works on UNIX/Mac and on Windows (if you have Cygwin installed).


On macOS you can simply do:

split -n <number_of_parts> <filename>

For example, you can do

split -n 5 file.txt

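The file will be split into 5 files of similar size. Note that the BSD split shipped with macOS divides by bytes rather than lines, so a line may be cut across two files.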


This builds on the original answers given by @sketchytechky and @grasshopper. If you would like to deal with remainders differently and want a fixed number of output files with a round-robin distribution of lines, the split command should be written as:

    split -da 4 -n r/1024 filename filename_split --additional-suffix=".log"

Replace 1024 with the number of files you want as output.


Here's a one-liner with variables:

    file=onepiece.log; nsplit=5; len=$(wc -l < "$file"); split -l$(($len/$nsplit)) "$file" "$file.split" -da 4


I can think of a few ways to do it. Which you would use depends a lot on the data.

  1. Lines are fixed length: Find the size of the file by reading its directory entry and divide by the line length to get the number of lines. Use this to determine how many lines per file.

  2. The files only need to have approximately the same number of lines. Again, read the file size from the directory entry. Read the first N lines (N should be small but some reasonable fraction of the file) to calculate an average line length. Calculate the approximate number of lines based on the file size and predicted average line length. This assumes that the line length follows a normal distribution. If not, adjust your method to randomly sample lines (using seek() or something similar). Rewind the file after you have your average, then split it based on the predicted line length.

  3. Read the file twice. The first time count the number of lines. The second time splitting the file into the requisite pieces.

EDIT: Using a shell script (according to your comments), the randomized version of #2 would be hard unless you wrote a small program to do that for you. You should be able to use ls -l to get the file size, wc -l to count the exact number of lines, and head -nNNN | wc -c to calculate the average line length.
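The non-randomized estimate from #2 might look like the following sketch. The sample file, the variable names and the choice of 100 sample lines are all assumptions for illustration; it uses wc -c for the file size instead of parsing ls -l output:

```shell
# Hypothetical sample file whose lines get longer toward the end.
est_dir=$(mktemp -d)
seq 1 2000 > "$est_dir/sample.txt"

sample_lines=100                                   # N: number of leading lines to sample
file_bytes=$(wc -c < "$est_dir/sample.txt")        # total file size in bytes
sample_bytes=$(head -n "$sample_lines" "$est_dir/sample.txt" | wc -c)

# Estimated line count = file size / average line length
#                      = file_bytes * sample_lines / sample_bytes (integer arithmetic).
est_lines=$(( file_bytes * sample_lines / sample_bytes ))
exact_lines=$(wc -l < "$est_dir/sample.txt")

echo "estimated: $est_lines, exact: $exact_lines"
```

In this sample the first 100 lines are shorter than the average line, so the estimate comes out well above the exact count. That is the situation where sampling lines from across the file, rather than only from the start, would help.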


On Linux:

    split -n l/5 -da 2 test.txt
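The question also asks about sending each part to a different directory, which none of the commands above do on their own. One possible sketch, assuming GNU split and using hypothetical names (the scratch directory, part0 to part4, chunk.txt):

```shell
# Hypothetical input: 50 numbered lines in a scratch directory.
work_dir=$(mktemp -d)
seq 50 > "$work_dir/test.txt"

# Assumes GNU split: 5 chunks cut on line boundaries, suffixes 00..04.
split -n l/5 -d -a 2 "$work_dir/test.txt" "$work_dir/test.part"

# Move each part into its own directory (part0, part1, ...).
i=0
for part in "$work_dir"/test.part*; do
    mkdir -p "$work_dir/part$i"
    mv "$part" "$work_dir/part$i/chunk.txt"
    i=$((i + 1))
done
```

The glob expands in sorted order, so part0 holds the first chunk of the original file and part4 the last.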
