Splitting a large txt file into 200 smaller txt files on a regex using shell script in BASH
I hope the subject is clear enough; I haven't found anything specifically about this among the previously asked questions. I've tried implementing this in Perl and Python, but I think I may be trying too hard.
Is there a simple shell command / pipeline that will split my 4 MB .txt file into separate .txt files, based on a beginning and ending regex?
I provide a short sample of the file below, so you can see that every "story" starts with the phrase "X of XXX DOCUMENTS", which could be used to split the file.
I think this should be easy, and I'd be surprised if bash can't do it faster than Perl/Python.
Here it is:
1 of 999 DOCUMENTS
Copyright 2011 Virginian-Pilot Companies LLC
All Rights Reserved
The Virginian-Pilot(Norfolk, VA.)
...
3 of 999 DOCUMENTS
Copyright 2011 Canwest News Service
All Rights Reserved
Canwest News Service
...
Thanks in advance for all your help.
Ross
awk '/[0-9]+ of [0-9]+ DOCUMENTS/{g++} { print $0 > g".txt"}' file
OS X users will need gawk, as the built-in awk will produce an error like "awk: illegal statement at source line 1".
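If you would like the numbered output files to sort naturally in a directory listing, a zero-padded variant of the same awk idea might look like this (an untested sketch; the "part" prefix is just an example):
gawk '/[0-9]+ of [0-9]+ DOCUMENTS/ { g++ }                # a new story header: advance the counter
      { print $0 > (sprintf("part%04d.txt", g)) }         # append the line to the current part file
' file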
Ruby (1.9+)
#!/usr/bin/env ruby
g = 0
f = nil
open("file").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]   # new story header: close the current part, open the next
    f.close if f
    g += 1
    f = File.open("#{g}.txt", "w")
  end
  f.print line if f                 # skip anything before the first header
end
f.close if f
As suggested in other solutions, you could use csplit for that:
csplit csplit.test '/^\.\.\./' '{*}' && sed -i '/^\.\.\./d' xx*
I haven't found a better way to get rid of the leftover separator lines in the split files.
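Alternatively, since every story begins with the "X of XXX DOCUMENTS" line, you could split on that header directly instead of on the "..." separator (a sketch assuming GNU csplit; -z drops the empty piece before the first header, -f and -b control the output file names):
csplit -z -f story_ -b '%03d.txt' yourfile.txt '/[0-9][0-9]* of [0-9][0-9]* DOCUMENTS/' '{*}'
Each resulting story_NNN.txt then starts with its own DOCUMENTS header, so no follow-up sed pass is needed.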
How hard did you try in Perl?
Edit: Here is a faster method. It splits the whole file at once, then prints the part files.
use strict;
use warnings;

my $count = 1;
open (my $file, '<', 'source.txt') or die "Can't open source.txt: $!";

# Slurp the file and split it just before each "N of M DOCUMENTS" header
for (split /(?=^.*\d+[^\S\n]*of[^\S\n]*\d+[^\S\n]*DOCUMENTS)/m, join('', <$file>))
{
    # Strip the header line; $1 is the story number from "N of M"
    if ( s/^.*?(\d+)\s*of\s*\d+\s*DOCUMENTS.*(\n|$)//m )
    {
        open (my $part, '>', "Part$1_$count.txt")
            or die "Can't open Part$1_$count for output: $!";
        print $part $_;
        close ($part);
        $count++;
    }
}
close ($file);
This is the line-by-line method:
use strict;
use warnings;

open (my $masterfile, '<', 'yourfilename.txt') or die "Can't open yourfilename.txt: $!";
my $count = 1;
my $fh;

while (<$masterfile>) {
    if ( /(?<!\d)(\d+)\s*of\s*\d+\s*DOCUMENTS/ ) {
        defined $fh and close ($fh);
        open ($fh, '>', "Part$1_$count.txt") or die "Can't open Part$1_$count for output: $!";
        $count++;
        next;
    }
    defined $fh and print $fh $_;
}
defined $fh and close ($fh);
close ($masterfile);
The regex to match "X of XXX DOCUMENTS" is
\d{1,3} of \d{1,3} DOCUMENTS
Reading line by line and starting a new output file upon each regex match should be fine.
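If you want to sanity-check that pattern against the file before splitting, an extended-regex equivalent can be counted with grep (a quick sketch; substitute your actual filename):
grep -cE '[0-9]{1,3} of [0-9]{1,3} DOCUMENTS' yourfile.txt   # prints how many story headers were found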
Untested:
base=outputfile
start=0
pattern='^[[:blank:]]*[[:digit:]]+ of [[:digit:]]+ DOCUMENTS[[:blank:]]*$'
filecount=0000    # lines before the first header (if any) land here
while IFS= read -r line
do
    if [[ $line =~ $pattern ]]
    then
        ((start++))
        printf -v filecount '%04d' "$start"
        >"$base$filecount"              # create an empty file named like outputfile0001
    fi
    echo "$line" >> "$base$filecount"
done
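Since the loop reads from standard input, you would save it as a script and redirect the big file into it, for example (hypothetical file names):
bash split_stories.sh < source.txt
ls outputfile*    # one file per story: outputfile0001, outputfile0002, ...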