In Powershell, what's the most efficient way to split a large text file by record type?
I am using Powershell for some ETL work, reading compressed text files in and splitting them out depending on the first three characters of each line.
If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done with it. But I ne开发者_如何学JAVAed to redirect the output to more than one destination, and as far as I know this can't be done with a simple pipe. I'm already using a .NET streamreader to read the compressed input files, and I'm wondering if I need to use a streamwriter to write the output files as well.
The naive version looks something like this:
while (!$reader.EndOfFile) {
$line = $reader.ReadLine();
switch ($line.substring(0,3) {
"001" {Add-Content "output001.txt" $line}
"002" {Add-Content "output002.txt" $line}
"003" {Add-Content "output003.txt" $line}
}
}
That just looks like bad news: finding, opening, writing and closing a file once per row. The input files are huge 500MB+ monsters.
Is there an idiomatic way to handle this efficiently w/ Powershell constructs, or should I turn to the .NET streamwriter?
Are there methods of a (New-Item "path" -type "file") object I could use for this?
EDIT for context:
I'm using the DotNetZip library to read ZIP files as streams; thus streamreader
rather than Get-Content
/gc
. Sample code:
[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll")
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")
foreach ($entry in $zipfile) {
$reader = new-object system.io.streamreader $entry.OpenReader();
while (!$reader.EndOfFile) {
$line = $reader.ReadLine();
#do something here
}
}
I should probably Dispose()
of both the $zipfile and $reader, but that is for another question!
Reading
As for reading the file and parsing, I would go with switch
statement:
switch -file c:\temp\stackoverflow.testfile2.txt -regex {
"^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
"^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
"^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
}
I think it is better approach because
- there is support for regex, you don't have to make substring (which might be expensive) and
- the parameter
-file
is quite handy ;)
Writing
As for writing the output, I'll test to use streamwriter, however if performance of Add-Content
is decent for you, I would stick to it.
Added:
Keith proposed to use >>
operator, however, it seems that it is very slow. Besides that it writes output in Unicode which doubles the file size.
Look at my test:
[1]: (measure-command {
>> gc c:\temp\stackoverflow.testfile2.txt | %{$c = $_; switch ($_.Substring(0,3)) {
>> '001'{$c >> c:\temp\stackoverflow.testfile.001.txt} `
>> '002'{$c >> c:\temp\stackoverflow.testfile.002.txt} `
>> '003'{$c >> c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
159,1585874
[2]: (measure-command {
>> gc c:\temp\stackoverflow.testfile2.txt | %{$c = $_; switch ($_.Substring(0,3)) {
>> '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.txt} `
>> '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.txt} `
>> '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
9,2696923
The difference is huge.
Just for comparison:
[3]: (measure-command {
>> $reader = new-object io.streamreader c:\temp\stackoverflow.testfile2.txt
>> while (!$reader.EndOfStream) {
>> $line = $reader.ReadLine();
>> switch ($line.substring(0,3)) {
>> "001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $line}
>> "002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $line}
>> "003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $line}
>> }
>> }
>> $reader.close()
>> }).TotalSeconds
>>
8,2454369
[4]: (measure-command {
>> switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>> "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
>> "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
>> "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
>> }
>> }).TotalSeconds
8,6755565
Added: I was curious about the writing performance .. and I was a little bit surprised
[8]: (measure-command {
>> $sw1 = new-object io.streamwriter c:\temp\stackoverflow.testfile.001.txt3b
>> $sw2 = new-object io.streamwriter c:\temp\stackoverflow.testfile.002.txt3b
>> $sw3 = new-object io.streamwriter c:\temp\stackoverflow.testfile.003.txt3b
>> switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>> "^001" {$sw1.WriteLine($_)}
>> "^002" {$sw2.WriteLine($_)}
>> "^003" {$sw3.WriteLine($_)}
>> }
>> $sw1.Close()
>> $sw2.Close()
>> $sw3.Close()
>>
>> }).TotalSeconds
>>
0,1062315
It is 80 times faster.
Now you you have to decide - if speed is important, use StreamWriter
. If code clarity is important, use Add-Content
.
Substring vs. Regex
According to Keith Substring is 20% faster. It depends, as always. However, in my case the results are like this:
[102]: (measure-command {
>> gc c:\temp\stackoverflow.testfile2.txt | %{$c = $_; switch ($_.Substring(0,3)) {
>> '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.s.txt} `
>> '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.s.txt} `
>> '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.s.txt}}}
>> }).TotalSeconds
>>
9,0654496
[103]: (measure-command {
>> gc c:\temp\stackoverflow.testfile2.txt | %{$c = $_; switch -regex ($_) {
>> '^001'{$c | Add-content c:\temp\stackoverflow.testfile.001.r.txt} `
>> '^002'{$c | Add-content c:\temp\stackoverflow.testfile.002.r.txt} `
>> '^003'{$c | Add-content c:\temp\stackoverflow.testfile.003.r.txt}}}
>> }).TotalSeconds
>>
9,2563681
So the difference is not important and for me, regexes are more readable.
Given the size of input files, you definitely want to process a line at a time. I wouldn't think the re-opening/closing of the output files would be too huge a perf hit. It certainly makes the implemation possible using the pipeline even as a one-liner - really not too different from your impl. I wrapped it here to get rid of the horizontal scrollbar:
gc foo.log | %{switch ($_.Substring(0,3)) {
'001'{$input | out-file output001.txt -enc ascii -append} `
'002'{$input | out-file output002.txt -enc ascii -append} `
'003'{$input | out-file output003.txt -enc ascii -append}}}
精彩评论