In Powershell, what's the most efficient way to split a large text file by record type?

2022-12-18 05:32 问答作者：

I am using Powershell for some ETL work, reading compressed text files in and splitting them out depending on the first three characters of each line.

If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done with it. But I ne开发者_如何学JAVAed to redirect the output to more than one destination, and as far as I know this can't be done with a simple pipe. I'm already using a .NET streamreader to read the compressed input files, and I'm wondering if I need to use a streamwriter to write the output files as well.

The naive version looks something like this:

while (!$reader.EndOfFile) {
  $line = $reader.ReadLine();
  switch ($line.substring(0,3) {
    "001" {Add-Content "output001.txt" $line}
    "002" {Add-Content "output002.txt" $line}
    "003" {Add-Content "output003.txt" $line}
    }
  }

That just looks like bad news: finding, opening, writing and closing a file once per row. The input files are huge 500MB+ monsters.

Is there an idiomatic way to handle this efficiently w/ Powershell constructs, or should I turn to the .NET streamwriter?

Are there methods of a (New-Item "path" -type "file") object I could use for this?

EDIT for context:

I'm using the DotNetZip library to read ZIP files as streams; thus streamreader rather than Get-Content/gc. Sample code:

[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll") 
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")

foreach ($entry in $zipfile) {
  $reader = new-object system.io.streamreader $entry.OpenReader();
  while (!$reader.EndOfFile) {
    $line = $reader.ReadLine();
    #do something here
  }
}

I should probably Dispose() of both the $zipfile and $reader, but that is for another question!

Reading

As for reading the file and parsing, I would go with switch statement:

switch -file c:\temp\stackoverflow.testfile2.txt -regex {
  "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
  "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
  "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
}

I think it is better approach because

there is support for regex, you don't have to make substring (which might be expensive) and
the parameter -file is quite handy ;)

Writing

As for writing the output, I'll test to use streamwriter, however if performance of Add-Content is decent for you, I would stick to it.

Added: Keith proposed to use >> operator, however, it seems that it is very slow. Besides that it writes output in Unicode which doubles the file size.

Look at my test:

[1]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c >> c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c >> c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c >> c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
159,1585874
[2]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
9,2696923

The difference is huge.

Just for comparison:

[3]: (measure-command {
>>     $reader = new-object io.streamreader c:\temp\stackoverflow.testfile2.txt
>>     while (!$reader.EndOfStream) {
>>         $line = $reader.ReadLine();
>>         switch ($line.substring(0,3)) {
>>             "001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $line}
>>             "002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $line}
>>             "003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $line}
>>             }
>>         }
>>     $reader.close()
>> }).TotalSeconds
>>
8,2454369
[4]: (measure-command {
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
>>         "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
>>         "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
>>     }
>> }).TotalSeconds
8,6755565

Added: I was curious about the writing performance .. and I was a little bit surprised

[8]: (measure-command {
>>     $sw1 = new-object io.streamwriter c:\temp\stackoverflow.testfile.001.txt3b
>>     $sw2 = new-object io.streamwriter c:\temp\stackoverflow.testfile.002.txt3b
>>     $sw3 = new-object io.streamwriter c:\temp\stackoverflow.testfile.003.txt3b
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {$sw1.WriteLine($_)}
>>         "^002" {$sw2.WriteLine($_)}
>>         "^003" {$sw3.WriteLine($_)}
>>     }
>>     $sw1.Close()
>>     $sw2.Close()
>>     $sw3.Close()
>>
>> }).TotalSeconds
>>
0,1062315

It is 80 times faster. Now you you have to decide - if speed is important, use StreamWriter. If code clarity is important, use Add-Content.

Substring vs. Regex

According to Keith Substring is 20% faster. It depends, as always. However, in my case the results are like this:

[102]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.s.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.s.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.s.txt}}}
>> }).TotalSeconds
>>
9,0654496
[103]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch -regex ($_) {
>>             '^001'{$c | Add-content c:\temp\stackoverflow.testfile.001.r.txt} `
>>             '^002'{$c | Add-content c:\temp\stackoverflow.testfile.002.r.txt} `
>>             '^003'{$c | Add-content c:\temp\stackoverflow.testfile.003.r.txt}}}
>> }).TotalSeconds
>>
9,2563681

So the difference is not important and for me, regexes are more readable.

Given the size of input files, you definitely want to process a line at a time. I wouldn't think the re-opening/closing of the output files would be too huge a perf hit. It certainly makes the implemation possible using the pipeline even as a one-liner - really not too different from your impl. I wrapped it here to get rid of the horizontal scrollbar:

gc foo.log | %{switch ($_.Substring(0,3)) {
    '001'{$input | out-file output001.txt -enc ascii -append} `
    '002'{$input | out-file output002.txt -enc ascii -append} `
    '003'{$input | out-file output003.txt -enc ascii -append}}}

继续阅读：text-files

In Powershell, what's the most efficient way to split a large text file by record type?

Reading

Writing

Substring vs. Regex

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Reading

Writing

Substring vs. Regex

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？