Large file handling - Read algorithm breaks - C#

2022-12-09 09:15 问答作者：

So I've got an algorithm that reads from a (very large, ~155+ MB) binary file, parses it according to a spec and writes out the necessary info (to a CSV, flat text). It works flawlessly for the first 15.5 million lines of output, which produces a CSV file of ~0.99-1.03 GB. This gets through hardly over 20% of the binary file. After this it breaks, as in suddenly the printed data is not at all what is shown in the binary file. I checked the binary file, the same pattern continues (data split up into "packets" - see code below). Due to how it's handled, mem usage never really increases (steady ~15K). The functional code is listed below. Is it my algorithm (if so, why would it break after 15.5 million lines?!)... are there other implications I'm not considering due to the large file sizes? Any ideas?

(fyi: each "packet" is 77 bytes in length, beginning with a 3byte "startcode" and ending with a 5byte "endcode" - you'll see the pattern below)

edit code has been updated based on the suggestions below... thanks!

private void readBin(string theFile)
{
    List<int> il = new List<int>();
    bool readyForProcessing = false;

    byte[] packet = new byte[77];

    try
    {
        FileStream fs_bin = new FileStream(theFile, FileMode.Open);
        BinaryReader br = new BinaryReader(fs_bin);

        while (br.BaseStream.Position < br.BaseStream.Length && working)
        {
            // Find the first startcode
            while (!readyForProcessing)
            {
                // If last byte of endcode adjacent to first byte of startcod...
                // This never occurs outside of ending/starting s开发者_JAVA技巧o it's safe
                if (br.ReadByte() == 0x0a && br.PeekChar() == (char)0x16)
                    readyForProcessing = true;
            }

            // Read a full packet of 77 bytes
            br.Read(packet, 0, packet.Length);

            // Unnecessary I guess now, but ensures packet begins
            // with startcode and ends with endcode
            if (packet.Take(3).SequenceEqual(STARTCODE) &&
                packet.Skip(packet.Length - ENDCODE.Length).SequenceEqual(ENDCODE))
            {
                il.Add(BitConverter.ToUInt16(packet, 3)); //il.ElementAt(0) == 2byte id
                il.Add(BitConverter.ToUInt16(packet, 5)); //il.ElementAt(1) == 2byte semistable
                il.Add(packet[7]); //il.ElementAt(2) == 1byte constant

                for(int i = 8; i < 72; i += 2) //start at 8th byte, get 64 bytes
                    il.Add(BitConverter.ToUInt16(packet, i));

                for (int i = 3; i < 35; i++)
                {
                    sw.WriteLine(il.ElementAt(0) + "," + il.ElementAt(1) +
                        "," + il.ElementAt(2) + "," + il.ElementAt(i));
                }

                il.Clear();
            }
            else
            {
                // Handle "bad" packets
            }
        } // while

        fs_bin.Flush();
        br.Close();                
        fs_bin.Close();
    }
    catch (Exception e)
    {
        MessageBox.Show(e.ToString());
    }
}

Your code is silently catching any exception that happens in the while loop and swallowing it.

This is a bad practice because it masks issues like the one you are running into.

Most likely, one of the methods you call inside the loop (int.Parse() for example) is throwing an exception because it encounters some problem in the format of the data (or your assumptions about that format).

Once an exception occurs, the loop that reads data is thrown off kilter because it is no longer positioned at a record boundary.

You should do several things to make this code more resilient:

Don't silently swallow exception in the run loop. Deal with them.
Don't read data byte by byte or field by field in the loop. Since your records are fixed size (77 bytes) - read an entire record into a byte[] and then process it from there. This will help ensure you are always reading at a record boundary.

Don't put an empty generic catch block here and just silently catch and continue. You should check and see if you're getting an actual exception in there and go from there.
There is no need for the byteToHexString function. Just use the 0x prefix before a hexadecimal number and it will do a binary comparison.

i.e.

if(al[0] == 0x16 && al[1] == 0x3C && al[2] == 0x02)
{
    ...
}

I don't know what your doConvert function does (you didn't provide that source), but the BinaryReader class provides many different functions, one of which is ReadInt16. Unless your shorts are stored in an encoded format, that should be easier to use than doing your fairly obfuscated and confusing conversion. Even if they're encoded, it would still be far simpler to read the bytes in and manipulate them, rather than doing several roundtrips with converting to strings.

Edit

You appear to be making very liberal use of the LINQ extension methods (particularly ElementAt). Every time you call that function, it enumerates your list until it reaches that number. You'll have much better performing code (as well as less verbose) if you just use the built-in indexer on the list.

i.e. al[3] rather than al.ElementAt(3).

Also, you don't need to call Flush on an input Stream. Flush is used to tell the stream to write anything that it has in its write buffer to the underlying OS file handle. For an input stream it won't do anything.

I would suggest replacing your current sw.WriteLine call with this:

sw.WriteLine(BitConverter.ToString(packet)); and see if the data you're expecting on the row where it starts to mess up is actually what you're getting.

I would actually do this:

if (packet.Take(3).SequenceEqual(STARTCODE) &&
    packet.Skip(packet.Length - ENDCODE.Length).SequenceEqual(ENDCODE))
{
    ushort id = BitConverter.ToUInt16(packet, 3);
    ushort semistable = BitConverter.ToUInt16(packet, 5);
    byte contant = packet[7];

    for(int i = 8; i < 72; i += 2)
    {
        il.Add(BitConverter.ToUInt16(packet, i));
    }

    foreach(ushort element in il)
    {
        sw.WriteLine(string.Format("{0},{1},{2},{3}", id, semistable, constant, element);
    }

    il.Clear();
}
else
{
    //handle "bad" packets
}

继续阅读：.net large-files

Large file handling - Read algorithm breaks - C#

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？