Large file handling - Read algorithm breaks - C#
So I've got an algorithm that reads from a (very large, ~155+ MB) binary file, parses it according to a spec and writes out the necessary info (to a CSV, flat text). It works flawlessly for the first 15.5 million lines of output, which produces a CSV file of ~0.99-1.03 GB. This gets through hardly over 20% of the binary file. After this it breaks, as in suddenly the printed data is not at all what is shown in the binary file. I checked the binary file, the same pattern continues (data split up into "packets" - see code below). Due to how it's handled, mem usage never really increases (steady ~15K). The functional code is listed below. Is it my algorithm (if so, why would it break after 15.5 million lines?!)... are there other implications I'm not considering due to the large file sizes? Any ideas?
(fyi: each "packet" is 77 bytes in length, beginning with a 3byte "startcode" and ending with a 5byte "endcode" - you'll see the pattern below)
edit code has been updated based on the suggestions below... thanks!
private void readBin(string theFile)
{
List<int> il = new List<int>();
bool readyForProcessing = false;
byte[] packet = new byte[77];
try
{
FileStream fs_bin = new FileStream(theFile, FileMode.Open);
BinaryReader br = new BinaryReader(fs_bin);
while (br.BaseStream.Position < br.BaseStream.Length && working)
{
// Find the first startcode
while (!readyForProcessing)
{
// If last byte of endcode adjacent to first byte of startcod...
// This never occurs outside of ending/starting s开发者_JAVA技巧o it's safe
if (br.ReadByte() == 0x0a && br.PeekChar() == (char)0x16)
readyForProcessing = true;
}
// Read a full packet of 77 bytes
br.Read(packet, 0, packet.Length);
// Unnecessary I guess now, but ensures packet begins
// with startcode and ends with endcode
if (packet.Take(3).SequenceEqual(STARTCODE) &&
packet.Skip(packet.Length - ENDCODE.Length).SequenceEqual(ENDCODE))
{
il.Add(BitConverter.ToUInt16(packet, 3)); //il.ElementAt(0) == 2byte id
il.Add(BitConverter.ToUInt16(packet, 5)); //il.ElementAt(1) == 2byte semistable
il.Add(packet[7]); //il.ElementAt(2) == 1byte constant
for(int i = 8; i < 72; i += 2) //start at 8th byte, get 64 bytes
il.Add(BitConverter.ToUInt16(packet, i));
for (int i = 3; i < 35; i++)
{
sw.WriteLine(il.ElementAt(0) + "," + il.ElementAt(1) +
"," + il.ElementAt(2) + "," + il.ElementAt(i));
}
il.Clear();
}
else
{
// Handle "bad" packets
}
} // while
fs_bin.Flush();
br.Close();
fs_bin.Close();
}
catch (Exception e)
{
MessageBox.Show(e.ToString());
}
}
Your code is silently catching any exception that happens in the while loop and swallowing it.
This is a bad practice because it masks issues like the one you are running into.
Most likely, one of the methods you call inside the loop (int.Parse()
for example) is throwing an exception because it encounters some problem in the format of the data (or your assumptions about that format).
Once an exception occurs, the loop that reads data is thrown off kilter because it is no longer positioned at a record boundary.
You should do several things to make this code more resilient:
- Don't silently swallow exception in the run loop. Deal with them.
- Don't read data byte by byte or field by field in the loop. Since your records are fixed size (77 bytes) - read an entire record into a byte[] and then process it from there. This will help ensure you are always reading at a record boundary.
- Don't put an empty generic
catch
block here and just silently catch and continue. You should check and see if you're getting an actual exception in there and go from there. - There is no need for the
byteToHexString
function. Just use the0x
prefix before a hexadecimal number and it will do a binary comparison.
i.e.
if(al[0] == 0x16 && al[1] == 0x3C && al[2] == 0x02)
{
...
}
- I don't know what your
doConvert
function does (you didn't provide that source), but theBinaryReader
class provides many different functions, one of which isReadInt16
. Unless yourshort
s are stored in an encoded format, that should be easier to use than doing your fairly obfuscated and confusing conversion. Even if they're encoded, it would still be far simpler to read thebyte
s in and manipulate them, rather than doing several roundtrips with converting to strings.
Edit
You appear to be making very liberal use of the LINQ extension methods (particularly ElementAt
). Every time you call that function, it enumerates your list until it reaches that number. You'll have much better performing code (as well as less verbose) if you just use the built-in indexer on the list.
i.e. al[3]
rather than al.ElementAt(3)
.
Also, you don't need to call Flush
on an input Stream
. Flush
is used to tell the stream to write anything that it has in its write buffer to the underlying OS file handle. For an input stream it won't do anything.
I would suggest replacing your current sw.WriteLine
call with this:
sw.WriteLine(BitConverter.ToString(packet));
and see if the data you're expecting on the row where it starts to mess up is actually what you're getting.
I would actually do this:
if (packet.Take(3).SequenceEqual(STARTCODE) &&
packet.Skip(packet.Length - ENDCODE.Length).SequenceEqual(ENDCODE))
{
ushort id = BitConverter.ToUInt16(packet, 3);
ushort semistable = BitConverter.ToUInt16(packet, 5);
byte contant = packet[7];
for(int i = 8; i < 72; i += 2)
{
il.Add(BitConverter.ToUInt16(packet, i));
}
foreach(ushort element in il)
{
sw.WriteLine(string.Format("{0},{1},{2},{3}", id, semistable, constant, element);
}
il.Clear();
}
else
{
//handle "bad" packets
}
精彩评论