Processing badly formatted XML in .NET 3.5
A third-party system streams XML to me via TCP. The total transmitted XML content (not a single message from the stream, but all messages concatenated) looks like this:
<root>
<insert ....><remark>...</remark></insert>
<delete ....><remark>...</remark></delete>
<insert ....><remark>...</remark></insert>
....
<insert ....><remark>...</remark></insert>
</root>
Every line of the above sample can be processed individually. Since this is a streaming process, I cannot just wait until everything arrives; I have to process the content as it comes. The problem is that the content chunks can be sliced at any point, and tag boundaries are not respected. Do you have any good advice on how to process the content if it arrives in fragments like this?
Chunk 1:
<root>
<insert ....><rem
Chunk 2:
ark>...</remark></insert>
<delete ....><remark>...</remark></delete>
<insert ....><remark>...</rema
Chunk N:
rk></insert>
....
<insert ....><remark>...</remark></insert>
</root>
EDIT:
While processing speed is not a concern (no real-time requirements), I cannot wait for the entire message. In practice, the last chunk never arrives. The third-party system sends messages whenever it encounters changes, so the process never ends: it is a stream that never stops.
My first thought for this problem is to create a simple TextReader derivative that is responsible for buffering input from the stream. This class would then be used to feed an XmlReader. The TextReader derivative could fairly easily scan the incoming content looking for complete "blocks" of XML (a complete element with starting and ending brackets, a text fragment, a full attribute, etc.). It could also provide a flag to the calling code to indicate when one or more "blocks" are available so it can ask for the next XML node from the XmlReader, which would trigger sending that block from the TextReader derivative and removing it from the buffer.
Edit: Here's a quick and dirty example. I have no idea if it works perfectly (I haven't tested it), but it gets across the idea I was trying to convey.
using System;
using System.Collections.Generic;
using System.IO;

public class StreamingXmlTextReader : TextReader
{
    private readonly Queue<string> _blocks = new Queue<string>();
    private string _buffer = String.Empty;
    private string _currentBlock = null;
    private int _currentPosition = 0;

    // Appends content to the buffer and queues every complete block found.
    // Returns true if blocks are available and the XmlReader can go to the next XML node.
    public bool AddFromStream(string content)
    {
        // Here is where we would scan for simple blocks of XML.
        // This simple chunking algorithm just uses a closing angle bracket.
        // Not sure if/how well this will work in practice, but you get the idea.
        _buffer = _buffer + content;
        int start = 0;
        int end = _buffer.IndexOf('>');
        while (end != -1)
        {
            // Include the closing '>' in the block.
            _blocks.Enqueue(_buffer.Substring(start, end - start + 1));
            start = end + 1;
            end = _buffer.IndexOf('>', start);
        }
        // Store the leftover if there is any (empty when start == _buffer.Length).
        _buffer = _buffer.Substring(start);
        return BlocksAvailable;
    }

    // Lets the caller know if any blocks are currently available,
    // signaling the XmlReader can ask for another node.
    public bool BlocksAvailable { get { return _blocks.Count > 0; } }

    public override int Read()
    {
        if (_currentBlock != null && _currentPosition < _currentBlock.Length)
        {
            // Get the next character in this block.
            return _currentBlock[_currentPosition++];
        }
        if (BlocksAvailable)
        {
            _currentBlock = _blocks.Dequeue();
            // Return the first character and advance past it.
            _currentPosition = 1;
            return _currentBlock[0];
        }
        // Nothing buffered; note that readers treat -1 as end of input.
        return -1;
    }
}
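A simpler variation on the same buffering idea, as a rough sketch only: the question says every line of the sample can be processed individually, so if each message really does end with a newline (an assumption, not something the question confirms), the buffer can be split on newlines and each complete line parsed on its own. `LineAssembler` and its `Add` method are hypothetical names, and `XElement.Parse` comes from System.Xml.Linq, which ships with .NET 3.5.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: buffers raw text and yields one XElement per complete line.
// Assumes every message is terminated by '\n'.
public class LineAssembler
{
    private readonly StringBuilder _pending = new StringBuilder();

    // Appends a chunk and returns every element that the chunk completed.
    public List<XElement> Add(string chunk)
    {
        var result = new List<XElement>();
        _pending.Append(chunk);
        string text = _pending.ToString();
        int start = 0;
        int newline;
        while ((newline = text.IndexOf('\n', start)) != -1)
        {
            string line = text.Substring(start, newline - start).Trim();
            start = newline + 1;
            // Skip blank lines and the <root> wrapper, which is never closed in practice.
            if (line.Length == 0 || line == "<root>" || line == "</root>")
                continue;
            result.Add(XElement.Parse(line));
        }
        // Keep the incomplete tail for the next chunk.
        _pending.Length = 0;
        _pending.Append(text, start, text.Length - start);
        return result;
    }
}
```

Calling `Add` with each decoded chunk returns an empty list while a line is still incomplete and one element per line as soon as its newline arrives. A malformed line makes `XElement.Parse` throw an `XmlException`, so real code would probably want to catch that per line.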
After further investigation, we figured out that the XML stream was being sliced up by the TCP buffer whenever it got full. The slicing therefore happened at arbitrary points in the byte stream, causing cuts even inside Unicode characters. As a result, we had to assemble the parts at the byte level and convert them back to text. If the conversion failed, we waited for the next byte chunk and tried again.
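As an alternative to retrying the conversion by hand, a stateful `System.Text.Decoder` can carry the bytes of an incomplete character over to the next call. The following is a minimal sketch, not the approach described above: it assumes the stream is UTF-8 (the answer only says "unicode"), and `ChunkDecoder` is a hypothetical name.

```csharp
using System;
using System.Text;

// Hypothetical helper: turns arbitrarily sliced byte chunks into text.
// Assumes the sender encodes the stream as UTF-8.
public class ChunkDecoder
{
    // The Decoder keeps the trailing bytes of an incomplete character between calls.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    // Decodes the first 'count' bytes of 'chunk' and returns only the complete characters.
    public string Decode(byte[] chunk, int count)
    {
        char[] chars = new char[_decoder.GetCharCount(chunk, 0, count)];
        int written = _decoder.GetChars(chunk, 0, count, chars, 0);
        return new string(chars, 0, written);
    }
}
```

Each TCP read would be passed through `Decode` with the number of bytes actually received, and the resulting string handed to whatever assembles the XML. One decoder instance should be kept per connection, since the carried-over bytes belong to that stream only.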