开发者

Processing badly formatted XML in .NET3.5

Given a third party system that streams XML to me via TCP. The TOTAL transmitted XML content (not one message of the stream, but concatenated messages) looks like this :

   <root>
      <in开发者_C百科sert ....><remark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</remark></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

Every line of the above sample is individually processable. Since it is a streaming process, I cannot just wait out until everything arrives, I have to process the content as it comes. The problem is the content chunks can be sliced by any point, no tags are respected. Do you have some good advice on how to process the content if it arrives in fragments like this?

Chunk 1:

  <root>
      <insert ....><rem

Chunk 2:

                      ark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</rema

Chunk N:

                                    rk></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

EDIT:

While processing speed is not a concern (no realtime troubles), I cannot wait for the entire message. Practically the last chunk never arrives. The third party system sends messages whenever it encounters changes. The process never ends, it is a stream that never stops.


My first thought for this problem is to create a simple TextReader derivative that is responsible for buffering input from the stream. This class would then be used to feed an XmlReader. The TextReader derivative could fairly easily scan the incoming content looking for complete "blocks" of XML (a complete element with starting and ending brackets, a text fragment, a full attribute, etc.). It could also provide a flag to the calling code to indicate when one or more "blocks" are available so it can ask for the next XML node from the XmlReader, which would trigger sending that block from the TextReader derivative and removing it from the buffer.

Edit: Here's a quick and dirty example. I have no idea if it works perfectly (I haven't tested it), but it gets across the idea I was trying to convey.

public class StreamingXmlTextReader : TextReader
{
    private readonly Queue<string> _blocks = new Queue<string>();
    private string _buffer = String.Empty;
    private string _currentBlock = null;
    private int _currentPosition = 0;

    //Returns if there are blocks available and the XmlReader can go to the next XML node
    public bool AddFromStream(string content)
    {
        //Here is where we would can for simple blocks of XML
        //This simple chunking algorithm just uses a closing angle bracket
        //Not sure if/how well this will work in practice, but you get the idea
        _buffer = _buffer + content;
        int start = 0;
        int end = _buffer.IndexOf('>');
        while(end != -1)
        {
            _blocks.Enqueue(_buffer.Substring(start, end - start));
            start = end + 1;
            end = _buffer.IndexOf('>', start);
        }

        //Store the leftover if there is any
        _buffer = end < _buffer.Length
            ? _buffer.Substring(start, _buffer.Length - start) : String.Empty;

        return BlocksAvailable;
    }

    //Lets the caller know if any blocks are currently available, signaling the XmlReader can ask for another node
    public bool BlocksAvailable { get { return _blocks.Count > 0; } }

    public override int Read()
    {
        if (_currentBlock != null && _currentPosition < _currentBlock.Length - 1)
        {
            //Get the next character in this block
            return _currentBlock[_currentPosition++];
        }
        if(BlocksAvailable)
        {
            _currentBlock = _blocks.Dequeue();
            _currentPosition = 0;
            return _currentBlock[0];
        }
        return -1;
    }
}


After further investigation we figured out that the XML stream has been sliced up by the TCP buffer, whenever it got full. Therefore, slicing happened actually randomly in the byte stream causing cuts even inside unicode characters. Therefore, we had to assemble the parts on byte level and convert that back to text. Should converstion fail, we waited for the next byte chunk, and tried again.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜