Processing badly formatted XML in .NET 3.5

2023-03-14 18:19

A third-party system streams XML to me over TCP. The total transmitted XML content (not a single message of the stream, but all messages concatenated) looks like this:

   <root>
      <insert ....><remark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</remark></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

Every line of the above sample can be processed individually. Since this is a streaming process, I cannot wait until everything arrives; I have to process the content as it comes. The problem is that the chunks can be cut at any point, with no respect for tag boundaries. Do you have any good advice on how to process the content if it arrives in fragments like this?

Chunk 1:

  <root>
      <insert ....><rem

Chunk 2:

                      ark>...</remark></insert>
      <delete ....><remark>...</remark></delete>
      <insert ....><remark>...</rema

Chunk N:

                                    rk></insert>
      ....
      <insert ....><remark>...</remark></insert>
   </root>

EDIT:

While processing speed is not a concern (there are no real-time constraints), I cannot wait for the entire message. In practice, the last chunk never arrives. The third-party system sends messages whenever it encounters changes, so the stream never ends.

My first thought for this problem is to create a simple TextReader derivative that is responsible for buffering input from the stream. This class would then be used to feed an XmlReader. The TextReader derivative could fairly easily scan the incoming content looking for complete "blocks" of XML (a complete element with starting and ending brackets, a text fragment, a full attribute, etc.). It could also provide a flag to the calling code to indicate when one or more "blocks" are available so it can ask for the next XML node from the XmlReader, which would trigger sending that block from the TextReader derivative and removing it from the buffer.

Edit: Here's a quick and dirty example. I have no idea if it works perfectly (I haven't tested it), but it gets across the idea I was trying to convey.

using System;
using System.Collections.Generic;
using System.IO;

public class StreamingXmlTextReader : TextReader
{
    private readonly Queue<string> _blocks = new Queue<string>();
    private string _buffer = String.Empty;
    private string _currentBlock = null;
    private int _currentPosition = 0;

    //Returns true if there are blocks available and the XmlReader can go to the next XML node
    public bool AddFromStream(string content)
    {
        //Here is where we would scan for simple blocks of XML
        //This simple chunking algorithm just uses a closing angle bracket
        //Not sure if/how well this will work in practice, but you get the idea
        _buffer = _buffer + content;
        int start = 0;
        int end = _buffer.IndexOf('>');
        while (end != -1)
        {
            //Include the closing bracket itself in the block
            _blocks.Enqueue(_buffer.Substring(start, end - start + 1));
            start = end + 1;
            end = _buffer.IndexOf('>', start);
        }

        //Store the leftover if there is any (empty when the buffer ended on '>')
        _buffer = _buffer.Substring(start);

        return BlocksAvailable;
    }

    //Lets the caller know if any blocks are currently available, signaling the XmlReader can ask for another node
    public bool BlocksAvailable { get { return _blocks.Count > 0; } }

    public override int Read()
    {
        if (_currentBlock != null && _currentPosition < _currentBlock.Length)
        {
            //Get the next character in this block
            return _currentBlock[_currentPosition++];
        }
        if (BlocksAvailable)
        {
            //Start the next block; position 1 is the next character to hand out
            _currentBlock = _blocks.Dequeue();
            _currentPosition = 1;
            return _currentBlock[0];
        }
        //Nothing buffered right now; note that a consumer will treat -1 as end of input
        return -1;
    }
}
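Another way to sketch the same buffering idea is to cut the incoming text into complete child elements of the root before any XML parser sees it. The ElementSplitter class below is a hypothetical helper, not something from the original solution, and it rests on some loud assumptions: attribute values never contain '>', and the stream has no comments or CDATA sections.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: cuts complete children of the root element out of arbitrary chunks.
// Assumptions: no '>' inside attribute values, and no comments or CDATA sections.
public class ElementSplitter
{
    private string _buffer = String.Empty;
    private int _scanPos = 0;        // next position in _buffer to scan
    private int _depth = 0;          // number of currently open elements
    private int _elementStart = -1;  // where the current child of the root begins, or -1

    // Appends a chunk and returns every child of the root that this chunk completed.
    public List<string> Feed(string chunk)
    {
        List<string> complete = new List<string>();
        _buffer += chunk;

        while (true)
        {
            int lt = _buffer.IndexOf('<', _scanPos);
            if (lt == -1) { _scanPos = _buffer.Length; break; }
            int gt = _buffer.IndexOf('>', lt);
            if (gt == -1) { _scanPos = lt; break; }  // tag is cut off; wait for more data

            string tag = _buffer.Substring(lt, gt - lt + 1);
            if (tag.StartsWith("<?", StringComparison.Ordinal) ||
                tag.StartsWith("<!", StringComparison.Ordinal))
            {
                // XML declaration or DOCTYPE: skipped
            }
            else if (tag.StartsWith("</", StringComparison.Ordinal))
            {
                _depth--;
                if (_depth == 1 && _elementStart >= 0)
                {
                    // A direct child of the root has just been closed
                    complete.Add(_buffer.Substring(_elementStart, gt + 1 - _elementStart));
                    _elementStart = -1;
                }
            }
            else if (tag.EndsWith("/>", StringComparison.Ordinal))
            {
                // Self-closing direct child of the root
                if (_depth == 1) complete.Add(tag);
            }
            else
            {
                if (_depth == 1) _elementStart = lt;
                _depth++;
            }
            _scanPos = gt + 1;
        }

        // Drop everything already handled, keeping any unfinished element or tag
        int cut = _elementStart >= 0 ? _elementStart : _scanPos;
        _buffer = _buffer.Substring(cut);
        _scanPos -= cut;
        if (_elementStart >= 0) _elementStart = 0;
        return complete;
    }
}
```

Under those assumptions, each returned string is a complete element on its own, so it could be handed to XElement.Parse (System.Xml.Linq, available in .NET 3.5) without the parser ever seeing a half-finished tag.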

After further investigation we figured out that the XML stream was being sliced by the TCP buffer whenever it filled up. The cuts therefore fell at arbitrary positions in the byte stream, even inside multi-byte Unicode characters. We had to reassemble the parts at the byte level and then convert the result back to text. If the conversion failed, we waited for the next byte chunk and tried again.
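The byte-level reassembly could also be delegated to a stateful System.Text.Decoder, which holds back the trailing bytes of an incomplete character until the next chunk supplies the rest. This is a swapped-in alternative to the retry-on-failure approach, not what we actually ran. The sketch assumes the stream is UTF-8 (the encoding is not stated here), and ChunkDecoder is a hypothetical name.

```csharp
using System.Text;

// Hypothetical helper, not part of the original solution.
public class ChunkDecoder
{
    // Assumption: the third-party system sends UTF-8.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    // Decodes the first 'count' bytes of one TCP chunk. Bytes of a character
    // that is split across chunks stay inside the decoder and are completed
    // by the next call.
    public string Decode(byte[] chunk, int count)
    {
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(count)];
        int written = _decoder.GetChars(chunk, 0, count, chars, 0);
        return new string(chars, 0, written);
    }
}
```

With one ChunkDecoder instance per connection, the decoded strings can be passed straight on to whatever buffers the XML text, and no explicit retry loop is needed for split characters.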
