
Strip XML and HTML from a string

I have a string from which I need to strip all HTML and XML. I am not really good with regular expressions. For HTML I found some really useful code:

snippet = Regex.Replace(snippet, "<.*?>", "");

Currently I am doing this for XML:

while (snippet.IndexOf("<xml>") != -1)
{
    int startLoc = snippet.IndexOf("<xml>");
    int endLoc = snippet.IndexOf("</xml>");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 6);
}
while (snippet.IndexOf("<style>") != -1)
{
    int startLoc = snippet.IndexOf("<style>");
    int endLoc = snippet.IndexOf("</style>");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 8);
}
// only required for chrome and IE
// removes - <object  classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id="ieooui">
while (snippet.IndexOf("<object") != -1)
{
    int startLoc = snippet.IndexOf("<object");
    int endLoc = snippet.IndexOf("id=\"ieooui\">");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 12);
}
// removes - <object id="ieooui" classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D">
while (snippet.IndexOf("<object") != -1)
{
    int startLoc = snippet.IndexOf("<object");
    int endLoc = snippet.IndexOf("classid=\"clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D\"");
    snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 52);
}

This is very untidy. Can someone please suggest regular expressions for the XML as well, particularly for:

<object id="ieooui" classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D">

and

<object  classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id="ieooui">

Thanks a ton.


In general, you cannot reliably parse HTML with regular expressions. Technically you can, but as you say, the result will be "untidy". This task is usually done with a SAX parser, or even without one, using an HTML/XML tokenizer such as this one: http://www.codeproject.com/KB/recipes/HTML_XML_Scanner.aspx
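To illustrate the tokenizer idea, here is a minimal hand-rolled sketch in C#. It is not the scanner from the linked article; the class name MarkupStripper and the list of skipped elements are assumptions made for this example. It walks the string once, drops every tag, and additionally drops the whole content of elements such as <xml> and <style>. It assumes reasonably well-formed input: a '>' inside an attribute value, comments, and CDATA sections are not handled.

```csharp
using System;
using System.Text;

public static class MarkupStripper
{
    // Elements whose entire content (not just the tags) is dropped.
    // Assumption: chosen to mirror the <xml> and <style> loops in the question.
    private static readonly string[] SkippedElements = { "xml", "style", "script" };

    public static string Strip(string input)
    {
        var output = new StringBuilder(input.Length);
        int pos = 0;
        while (pos < input.Length)
        {
            int open = input.IndexOf('<', pos);
            if (open < 0)
            {
                // No more tags: keep the rest of the text.
                output.Append(input, pos, input.Length - pos);
                break;
            }
            // Keep the text that precedes the tag.
            output.Append(input, pos, open - pos);

            int close = input.IndexOf('>', open + 1);
            if (close < 0) break; // unterminated tag: drop the remainder

            string name = ReadTagName(input, open + 1, close);
            pos = close + 1;

            // For elements like <style>, also jump past the matching end tag.
            if (Array.IndexOf(SkippedElements, name) >= 0)
            {
                int end = input.IndexOf("</" + name, pos, StringComparison.OrdinalIgnoreCase);
                if (end < 0) break; // no end tag: drop the remainder
                int endClose = input.IndexOf('>', end);
                pos = endClose < 0 ? input.Length : endClose + 1;
            }
        }
        return output.ToString();
    }

    // Returns the lower-cased element name of a start tag,
    // or "" for end tags, comments, and declarations.
    private static string ReadTagName(string input, int start, int close)
    {
        int i = start;
        while (i < close && (char.IsLetterOrDigit(input[i]) || input[i] == ':')) i++;
        return input.Substring(start, i - start).ToLowerInvariant();
    }
}
```

With this sketch, snippet = MarkupStripper.Strip(snippet); would replace all four loops from the question, because both orderings of the <object ...> attributes are handled as ordinary tags no matter what attributes they carry. For anything beyond simple cleanup, a real HTML parser is the safer choice.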
