Strip XML and HTML from a string
I have a string from which I need to strip all HTML and XML. I am not really good with regular expressions. For HTML I found some really useful code:
snippet = Regex.Replace(snippet, "<.*?>", "");
Currently I am doing this for XML:
while (snippet.IndexOf("<xml>") != -1)
{
int startLoc = snippet.IndexOf("<xml>");
int endLoc = snippet.IndexOf("</xml>");
snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 6);
}
while (snippet.IndexOf("<style>") != -1)
{
int startLoc = snippet.IndexOf("<style>");
int endLoc = snippet.IndexOf("</style>");
snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 8);
}
// only required for chrome and IE
// removes - <object classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id="ieooui">
while (snippet.IndexOf("<object") != -1)
{
int startLoc = snippet.IndexOf("<object");
int endLoc = snippet.IndexOf("id=\"ieooui\">");
snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 12);
}
// removes - <object id="ieooui" classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D">
while (snippet.IndexOf("<object") != -1)
{
int startLoc = snippet.IndexOf("<object");
int endLoc = snippet.IndexOf("classid=\"clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D\"");
snippet = snippet.Remove(startLoc, (endLoc - startLoc) + 52);
}
This is very untidy. Can someone please suggest a regular expression for the XML as well, particularly for:
<object id="ieooui" classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D">
and
<object classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id="ieooui">
Thanks a ton.
In general, you cannot reliably parse HTML with a regular expression. Technically you can, but as you say, it will be "untidy". That task is usually done with a SAX parser, or even without one by using an HTML/XML tokenizer, like this one: http://www.codeproject.com/KB/recipes/HTML_XML_Scanner.aspx
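If you still want to stay with regular expressions for a quick cleanup, here is a minimal sketch of what the loops in the question could collapse into. The class name MarkupStripper and the method name Strip are made up for illustration. It also assumes the input is simple pasted markup: a ">" inside an attribute value, a comment, or a CDATA section will still confuse it, which is exactly why a tokenizer is the safer route.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class MarkupStripper
{
    // IgnoreCase: also match <STYLE>, <Object>, and so on.
    // Singleline: let '.' match newlines so multi-line blocks are caught.
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Singleline;

    public static string Strip(string snippet)
    {
        // Remove <xml>...</xml> and <style>...</style> including their contents.
        // \1 refers back to whichever element name opened the block.
        snippet = Regex.Replace(snippet, @"<(xml|style)\b[^>]*>.*?</\1\s*>", "", Options);

        // Remove any <object ...> start tag, whatever the attribute order.
        snippet = Regex.Replace(snippet, @"<object\b[^>]*>", "", Options);

        // Remove every remaining tag, as in the one-liner from the question.
        return Regex.Replace(snippet, @"<.*?>", "", Options);
    }
}
```

For example, MarkupStripper.Strip("a<xml>x</xml>b<b>c</b>") returns "abc". Note that the last replace already removes both <object> variants from the question, since they are ordinary tags; the separate <object> step is only shown because it was asked for. The <xml> and <style> blocks do need their own pattern, because there the text between the tags has to go as well.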