Regex to remove HTML-head-tag
How can I remove the entire head tag from an HTML file with NSRegularExpression? Can someone give me a regex?
Thanks in advance, Ph99Ph
There is none! HTML is a context-free (Chomsky type-2) language, so it cannot be parsed with a regular expression, which only recognizes regular (type-3) languages.
See this wiki article in case of doubt.
Lots of people use regex for parsing/editing HTML. This works quite well in simple cases but is utterly error prone.
This being said: You should have fairly reliable results with this regex:
<head>.+?</head>
This requires "." to also match line breaks. If it doesn't, then use this:
<head>(?:.|\n|\r)+?</head>
Again: This is error prone, don't do it.
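To make both the regex and the warning concrete, here is a minimal sketch in Python rather than NSRegularExpression. The function name strip_head and the sample documents are made up for illustration. re.DOTALL plays the role of the "dot matches line breaks" setting (NSRegularExpression has a comparable dotMatchesLineSeparators option).

```python
import re

# The pattern from the answer above; re.DOTALL makes "." match line breaks too.
HEAD_PATTERN = re.compile(r"<head>.+?</head>", re.DOTALL)

def strip_head(html):
    """Remove the first-to-first <head>...</head> span found by the regex."""
    return HEAD_PATTERN.sub("", html)

# Simple case: a plain head element spanning several lines is removed cleanly.
simple = "<html><head>\n<title>t</title>\n</head><body>x</body></html>"
print(strip_head(simple))

# Failure case: the lazy match stops at the "</head>" inside the script string,
# so part of the script and the real closing tag are left behind.
tricky = '<html><head><script>var s = "</head>";</script></head><body>x</body></html>'
print(strip_head(tricky))
```

The second call shows why the approach is fragile: the regex has no notion of script content, comments or attributes, so any "</head>" text inside the head ends the match early. An opening tag with attributes, such as <head lang="en">, would not match at all.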
What you should use is an XML parser such as NSXMLParser, provided the document is well-formed XML (XHTML).
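As a rough illustration of the parser-based route, the sketch below uses Python's standard html.parser module as a stand-in for NSXMLParser; it is not the NSXMLParser API. The names HeadLocator, to_index and remove_head are hypothetical. The idea is to let the parser find where the head element starts and ends, then cut that span out of the original text.

```python
from html.parser import HTMLParser

class HeadLocator(HTMLParser):
    """Records the (line, column) positions of the first <head> and its </head>."""

    def __init__(self):
        super().__init__()
        self.start = None  # position of the opening <head ...> tag
        self.end = None    # position of the closing </head> tag

    def handle_starttag(self, tag, attrs):
        if tag == "head" and self.start is None:
            self.start = self.getpos()

    def handle_endtag(self, tag):
        if tag == "head" and self.start is not None and self.end is None:
            self.end = self.getpos()

def to_index(source, pos):
    """Convert a (1-based line, 0-based column) pair into a string index."""
    line, col = pos
    lines = source.split("\n")
    return sum(len(text) + 1 for text in lines[:line - 1]) + col

def remove_head(source):
    """Return source without its head element; unchanged if none is found."""
    locator = HeadLocator()
    locator.feed(source)
    locator.close()
    if locator.start is None or locator.end is None:
        return source
    begin = to_index(source, locator.start)
    close_start = to_index(source, locator.end)
    close_end = source.index(">", close_start) + 1  # end of the </head> tag
    return source[:begin] + source[close_end:]

doc = ('<html><head>\n<script>var s = "</head>";</script>\n'
       '<title>t</title></head><body>x</body></html>')
print(remove_head(doc))
```

Because the parser treats script content and comments as opaque data, a "</head>" that appears inside them is not mistaken for the real closing tag.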
Please see the accepted answer at RegEx match open tags except XHTML self-contained tags. Or any version of this exact same question posted each day since the beginning of Stack Overflow.
In short, you cannot reliably parse HTML with Regular Expressions. RegEx is simply not advanced enough because of the complexities of HTML.
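Use something like this:
// Normalize the opening tag (drop whitespace and attributes); \b keeps <header> from matching.
result = System.Text.RegularExpressions.Regex.Replace(result,
    @"<\s*head\b[^>]*>", "<head>",
    System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Normalize the closing tag.
result = System.Text.RegularExpressions.Regex.Replace(result,
    @"<\s*/\s*head\s*>", "</head>",
    System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove the whole element. Singleline lets "." match line breaks; ".*?" stops at the first </head>.
result = System.Text.RegularExpressions.Regex.Replace(result,
    "<head>.*?</head>", " ",
    System.Text.RegularExpressions.RegexOptions.IgnoreCase |
    System.Text.RegularExpressions.RegexOptions.Singleline);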