Regex to escape angle brackets that are not part of HTML tags
I have HTML-based text (with HTML tags). I want to find words that occur within angle brackets and replace the brackets with &lt; and &gt;, and do the same when angle brackets are used as math symbols.
e.g:
String text = "Hello, <b> Whatever <br /> <table> <tr> <td width=\"300px\"> "
    + "1 < 2 This is a <test> </td> </tr> </table>";
I want this to be :
Hello, <b> Whatever <br /> <table> <tr> <td width="300px">
1 &lt; 2 This is a &lt;test&gt; </td> </tr> </table>
THANKS in advance
I would suggest using HtmlCleaner.
If you look at its home page, the example there shows exactly how text is escaped.
<td><a href=index.html>1 -> Home Page</a>
is converted into
<td>
<a href="index.html">1 -&gt; Home Page</a>
</td>
It will normalize your HTML to conform to standard XHTML. I have used it in the past and (IMHO) it's pretty solid and more reliable than jTidy & Co. (and of course it's better than using regex or replace strategies...)
Please see "RegEx match open tags except XHTML self-contained tags" and don't use regex to parse HTML. Use an SGML parser, not regex. It would fail too often. HTML isn't a regular language.
If it were not for CSS, Javascript, and CData sections, it would be possible.
If you are only dealing with a subset of HTML, you could make the assumption that angle brackets not surrounded by valid element identifier characters can be encoded.
Something like "<(?=[^A-Za-z_:0-9/])" -> "&lt;" and "(?<=[^A-Za-z_:0-9/])>" -> "&gt;"
But, unless you are generating the HTML yourself and KNOW that it has no embedded CSS, javascript, CData, or object sections...
As fraido said, don't use regular expressions for non-regular languages.
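For illustration, the two substitutions suggested above could be applied in Java roughly as follows. This is only a minimal sketch under the same "subset of HTML" assumption; the class name LookaroundEscaper and the method escape are made up for the example and do not come from any library.

```java
import java.util.regex.Pattern;

// Hypothetical helper: applies the two lookaround substitutions from the answer above.
final class LookaroundEscaper {
    // A '<' followed by a character that cannot start a tag name (and is not '/').
    private static final Pattern LOOSE_LT = Pattern.compile("<(?=[^A-Za-z_:0-9/])");
    // A '>' preceded by a character that cannot end a tag name (and is not '/').
    private static final Pattern LOOSE_GT = Pattern.compile("(?<=[^A-Za-z_:0-9/])>");

    static String escape(String html) {
        // Escape the loose '<' first, then the loose '>'.
        String step = LOOSE_LT.matcher(html).replaceAll("&lt;");
        return LOOSE_GT.matcher(step).replaceAll("&gt;");
    }
}
```

Calling LookaroundEscaper.escape("<b>1 < 2</b> and 3 > 2") gives "<b>1 &lt; 2</b> and 3 &gt; 2". Note the limits of the heuristic: a tag that ends in a quoted attribute value, such as <td width="300px">, has its closing bracket escaped because the character before it is a quote, and a word wrapped in brackets such as 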
As everyone says, you shouldn't rely on Regular Expressions to parse HTML. They simply can't do it. But, in my case, I wanted to capture any angle brackets that didn't look like they were in an HTML tag, and escape them. Since everything was going through a sanitizer afterwards security wasn't a concern, and the results just needed to be good enough to catch most situations, not all.
You need a Regexp Library that supports zero-width lookahead assertions. In my case, that was Oniguruma in Ruby 1.8.
To match the less than symbols (<), I did:
/<(?!(/?[A-Za-z_:0-9]+\s?/?>))/
Matching the greater than (>) symbols is harder. Most libraries don't support zero-width lookbehind assertions of a variable length. So you cheat: reverse the string, run a lookahead assertion, and reverse it back afterwards, using the following pattern:
>(?!(/?\s?[A-Za-z_:0-9]+/?<))
So, my code looks a bit like:
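# Escape every '<' that does not open something shaped like a simple tag.
match_less_than = Oniguruma::ORegexp.new('<(?!(/?[A-Za-z_:0-9]+\s?/?>))')
match_less_than.gsub!(string, '&lt;')
# Reverse the string so a lookahead can stand in for the lookbehind, then reverse it back.
match_greater_than = Oniguruma::ORegexp.new('>(?!(/?\s?[A-Za-z_:0-9]+/?<))')
string = match_greater_than.gsub(string.reverse, '&gt;'.reverse).reverse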
Nasty, huh?
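Since the question was asked about Java, here is a rough Java sketch of the same reverse-and-lookahead trick, using the two patterns from this answer unchanged. The class name ReversingEscaper and the method escape are hypothetical names chosen for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: mirrors the Ruby snippet above with java.util.regex.
final class ReversingEscaper {
    // A '<' that is not followed by something shaped like a simple tag.
    private static final Pattern LT_NOT_TAG =
        Pattern.compile("<(?!(/?[A-Za-z_:0-9]+\\s?/?>))");
    // Run against the reversed string: a '>' not followed by a reversed simple tag.
    private static final Pattern GT_NOT_TAG_REVERSED =
        Pattern.compile(">(?!(/?\\s?[A-Za-z_:0-9]+/?<))");

    static String escape(String html) {
        String ltEscaped = LT_NOT_TAG.matcher(html).replaceAll("&lt;");
        // Reverse, escape with the reversed entity (";tg&"), then reverse back.
        String reversed = new StringBuilder(ltEscaped).reverse().toString();
        String gtEscaped = GT_NOT_TAG_REVERSED.matcher(reversed).replaceAll(";tg&");
        return new StringBuilder(gtEscaped).reverse().toString();
    }
}
```

With this sketch, ReversingEscaper.escape("<b>1 < 2</b> and 3 > 2 <br />") returns "<b>1 &lt; 2</b> and 3 &gt; 2 <br />". The patterns only recognize tags without attributes, so <td width="300px"> comes back as &lt;td width="300px"&gt;. That is consistent with the "good enough, not all" goal stated above, but worth knowing before using it on attribute-heavy markup.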