JavaScript regex won't work, but works in C# (atomic subexpression)
I have a regex that I tested in Expresso, and it works like a charm. But when I try to use it in JavaScript, it gives an error. Firebug says:
invalid quantifier ?><div\b[^>]*>(?<DEPTH>)|<\/div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))<\/div>
the regex:
<div\b[^>]*>(?><div\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>
The regex matches nested html-divs such as:
<div id="foo"><div>blubb</div><div foobar>blubb</div></div>
Is the javascript regex only a subset?
edit: I have to strip the div's, including the text between them, away.
<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some
non html...
only the "some non html..." should stay. So I think I can't use any htmlparser?
Is the javascript regex only a subset?
No, they are different - there are a variety of Regular Expression engines out there, and they each have different features/quirks.
C# has more regex features than JavaScript, but the JavaScript engine is not derived from C#'s, so it isn't a subset.
Here are a couple of pages that document the differences:
- http://www.regular-expressions.info/refflavors.html
- http://www.regular-expressions.info/refext.html
And that whole website (regular-expressions.info) is well worth browsing to learn more about regex.
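To check quickly whether a particular construct exists in the JavaScript flavour, you can try to compile it and see whether the engine throws. This is only a minimal sketch; compilesInJs is a made-up helper name, and the sample patterns are shortened stand-ins for the pieces of the regex in the question.

```javascript
// Hypothetical helper (not from the original posts): reports whether the
// JavaScript engine running this code can compile a given pattern.
function compilesInJs(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    // A pattern the engine cannot parse is reported as a SyntaxError.
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

console.log(compilesInJs('(?>a|ab)c'));      // atomic group (.NET syntax)
console.log(compilesInJs('(?(DEPTH)(?!))')); // conditional on a group (.NET syntax)
console.log(compilesInJs('(?:a|ab)c'));      // plain non-capturing group
```

The atomic group and the conditional are the parts of the original pattern that JavaScript rejects, which is what the "invalid quantifier" message from Firebug is pointing at.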
The regex matches nested html-divs
It probably doesn't, not in all cases.
And it certainly won't be possible with a single JS regex, since JavaScript has no equivalent of the .NET balancing groups (?<DEPTH>) and (?<-DEPTH>) used for the depth counting, amongst other things.
You're using the wrong tool for this job - parsing HTML should be done with a proper HTML parser/selector, then analysing the DOM to find the nested divs.
Anything that implements Sizzle should do (e.g. jQuery, Dojo Toolkit, and others).
For example, something like jQuery('div:has(div)')
or dojo.query('div:has(div)')
or similar, should find nested divs (i.e. select all divs which have a div nested inside them), and will correctly cope with assorted quirks which can be complex if not impossible with a single regex.
edit: I have to strip the div's, including the text between them, away.
<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some non html...
only the "some non html..." should stay. So I think I can't use any htmlparser?
No - that is even more reason to use an HTML parser, and not attempt a messy regex hack.
jQuery('#foo div').remove()
That will remove all child DIVs, and leave the HTML text node in place.
Depending on your precise requirements, the selector might need changing, but this is absolutely a task for a tool that is designed to understand HTML.
Of course, today's JavaScript doesn't support atomic groups or recursive regexes, but you could easily build a quick-and-dirty solution by stripping the tags from the HTML source piece by piece in a loop. If other solutions are too complicated and the structure of the documents is predictable, you could do something like:
function stripme(tag, code)
{
    var strp = code;
    // (.*) is greedy, so matching involves backtracking to the last closing tag.
    var regexp = new RegExp('<' + tag + '[^>]*?>(.*)</' + tag + '>');
    // Repeat until no <tag>...</tag> pair is left; the contents captured by
    // (.*) are available in RegExp.$1 if needed.
    while (strp.match(regexp))
        strp = strp.replace(regexp, '');
    return strp;
}
This will work with, for example:
var html ='<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some non html...';
when invoked with, eg.:
window.onload = function() { var stripped=stripme('div', html); alert(stripped); }
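One caveat with the greedy (.*) above: if the input has two separate top-level blocks with text between them, the match runs from the first opening tag to the last closing tag, so the text in between is removed as well. If that matters, the depth counting that the .NET regex does with its DEPTH group can be done by hand. The following is only a rough sketch under the same "predictable structure" assumption; stripBalanced is a hypothetical name, the tag is assumed to be a plain name like div, and attribute values containing ">" are not handled.

```javascript
// Hypothetical helper: removes every balanced <tag>...</tag> block,
// including nested ones, by counting nesting depth while scanning tags.
function stripBalanced(tag, code) {
  // Group 1 is "/" for a closing tag and empty for an opening tag.
  var token = new RegExp('<(/?)' + tag + '\\b[^>]*>', 'gi');
  var out = '';
  var depth = 0;
  var keepFrom = 0; // start of the text that has not been copied to out yet
  var openAt = -1;  // index of the outermost opening tag currently open
  var m;
  while ((m = token.exec(code)) !== null) {
    if (m[1] !== '/') {
      if (depth === 0) openAt = m.index;
      depth++;
    } else if (depth > 0) {
      depth--;
      if (depth === 0) {
        // Outermost block just closed: keep the text before it, skip the block.
        out += code.slice(keepFrom, openAt);
        keepFrom = token.lastIndex;
      }
    }
    // A closing tag seen at depth 0 is ignored and stays in the output.
  }
  // Anything after the last removed block, including an unclosed block, is kept.
  return out + code.slice(keepFrom);
}

var sample = '<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some non html...';
console.log(stripBalanced('div', sample));
```

It makes a single pass over the string and only removes blocks whose opening and closing tags balance, so text between sibling blocks survives.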
BTW, if possible, always use a DOM parser or JavaScript library, as recommended by Peter Boughton.
Regards
rbo