JavaScript regex won't work, but works in C# (atomic subexpression)

I have a regex that I tested in Expresso, where it works like a charm. But when I try to use it in JavaScript, it gives an error. Firebug says:

invalid quantifier ?><div\b[^>]*>(?<DEPTH>)|<\/div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))<\/div>

the regex:

<div\b[^>]*>(?><div\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>

The regex matches nested html-divs such as:

<div id="foo"><div>blubb</div><div foobar>blubb</div></div>

Is the javascript regex only a subset?

edit: I have to strip out the divs, including the text between them.

<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some
non html...

only the "some non html..." should stay. So I think I can't use any htmlparser?


Is the javascript regex only a subset?

No, they are different - there are a variety of Regular Expression engines out there, and they each have different features/quirks.

C#'s regex engine has more features than JavaScript's, but the JS engine is not derived from the C# one, so it isn't a subset.

Here's a couple of pages that document the differences:

  • http://www.regular-expressions.info/refflavors.html
  • http://www.regular-expressions.info/refext.html

And that whole website (regular-expressions.info) is well worth browsing to learn more about regex.
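To see the difference concretely, here is a minimal sketch that asks the JavaScript engine itself whether it accepts the .NET-only constructs from the question. The helper name compiles is made up for this illustration, and the results reflect engines at the time of writing:

```javascript
// Hypothetical helper: reports whether this JavaScript engine accepts a pattern.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    // Unsupported syntax is reported as a SyntaxError when the regex is built.
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

var features = {
  atomicGroup: compiles('(?>a)'),            // .NET atomic group
  balancingGroup: compiles('(?<-DEPTH>)'),   // .NET balancing group
  conditional: compiles('(?(DEPTH)(?!))'),   // .NET conditional
  nonCapturingGroup: compiles('(?:a)')       // supported by both flavors
};
console.log(features);
```

Only the last entry should come back true, which is why the pattern has to be restructured rather than copied over from C#.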


The regex matches nested html-divs

It probably doesn't, not in all cases.

And it certainly won't be possible with a single JS regex, since JS doesn't support balancing groups (the DEPTH construct), among other things.

You're using the wrong tool for this job - parsing HTML should be done with a proper HTML parser/selector, then analysing the DOM to find the nested divs.

Any library with a CSS selector engine such as Sizzle should do (e.g. jQuery, Dojo Toolkit, and others).

For example, something like jQuery('div:has(div)') or dojo.query('div:has(div)') or similar, should find nested divs (i.e. select all divs which have a div nested inside them), and will correctly cope with assorted quirks which can be complex if not impossible with a single regex.


edit: I have to strip the div's, including the text between them, away.
<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some non html...
only the "some non html..." should stay. So I think I can't use any htmlparser?

No - that is even more reason to use an HTML parser, and not attempt a messy regex hack.

jQuery('#foo div').remove()

That will remove all DIVs nested inside #foo, and leave the HTML text node in place.

Depending on your precise requirements, the selector might need changing, but this is absolutely a task for a tool that is designed to understand HTML.
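If the HTML arrives as a string rather than as part of the page, the same idea can be sketched without jQuery. This assumes a browser environment where DOMParser is available; the function name stripDivs is made up for this example:

```javascript
// Hypothetical helper; assumes a browser, where DOMParser exists.
function stripDivs(html) {
  // Parse the string into a detached document instead of the live page.
  var doc = new DOMParser().parseFromString(html, 'text/html');
  // Remove every div; removing an outer div discards its nested divs too.
  doc.body.querySelectorAll('div').forEach(function (div) {
    div.remove();
  });
  // Whatever is left in the body is the content outside the divs.
  return doc.body.innerHTML;
}
```

Called with the sample string from the question, this should return only the trailing "some non html..." text. Note that innerHTML returns serialized HTML, so characters such as & come back escaped.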


Of course, today's JavaScript doesn't support atomic groups or recursive regexes, but you could easily build a quick-and-dirty solution that strips the tags from the HTML source piece by piece. If other solutions are too complicated and the structure of the documents is predictable, you could do something like:

 function stripme(tag, code)
{
 // The greedy (.*) backtracks from the end of the line, so one match spans
 // from the first opening tag to the last closing tag, nested tags included.
 // \b keeps 'div' from also matching longer tag names that start with 'div'.
 var regexp = new RegExp('<'+tag+'\\b[^>]*?>(.*)</'+tag+'>');
 var strp = code;
 // '.' does not match newlines, so blocks on separate lines need the loop.
 while( regexp.test(strp) )
    strp = strp.replace(regexp, '');
 return strp;
}

This will work with, for example:

 var html ='<div id="foo"><div>blubb</div><div foobar>blubb</div></div>some non html...';

when invoked with, e.g.:

 window.onload = function() { var stripped=stripme('div', html); alert(stripped); }
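One limitation of the greedy (.*) in stripme is that text sitting between two sibling top-level blocks on the same line is removed as well. As a rough sketch of an alternative, the depth counting that the DEPTH group performs in the .NET pattern can be done by hand in a loop. The name stripNested is made up for this example, and the sketch ignores comments, self-closing tags, and attribute values that contain '>':

```javascript
// Hypothetical helper: removes every outermost <tag>...</tag> block,
// including nested blocks of the same tag, by counting nesting depth.
function stripNested(tag, code) {
  // Matches an opening or a closing tag; group 1 is '/' for closing tags.
  var tagRe = new RegExp('<(/?)' + tag + '\\b[^>]*>', 'gi');
  var out = '';
  var depth = 0;
  var last = 0; // index just after the most recently removed block
  var m;
  while ((m = tagRe.exec(code)) !== null) {
    if (m[1] === '') {
      // Opening tag: on entering an outermost block, keep the text before it.
      if (depth === 0) out += code.slice(last, m.index);
      depth++;
    } else if (depth > 0) {
      // Closing tag: on leaving an outermost block, resume copying after it.
      depth--;
      if (depth === 0) last = tagRe.lastIndex;
    }
  }
  // An unclosed block swallows the rest; otherwise keep the remaining tail.
  return depth === 0 ? out + code.slice(last) : out;
}

console.log(stripNested('div', '<div>a</div>keep<div>b</div>')); // prints "keep"
```

The regex here only locates individual tags; the nesting logic lives in the depth counter, which is the part a single JavaScript regex cannot express.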

BTW, if possible, always use a DOM parser or JavaScript library, as recommended by Peter Boughton.

Regards

rbo
