I need to remove JavaScript tags using regular expressions and JRegex

I need to remove all JavaScript tags, together with the content between them, as well as style tags, from the HTML code of web pages. So far I've come up with this expression:

"(<[ \r\n\t]*script([ \r\n\t>]|>){1,}([ \r\n\t]|.)*?</[ \r\n\t]*script[ \r\n\t]*>)|(<[ \r\n\t]*noscript([ \r\n\t>]|>){1,}([ \r\n\t]|.)*?</[ \r\n\t]*noscript[ \r\n\t]*>)|(<[ \r\n\t]*style([ \r\n\t>]|>){1,}([ \r\n\t]|.)*?</[ \r\n\t]*style[ \r\n\t]*>)"

I use the JRegex library to work with regular expressions. When I test the expression in any regex tester it works just fine, but once I run my program it crashes with this error report:

Exception in thread "Thread-0" java.lang.StackOverflowError
    at java.util.regex.Pattern$BranchConn.match(Unknown Source)
    at java.util.regex.Pattern$BmpCharProperty.match(Unknown Source)
    at java.util.regex.Pattern$Branch.match(Unknown Source)
    at java.util.regex.Pattern$GroupHead.match(Unknown Source)
    at java.util.regex.Pattern$LazyLoop.match(Unknown Source)
    at java.util.regex.Pattern$GroupTail.match(Unknown Source)
    at java.util.regex.Pattern$BranchConn.match(Unknown Source)
    at java.util.regex.Pattern$CharProperty.match(Unknown Source)
    at java.util.regex.Pattern$Branch.match(Unknown Source)
    at java.util.regex.Pattern$GroupHead.match(Unknown Source)
    at java.util.regex.Pattern$LazyLoop.match(Unknown Source)
..................................

And the trace keeps going like that. If anyone can give me advice on this one, I'll be very grateful.


Why not use an HTML parser and just remove the <script> and <style> nodes?


I was able to resolve this problem. I remove the script tags and the content between them using this regular expression:

@"<(script|SCRIPT)[^>]*?>[\s\S]*?</(script|SCRIPT)>"
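
For the Java side of the question, here is a minimal sketch of the same idea. It uses java.util.regex because that is what the stack trace shows, not JRegex, and the names `ScriptStripper` and `strip` are made up for illustration. The likely trigger for the StackOverflowError is the `([ \r\n\t]|.)*?` group: the `LazyLoop`, `GroupHead` and `Branch` frames in the trace are the engine recursing once per matched character. Letting `.` match line breaks via the `(?s)` flag removes the need for that group.

```java
import java.util.regex.Pattern;

public class ScriptStripper {
    // Matches <script>, <noscript> and <style> elements together with their content.
    // (?i) ignores case; (?s) lets '.' match line breaks, so no (x|.)*? group is needed.
    // The backreference \1 requires the closing tag name to match the opening one.
    private static final Pattern BLOCKS = Pattern.compile(
            "(?is)<\\s*(script|noscript|style)\\b[^>]*>.*?</\\s*\\1\\s*>");

    // Returns the input with every matched block removed.
    public static String strip(String html) {
        return BLOCKS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><SCRIPT type=\"text/javascript\">\n"
                + "if (1 > 0) {}\n</script><style>p{}</style><p>b</p>";
        System.out.println(strip(html)); // prints <p>a</p><p>b</p>
    }
}
```

This still treats HTML as plain text, so a `</script>` inside a string literal or a comment would end the match early. The parser approach suggested above remains the more robust choice.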