Use regular expressions to remove HTML tags in Flex/AS3

2023-01-17 10:57

I'm writing an HTML parser in Flex (AS3) and I need to remove some HTML tags that are not needed.

For example, I want to remove the divs from this code:

           <div>
              <div>
                <div>
                  <div>
                    <div>
                      <div>
                        <div>
                          <p style="padding-left: 18px; padding-right: 20px; text-align: center;">
                            <span></span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: bold; text-decoration: none; font-family: Arial;">20% OFF.</span>
                            <span> </span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: normal; text-decoration: none; font-family: Arial;">Do it NOW!</span>
                            <span> </span>
                          </p>
                        </div>
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>

and end up with something like this:

                      <div>
                          <p style="padding-left: 18px; padding-right: 20px; text-align: center;">
                            <span></span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: bold; text-decoration: none; font-family: Arial;">20% OFF.</span>
                            <span> </span>
                            <span style=" font-size: 48px; color: #666666; font-style: normal; font-weight: normal; text-decoration: none; font-family: Arial;">Do it NOW!</span>
                            <span> </span>
                          </p>
                        </div>

My question is, how can I write a regular expression to remove these unwanted DIVs? Is there a better way to do it?

Thanks in advance.

You can't match arbitrarily nested constructs with a regular expression because nesting means irregularity. A parser (which you are writing) is the correct tool for this.
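
To make the nesting point concrete, here is a minimal sketch in TypeScript (not AS3, though the regex literal syntax is the same in both). A regex can find the individual tags, but knowing how deep you are needs a counter that lives outside the pattern. `maxDivDepth` is a hypothetical helper name, and this toy tokenizer ignores comments, CDATA and self-closing tags.

```typescript
// Sketch only: the regex finds tags, the counter tracks nesting.
function maxDivDepth(html: string): number {
    const tag: RegExp = /<(\/?)div\b[^>]*>/g;
    let depth = 0;
    let deepest = 0;
    let m: RegExpExecArray | null;
    while ((m = tag.exec(html)) !== null) {
        if (m[1] === "/") {
            depth--;
            if (depth < 0) {
                throw new Error("closing </div> without an opening <div>");
            }
        } else {
            depth++;
            deepest = Math.max(deepest, depth);
        }
    }
    if (depth !== 0) {
        throw new Error("unbalanced <div> tags");
    }
    return deepest;
}
```

The state carried in `depth` is the part a single pattern cannot express for arbitrary nesting, and it is the first thing a real parser gives you.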

Now in this very special case, you could do a

result = subject.replace(/^\s*(<\/?div>)(?:\s*\1)*(?=\s*\1)/mg, "");

(which would simply remove every run of directly consecutive <div> or </div> tags except the last tag in each run), but this is bad in so many ways that I'm afraid it will get me downvoted into oblivion.

To explain:

^           # match start of line
\s*         # match leading whitespace
(<\/?div>)  # match a <div> or </div>, remember which
(?:\s*\1)*  # match any further repetitions of that same tag
(?=\s*\1)   # as long as there is another one right ahead

Can you count the ways in which this will fail? (Think comments, unmatched <div>s etc.)
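
To see both the happy path and one of those failures, here is a small TypeScript sketch that runs the same pattern. The sample strings and the helper names `collapseDivs` and `countOccurrences` are made up for illustration. The second sample shows a case the list above does not mention: a div carrying an attribute is not matched by `<\/?div>`, so only the closing tags get collapsed.

```typescript
// The pattern from the answer, unchanged.
const COLLAPSE_DIVS: RegExp = /^\s*(<\/?div>)(?:\s*\1)*(?=\s*\1)/mg;

function collapseDivs(subject: string): string {
    return subject.replace(COLLAPSE_DIVS, "");
}

function countOccurrences(haystack: string, needle: string): number {
    return haystack.split(needle).length - 1;
}

// Hypothetical sample 1: three bare wrappers around one paragraph.
const nested: string = [
    "<div>",
    "  <div>",
    "    <div>",
    "      <p>20% OFF.</p>",
    "    </div>",
    "  </div>",
    "</div>",
].join("\n");
const collapsed: string = collapseDivs(nested);
// collapsed keeps exactly one <div> and one </div>.

// Hypothetical sample 2: the inner div has an attribute.
const withAttribute: string = [
    "<div>",
    "  <div class=\"promo\">",
    "    <p>20% OFF.</p>",
    "  </div>",
    "</div>",
].join("\n");
const broken: string = collapseDivs(withAttribute);
// broken still has two opening div tags but only one </div>.
```

In the second case the replacement leaves the markup unbalanced without any error, which is the kind of breakage that makes this approach risky on real input.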

Assuming that your target HTML is actually valid XML, you can use a recursive function to drag out the non-div bits.

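// Returns the children of xml with every <div> wrapper flattened away,
// however deeply the divs are nested.
static function grabNonDivContents(xml:XML):XMLList {
    var out:XMLList = new XMLList();
    var kids:XMLList = xml.children();
    for each (var kid:XML in kids) {
        // name() is null for text nodes, so test it before comparing.
        if (kid.name() && kid.name() == "div") {
            // Replace the div with its own flattened contents.
            var grandkids:XMLList = grabNonDivContents(kid);
            for each (var grandkid:XML in grandkids) {
                out += grandkid;
            }
        } else {
            // Keep every non-div node (and its whole subtree) as-is.
            out += kid;
        }
    }
    return out;
}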
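
For readers without an E4X runtime, the same recursion can be sketched in TypeScript. `HtmlNode` is a hypothetical minimal node type standing in for AS3's `XML` class, not a real library type, and the sample tree is invented.

```typescript
// Hypothetical stand-in for an XML node; name is null for text nodes.
interface HtmlNode {
    name: string | null;
    text?: string;
    children: HtmlNode[];
}

// Same idea as the AS3 version: divs are replaced by their flattened
// contents, every other node is kept with its subtree untouched.
function grabNonDivContents(node: HtmlNode): HtmlNode[] {
    let out: HtmlNode[] = [];
    for (const kid of node.children) {
        if (kid.name === "div") {
            out = out.concat(grabNonDivContents(kid));
        } else {
            out.push(kid);
        }
    }
    return out;
}

// Three nested wrappers around a single paragraph.
const paragraph: HtmlNode = {
    name: "p",
    children: [{ name: null, text: "20% OFF.", children: [] }],
};
const root: HtmlNode = {
    name: "div",
    children: [{ name: "div", children: [{ name: "div", children: [paragraph] }] }],
};

const flattened: HtmlNode[] = grabNonDivContents(root);
// Put the survivors back under a single div, as the question asks.
const wrapped: HtmlNode = { name: "div", children: flattened };
```

Note that the recursion only descends through divs, so a div nested inside a kept node such as a `<p>` would survive. Whether that matters depends on the input.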

In my experience, parsing complex HTML with regexes alone is hell. The regexes quickly get out of hand. It is much more robust to extract the pieces of information you need (maybe with simple regexes) and assemble them back into a simpler document.
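
As a rough illustration of that extract-and-reassemble idea, here is a TypeScript sketch that pulls out the `<p>` blocks with one simple regex and wraps them in a single div. `rebuildWithSingleDiv` is a hypothetical name, and the sketch assumes paragraphs are never nested and that `</p>` never appears inside a comment or an attribute value.

```typescript
// Sketch under the assumptions above: keep only the <p>...</p> blocks
// and rebuild a simpler document around them.
function rebuildWithSingleDiv(html: string): string {
    // \b stops the pattern from also matching tags such as <pre>.
    const paragraphs: string[] = html.match(/<p\b[^>]*>[\s\S]*?<\/p>/g) || [];
    return "<div>\n" + paragraphs.join("\n") + "\n</div>";
}

const messy: string =
    "<div>\n<div>\n<p>a</p>\n<div><p style=\"x\">b</p></div>\n</div>\n</div>";
const simple: string = rebuildWithSingleDiv(messy);
```

The regex here only has to recognise one flat construct, and the nesting is handled by throwing the old wrappers away instead of trying to match them.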

Tags: actionscript-3 apache-flex regex
