Need help with modifying a function (regex)

2023-02-16 04:20

I am using the parse_array function from the book Webbots, Spiders, and Screen Scrapers for my parsing needs. However, I need to modify this function a little, and I don't know how to.

The function:

    function parse_array($string, $beg_tag, $close_tag)
    {
        // s: dot also matches newlines, i: case-insensitive, U: ungreedy quantifiers
        preg_match_all("($beg_tag(.*)$close_tag)siU", $string, $matching_data);
        return $matching_data[0];
    }

How it works:

    $html="<div>
           afterfirst
            <div>nested</div>
           this is lost
           </div>
           <div>div2</div>" ;

    $div_array = parse_array($html, "<div", "</div>");
    echo $div_array[0] . "<br>";
    // outputs:
    // <div>
    // afterfirst
    // <div>nested</div>
    // The line "this is lost" and the last </div> aren't included.

Basically, the function can't deal with nested tags.

Is it possible to change the function so that it can deal with nested tags? That is, instead of stopping at the next closing tag, it would keep track of any nested tags and stop only at the correct closing tag.

Any help?

Thanks

Edit: I know regex isn't recommended for parsing, and that there are PHP DOM and simplehtmldom, but this parse_array function works great, and if only it could deal with nested tags it would be perfect! So any help with this would be greatly appreciated. Please give me some kind of hint, if not a full solution.

Regexes, at least the plain kind used in parse_array, don't count and keep track of things like that. This problem of nested tags is exactly why parsing HTML with regex isn't recommended: it quickly becomes impossible. A parser may be more work, but it's much more reliable.
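To make the "keeping track" part concrete, here is a rough sketch of that bookkeeping done by hand. The regex is only used to find tags, and a depth counter decides where each block ends. Note that parse_array_nested is a made-up name, not something from the book, and it takes a bare tag name rather than begin/close strings. It also assumes reasonably tidy markup (no unclosed tags, no tags hidden inside comments or scripts), so treat it as a hint rather than a replacement for a parser.

```php
<?php
// Hypothetical helper (not from the book): returns each outermost <tag>...</tag>
// block, counting nested openings so a block ends at its own closing tag.
function parse_array_nested($string, $tag)
{
    // Find every opening or closing $tag, together with its byte offset.
    $pattern = '~<(/?)' . preg_quote($tag, '~') . '\b[^>]*>~i';
    preg_match_all($pattern, $string, $tokens, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $results = [];
    $depth   = 0;
    $start   = 0;
    foreach ($tokens as $token) {
        [$text, $offset] = $token[0];
        $is_closing = ($token[1][0] === '/');

        if (!$is_closing) {
            if ($depth === 0) {
                $start = $offset;      // outermost opening tag starts a block
            }
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
            if ($depth === 0) {        // closing tag that balances the outermost one
                $results[] = substr($string, $start, $offset + strlen($text) - $start);
            }
        }
    }
    return $results;
}

$html = "<div>
       afterfirst
        <div>nested</div>
       this is lost
       </div>
       <div>div2</div>";

$divs = parse_array_nested($html, 'div');
echo $divs[0], "\n";   // the whole first block, nested div included
```

With the HTML from the question this yields two entries: the full outer div (including "this is lost" and its closing tag) and the div2 block.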

There is one thing you could try though, which is removing the U (ungreedy) flag at the end of your regex. Being 'ungreedy' means it will match up to the first </div> tag it comes to, whereas in the default 'greedy' mode it will match up to the last one instead. That may or may not work for your specific situation, depending on your HTML, but it's worth a try at least. It doesn't solve the problem of parsing nested tags with regex in general though, so if that doesn't work you're going to have to use a parser instead.
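A quick way to see the difference is to run both variants on a cut-down version of the question's HTML. This is only an illustration: the pattern is the one parse_array builds for "<div" and "</div>", written with ~ delimiters, and the variable names are mine.

```php
<?php
$html = "<div>afterfirst <div>nested</div> this is lost </div> <div>div2</div>";

// With U (ungreedy): each match stops at the first </div> it reaches.
preg_match_all('~<div(.*)</div>~siU', $html, $ungreedy);

// Without U (greedy): the match runs on to the last </div> in the string.
preg_match_all('~<div(.*)</div>~si', $html, $greedy);

echo $ungreedy[0][0], "\n";  // <div>afterfirst <div>nested</div>
echo $greedy[0][0], "\n";    // the entire string, div2 included
```

So the greedy version does pick up "this is lost", but it also swallows the unrelated div2 block into the same match, which is why it only helps when there is a single outer block.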

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Solution:

Simple DOM HTML Parser
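
Simple HTML DOM is a third-party library, so for comparison here is a minimal sketch of the same idea using PHP's built-in DOM extension (the "php DOM" mentioned in the question) rather than Simple HTML DOM itself. It assumes the DOM extension is enabled; the sample HTML and variable names are only for illustration.

```php
<?php
$html = "<div>afterfirst <div>nested</div> this is lost </div><div>div2</div>";

$doc = new DOMDocument();
// The @ suppresses the warnings libxml raises on sloppy real-world markup.
@$doc->loadHTML($html);

$divs = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    // saveHTML($node) serialises that element along with everything nested in it.
    $divs[] = $doc->saveHTML($div);
}

echo $divs[0], "\n";  // the outer div, with the nested div and "this is lost"
```

getElementsByTagName returns every div in document order, nested ones included, so here $divs has three entries: the outer div, the nested div, and div2.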
