Need help with modifying a function (regex)
I am using the parse_array function from the book Webbots, Spiders, and Screen Scrapers for my parsing needs. However, I need to modify this function a little, and I don't know how.
The function:
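function parse_array($string, $beg_tag, $close_tag)
{
    // Flags: s = dot also matches newlines, i = case-insensitive, U = ungreedy.
    preg_match_all("($beg_tag(.*)$close_tag)siU", $string, $matching_data);
    // Index 0 holds the full matches, including the opening and closing tags.
    return $matching_data[0];
}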
How it works:
$html="<div>
afterfirst
<div>nested</div>
this is lost
</div>
<div>div2</div>" ;
$div_array = parse_array($html,"<div", "</div>") ;
echo $div_array[0] . "<br>";
// outputs:
// <div>
// afterfirst
// <div>nested</div>
// The line "this is lost" and the last </div> aren't included.
Basically, the function can't deal with nested tags.
Is it possible to change the function so that it can deal with nested tags? That is, instead of stopping at the next closing tag, it would keep track of any nested tags and stop only at the matching closing tag.
Any help?
Thanks
Edit: I know regex isn't recommended for parsing, and that there are PHP DOM and simplehtmldom, but this parse_array function works great, and if only it could deal with nested tags, it would be perfect! So any help with this would be greatly appreciated. Please give me some kind of hint, if not a full solution.
Regexes aren't built to count and keep track of things like that. This problem of nested tags is exactly why it's not recommended to parse HTML with regex, as it quickly becomes unmanageable. A parser may be more work, but it's much, much more reliable.
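To make the "counting" concrete, here is a minimal sketch of what tracking nesting looks like when it is done in PHP code rather than inside the pattern. The name parse_array_nested is hypothetical (it is not from the book). It assumes $open_tag is a tag prefix ending in a letter, such as "<div", and that the markup has no comments, scripts, or attribute values containing tag-like text. Real-world HTML breaks those assumptions easily, which is the argument for a parser.

```php
<?php
// Hypothetical helper, not from the book: same arguments as parse_array(),
// but it counts nesting depth instead of stopping at the first closing tag.
// Assumes $open_tag is a prefix ending in a letter (e.g. "<div") and that the
// markup has no comments, scripts, or attribute values containing tag text.
function parse_array_nested($string, $open_tag, $close_tag)
{
    // Find every opening or closing tag together with its byte offset.
    // \b keeps "<div" from also matching longer names such as "<divider".
    $pattern = '~' . preg_quote($open_tag, '~') . '\b|'
             . preg_quote($close_tag, '~') . '~i';
    preg_match_all($pattern, $string, $tokens, PREG_OFFSET_CAPTURE);

    $results = [];
    $depth = 0;   // number of tags currently open
    $start = 0;   // offset of the outermost opening tag

    foreach ($tokens[0] as $token) {
        list($text, $pos) = $token;
        if (strcasecmp($text, $close_tag) === 0) {
            if ($depth === 0) {
                continue; // stray closing tag, ignore it
            }
            $depth--;
            if ($depth === 0) {
                // Matching close of the outermost tag: keep the whole span.
                $end = $pos + strlen($close_tag);
                $results[] = substr($string, $start, $end - $start);
            }
        } else {
            if ($depth === 0) {
                $start = $pos; // remember where the outermost tag began
            }
            $depth++;
        }
    }
    return $results;
}

$html = "<div>\nafterfirst\n<div>nested</div>\nthis is lost\n</div>\n<div>div2</div>";
$div_array = parse_array_nested($html, "<div", "</div>");
echo $div_array[0], "\n";
```

On the sample from the question this returns two entries, the whole outer div (including "this is lost") and the div2 block. Nested divs are only returned as part of their outermost parent.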
There is one thing you could try though, which is removing the U (ungreedy) flag at the end of your regex. Being 'ungreedy' means it will match the first </div> tag it comes to, whereas in the default 'greedy' mode it will match the last one instead. That may or may not work for your specific situation depending on your HTML, but it's worth a try at least. It doesn't solve the problem of trying to parse nested tags with regex in general though, so if that doesn't work you're going to have to use a parser instead.
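A quick way to see what dropping the flag does is to run it against the sample from the question. parse_array_greedy is a hypothetical name for the modified copy; the only change from the original is the missing U.

```php
<?php
// Hypothetical variant of parse_array() with the U (ungreedy) flag removed.
function parse_array_greedy($string, $beg_tag, $close_tag)
{
    preg_match_all("($beg_tag(.*)$close_tag)si", $string, $matching_data);
    return $matching_data[0];
}

$html = "<div>\nafterfirst\n<div>nested</div>\nthis is lost\n</div>\n<div>div2</div>";
$div_array = parse_array_greedy($html, "<div", "</div>");

echo count($div_array), "\n"; // number of matches found
echo $div_array[0], "\n";     // the first (here: only) match
```

On this input there is a single match running from the first <div to the final </div>, so "this is lost" is kept, but the separate div2 block is swallowed into the same match. That is fine if the page has one outer block you care about and not much use if you need each top-level div separately.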
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Solution:
Simple DOM HTML Parser
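For comparison, here is a rough sketch of the parser route. It uses PHP's bundled DOM extension (the "php DOM" mentioned in the question) rather than Simple HTML DOM Parser, so that it runs without a third-party download. The variable names are just for illustration.

```php
<?php
// Sketch using PHP's bundled DOM extension, not Simple HTML DOM Parser.
$html = "<div>\nafterfirst\n<div>nested</div>\nthis is lost\n</div>\n<div>div2</div>";

$doc = new DOMDocument();
// Collect libxml warnings about sloppy HTML instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$div_array = [];
foreach ($doc->getElementsByTagName('div') as $div) {
    // saveHTML($node) serializes just that element, nested children included.
    $div_array[] = $doc->saveHTML($div);
}

echo $div_array[0], "\n";
```

getElementsByTagName returns every div in document order, so the nested one shows up as its own entry as well as inside its parent. The parser does the tag matching itself, which is why nothing needs to be counted by hand.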