How to match anything except a pattern between two tags
I am attempting to match a string which is composed of HTML. Basically it is an image gallery, so there is a lot of similarity in the string. There are a lot of <dl> tags in the string, but I am looking to match the last <dl>(.?)+</dl> combo that comes before a </div>.
The way I've devised to do this is to make sure that there aren't any <dl's inside the <dl></dl> combo I'm matching. I don't care what else is there, including other tags and line breaks.
I decided I had to do it with regular expressions because I can't predict how long this substring will be or anything that's inside it.
Here is my current regex, which only returns an array with two NULL indices:
preg_match_all('/<dl((?!<dl).)+<\/dl>(?=<\/div>)/', $foo, $bar)
As you can see, I use negative lookahead to try to see if there is another <dl> within this one. I've also tried negative lookbehind here with the same results, and I've tried using +? instead of just +, to no avail. Keep in mind that the string contains no nested pattern like <dl><dl></dl>; the problem is that my regex either matches from the first <dl> to the last </dl> or matches nothing at all.
Now, I realize . won't match line breaks, but I've tried everything I could imagine there and it still gives me either the NULL indices or nearly the whole string (from the very first occurrence of <dl to </dl></div>, which includes several other occurrences of <dl>, exactly what I didn't want). I honestly don't know what I'm doing incorrectly.
Thanks for your help! I've spent over an hour just trying to straighten out this one problem and it's about driven me to pulling my hair out.
Don't use regular expressions for irregular languages like HTML. Use a parser instead. It will save you a lot of time and pain.
I would suggest using tidy instead. You can easily extract all the desired tags with their contents, even from broken HTML.
In general I would not recommend writing a parser using regex.
See http://www.php.net/tidy
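To illustrate the parser route, here is a rough sketch. Note that it swaps in PHP's DOMDocument and DOMXPath rather than tidy, on the assumption that the dom extension is available. The sample markup is made up, and "the last <dl> before a </div>" is read here as "a <dl> child of a <div> with no element after it", which may or may not fit the real gallery HTML.

```php
<?php
// Hypothetical sample markup standing in for the gallery HTML ($foo in the question).
$foo = '<div><dl><dt>first</dt></dl><dl><dt>second</dt></dl><dl><dt>last</dt></dl></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($foo);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select each <dl> that is a child of a <div> and has no element sibling after it,
// i.e. the <dl> that sits directly before the closing </div>.
$nodes = $xpath->query('//div/dl[not(following-sibling::*)]');

$bar = [];
foreach ($nodes as $node) {
    // saveHTML($node) serializes just that node, including its own tags.
    $bar[] = $doc->saveHTML($node);
}

print_r($bar);
```

Because the query works on the parsed tree, line breaks and other tags inside the <dl> don't need any special handling.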
As crazy as it is, about 2 minutes after I posted this question, I found a way that worked.
preg_match_all('/<dl([^\z](?!<dl))+?<\/dl>(?=<\/div>)/', $foo, $bar);
The [^\z] craziness is just a way I used to say "match all characters, even line breaks".
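For anyone who does stay with regex, here is a possible variation on the same idea, as a sketch only. The s modifier (PCRE_DOTALL) is the documented way to make . match line breaks, so it may be a cleaner alternative to the [^\z] trick. The sample markup is hypothetical, and the \b and \s* additions are my own assumptions about the input (tags like <dl class="..."> and whitespace between </dl> and </div>).

```php
<?php
// Hypothetical sample markup standing in for the gallery HTML ($foo in the question).
$foo = <<<HTML
<div class="gallery">
<dl>
  <dt>first</dt>
</dl>
<dl>
  <dt>second</dt>
</dl>
<dl>
  <dt>last</dt>
</dl>
</div>
HTML;

// (?:(?!<dl\b).)*? consumes one character at a time and refuses to step onto
// the start of another <dl, so a match cannot span two lists.
// The s modifier (PCRE_DOTALL) lets . match line breaks.
// \s* in the lookahead tolerates whitespace between </dl> and </div>.
$pattern = '/<dl\b(?:(?!<dl\b).)*?<\/dl>(?=\s*<\/div>)/s';

preg_match_all($pattern, $foo, $bar);

// $bar[0] holds the full matches.
print_r($bar[0]);
```

Checking the lookahead before each character is consumed, rather than after it, means the very first character after <dl is tested too.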