Regex and PHP for extracting contents between tags with several line breaks
How can I extract the content between tags with several line breaks?
I'm new to regex and would like to know how to handle an unknown number of line breaks in my match.
Task: Extract the content between <div class="test"> and the first closing </div> tag.
Original source:
<div class="test">optional text<br/>
content<br/>
<br/>
content<br/>
...
content<br/><a href="/url/">Hyperlink</a></div></div></div>
I've worked out the regex below:
/<div class=\"test\">(.*?)<br\/>(.*?)<\/div>/
I just wonder how to match several line breaks using regex.
I know the DOM is an option, but I am not familiar with it.
You should not parse (x)html with regular expressions. Use DOM.
I'm a beginner in xpath, but one like this should work:
//div[@class='test']
This selects all divs with the class 'test'. You will need to load your HTML into a DOMDocument object, then create a DOMXPath object for that document and call its query() method to get the results. It returns a DOMNodeList object.
Final code looks something like this:
$domd = new DOMDocument();
$domd->loadHTML($your_html_code);
$domx = new DOMXPath($domd);
$items = $domx->query("//div[@class='test']");
After this, your div is in $items->item(0).
This is untested code, but if I remember correctly, it should work.
Update: I forgot that you need the content.
If you need the text content (no tags), you can simply read $items->item(0)->textContent. If you also need the tags, here's the equivalent of JavaScript's innerHTML for PHP DOM:
function innerHTML($node) {
    $doc = new DOMDocument();
    foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
    }
    return $doc->saveHTML();
}
Call it with $items->item(0) as the parameter.
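Putting the steps above together, here is a minimal end-to-end sketch. The sample markup is modelled on the question, and the two outer wrapper divs are an assumption made only so the string is balanced. It also swaps the scratch-document helper for DOMDocument::saveHTML($node), which serialises a single node (available since PHP 5.3.6), and uses libxml_use_internal_errors() to keep parser warnings about imperfect markup quiet. Treat it as an illustration, not as the only way to do it.

```php
<?php
// Sample markup modelled on the question; the outer wrapper divs are assumed.
$html = '<div><div><div class="test">optional text<br/>
content<br/>
<br/>
content<br/><a href="/url/">Hyperlink</a></div></div></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = $xpath->query("//div[@class='test']");

$textOnly = '';
$withTags = '';
if ($items->length > 0) {
    $node = $items->item(0);
    // Text of the div and its descendants, with all tags dropped.
    $textOnly = $node->textContent;
    // Serialise each child node to rebuild the inner markup.
    foreach ($node->childNodes as $child) {
        $withTags .= $doc->saveHTML($child);
    }
}

echo $withTags, "\n";
```

Note that the serialised markup is normalised by libxml, so it may not be byte-for-byte identical to the input (for example, how the br tags are written).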
You could use preg_match_all('/<div class="test">(.*?)<\/div>/si', $html, $matches);. The s modifier makes the dot match line breaks as well, so the number of line breaks no longer matters. But remember that this stops at the first closing </div> within the HTML, i.e. if the HTML looks like <div class="test">...aaa...<div>...bbb...</div>...ccc...</div> then you would get ...aaa...<div>...bbb... as the result in $matches.
So in the end, using a DOM parser would indeed be a better solution.
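For completeness, the regex behaviour described in this answer can be checked with a short sketch. The first input is the question's sample with the elided "..." line left out; the second is the nested example from this answer. The variable names are only illustrative.

```php
<?php
// The question's sample, written with explicit \n line breaks.
$html = "<div class=\"test\">optional text<br/>\ncontent<br/>\n<br/>\n"
      . "content<br/><a href=\"/url/\">Hyperlink</a></div></div></div>";

// s (PCRE_DOTALL) lets . match newlines; the lazy .*? stops at the first </div>.
$pattern = '/<div class="test">(.*?)<\/div>/si';

$inner = null;
if (preg_match($pattern, $html, $m)) {
    $inner = $m[1]; // everything up to, but not including, the first </div>
}

// Nested case: the capture ends at the inner </div>, losing "...ccc...".
$nested = '<div class="test">...aaa...<div>...bbb...</div>...ccc...</div>';
preg_match($pattern, $nested, $n);

echo $n[1], "\n"; // prints ...aaa...<div>...bbb...
```

Without the s modifier the same pattern would fail on the first input, because . would not cross the line breaks.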