Get a div and its correct closing tag with preg

preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it.

What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs inside it, and my pattern keeps closing on the first closing tag it finds.

Here is my actual code:

$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1'); 
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);

preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);

This has been updated thanks to TomcatExodus's help.

Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6


<?php

$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1'); 
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");

if ($div) {
  $dom2 = new DOMDocument();
  $dom2->appendChild($dom2->importNode($div, true));
  echo $dom2->saveHTML();
} else {
  echo "Has no element with the given ID\n";
}


Using regular expressions to parse markup documents often leads to problems.

Here is an XPath version, which is independent of the source layout. The only thing you need is a div with that id.

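<?php
// $url is the page to scrape, e.g. the $scrape_address from the question
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);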
$xp = new domxpath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);

if($result->length){
    $out =new DOMDocument();
    $out->appendChild($out->importNode($div, true));
    echo $out->saveHTML();
}else{
    echo "No such id";
}
?>

And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD that declares which attribute is the ID. After all, you can always build a document that uses "apple" as the record identifier, so something has to say that "id" really is the ID attribute for this tag.

<?php
// $data is the HTML fetched with cURL, as in the question
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);

//this doesn't work as the DTD is not specified
//or the declared ID attribute is not the attribute called "id"

//$div = $domd->getElementById("torrent_details");

/*
 * workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
 * set the "id" attribute as the real id
 */
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
  foreach ($elements as $element) {
    //try-catch needed because of elements with no id
    try {
      $element->setIdAttribute('id', true);
    } catch (Exception $e) {}
  }
}

//now it works
$div = $domd->getElementById("torrent_details");

//Print its content or error
if ($div) {
  $dom2 = new DOMDocument();
  $dom2->appendChild($dom2->importNode($div, true));
  echo $dom2->saveHTML();
} else {
  echo "Has no element with the given ID\n";
}

?>

Both of the solutions work for me.
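
One thing to keep in mind: both snippets print the div element itself together with its contents, while the question asked for the innerHTML only. A minimal sketch of one way to get just the contents, assuming PHP 5.3.6 or later (where saveHTML() accepts a node argument). The sample markup and the inner_html() helper are made up for illustration and are not part of either solution above.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = '<html><body><div id="torrent_details"><p>Name</p>'
      . '<div class="inner">Files</div></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// XPath lookup, so the result does not depend on getElementById().
$xp  = new DOMXPath($dom);
$div = $xp->query("//div[@id='torrent_details']")->item(0);

// Hypothetical helper: serialize every child of $node and join the pieces.
function inner_html(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$inner = $div ? inner_html($div) : '';
echo $inner, "\n";
```

The nested div stays intact because the DOM parser has already paired up the opening and closing tags; the helper only walks the children of the element that was found.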


You can do this: /<div[^>]*>(.*)<\/div>/i

Which would give you the largest possible innerHTML.
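
To see what "largest possible" means on nested markup, here is a small sketch comparing a greedy and a lazy quantifier. The HTML string and the id "a" are made up for the example; the patterns are illustrative and not the exact one above.

```php
<?php
// Made-up markup: the target div contains a nested div and is followed by a sibling div.
$html = '<div id="a">one<div>two</div>three</div><div id="b">four</div>';

// Lazy quantifier: stops at the first </div>, cutting the nested div short.
preg_match('#<div id="a">(.*?)</div>#is', $html, $lazy);

// Greedy quantifier: runs to the last </div> in the input, swallowing the sibling div.
preg_match('#<div id="a">(.*)</div>#is', $html, $greedy);

echo $lazy[1], "\n";
echo $greedy[1], "\n";
```

Neither capture equals the real contents of the div (one<div>two</div>three): the lazy match ends too early and the greedy match ends too late, because the pattern has no way of counting how many divs are currently open.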


You cannot. I will not link to the famous question, because I dislike the pointless drivel on top, but regular expressions are still unfit to match nested structures.

You can use some trickery, but this is neither reliable, nor necessarily fast:

preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims', $html, $match);

Your regex had a problem due to the /x flag: with it, literal whitespace in the pattern is ignored, so the space in <div id= was dropped and the opening div never matched. And you used a wrong assertion notation.
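
The /x behaviour is easy to check in isolation. A minimal sketch, assuming a one-line made-up HTML string in place of the scraped page: the first pattern is the one from the question, the second writes the whitespace as \s+ so that it survives the /x flag.

```php
<?php
// Made-up input standing in for the scraped page.
$html = '<div id="torrent_details">content</div>';

// Under /x the literal space between "div" and "id" is removed from the
// pattern, which then looks for "<divid=..." and fails.
$broken = preg_match('% <div id="torrent_details">(.*)</div> %six', $html, $m1);

// Writing the whitespace as \s+ keeps it significant.
$fixed = preg_match('% <div \s+ id="torrent_details">(.*)</div> %six', $html, $m2);

var_dump($broken);
var_dump($fixed);
echo $m2[1], "\n";
```

This only addresses the opening tag. The greedy (.*) still has the nesting problem described above as soon as the page contains more than one closing div.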


preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'][0]; // preg_match_all returns an array of matches per group

That one will work, but you should only need preg_match, not preg_match_all: if the page is written well, there should be only one instance of id="torrent_details" on it.


I'm retracting my answer. This will not work properly. Use DOM for navigating the document.


Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just had to use SimpleXML:

$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1'); 
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);

$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);

print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);