
How to scrape web page data without losing tags

I am trying to scrape web data using PHP and DOMXPath. When I store $node->nodeValue in my database, or even just echo it, all the tags such as <p> and <br> are missing, so all the paragraphs come out concatenated. How can I solve this problem?


If you have a node, and you need all its contents as they are, you can use this function:

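// Returns the markup of everything inside $node, tags included.
function innerHTML(DOMNode $node)
{
  $doc = new DOMDocument();
  // Deep-copy each child into a fresh document so only the contents are serialised.
  foreach ($node->childNodes as $child) {
    $doc->appendChild($doc->importNode($child, true));
  }
  return $doc->saveHTML();
}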
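
For comparison, here is a minimal sketch of the same idea in context. It finds a node with DOMXPath, then serialises each child with DOMDocument::saveHTML($child) (the node argument is available from PHP 5.3.6) instead of copying children into a second document. The sample markup, the id "content", and the variable names are assumptions made up for illustration, not part of the question.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = '<div id="content"><p>First paragraph.</p><p>Second<br>paragraph.</p></div>';

$doc = new DOMDocument();
// Collect parser warnings internally; real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="content"]')->item(0);

// nodeValue yields only the text content, with all markup stripped.
$textOnly = $node->nodeValue;

// Serialising each child node keeps the tags.
$inner = '';
foreach ($node->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $textOnly, "\n";
echo $inner, "\n";
```

Storing $inner rather than $textOnly keeps the <p> and <br> elements, so the paragraphs stay separated when the data is displayed again.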


Once the page has been parsed into a DOM, there are no longer any tags to see. The tags have become nodes within the DOM, and the raw content contained in those tags is all you have access to in "string form". You can, of course, use the node information to reconstruct the tags, but they won't be the original tags (e.g., you will have to choose <BR> or <br>; you won't know which the site originally had). If you want the original tags from the get-go, keep the original stream of bytes returned by the GET/POST request you made; don't parse it into a DOM tree.
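
A small sketch of that difference, assuming the raw response body is already held in a string (the literal below is a made-up stand-in for what file_get_contents() or cURL would return). Markup rebuilt from the DOM is re-serialised by the parser, so details such as the case of the tag names need not match the original bytes.

```php
<?php
// Hypothetical raw response body; a real scraper would fetch this over HTTP.
$raw = '<div><P>one<BR>two</P></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($raw);
libxml_clear_errors();

// The HTML parser stores element names in lower case, so 'p' matches <P>.
$p = $doc->getElementsByTagName('p')->item(0);

// Markup reconstructed from the node, not the original bytes.
$rebuilt = $doc->saveHTML($p);

echo $raw, "\n";     // the markup exactly as it was received
echo $rebuilt, "\n"; // the markup as re-serialised from the DOM
```

If the exact original markup matters, storing $raw (or a slice of it) alongside the parsed data is the only way to keep it.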
