Extract doctype with simple_html_dom
I am using simple_html_dom to parse a website. Is there a way to extract the doctype?
You can use the file_get_contents function to fetch the raw HTML of the page and then match the doctype with a regular expression. For example:
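<?php
// Fetch the raw HTML; file_get_contents returns false on failure.
$html = file_get_contents("http://google.com");
// Match the first doctype declaration, e.g. <!DOCTYPE html> or a longer HTML 4/XHTML form.
// [^>]* also spans newlines, so the line breaks do not need to be stripped first.
$doctype = null;
if ($html !== false && preg_match('/<!DOCTYPE[^>]*>/i', $html, $matches)) {
    $doctype = $matches[0];
}
?>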
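The same idea can be wrapped in a small function that works on an HTML string already in memory, which makes it easy to try without a network request. This is only a sketch: the name extract_doctype is hypothetical (it is not part of simple_html_dom), and the pattern assumes the doctype contains no '>' before its end, so a doctype with an internal DTD subset would not be handled.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): returns the first
// doctype declaration found in $html, or null if there is none.
function extract_doctype(string $html): ?string
{
    // [^>]* matches across newlines, so no newline stripping is needed.
    if (preg_match('/<!DOCTYPE[^>]*>/i', $html, $matches)) {
        return $matches[0];
    }
    return null;
}

// Prints: <!DOCTYPE html>
echo extract_doctype("<!DOCTYPE html>\n<html><body></body></html>"), "\n";
```

Returning null rather than an empty string lets the caller tell "no doctype present" apart from other results with a strict comparison.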
You can use $html->find('unknown'). This works, at least, in version 1.11 of the simplehtmldom library. I use it as follows:
function get_doctype($doc)
{
    $els = $doc->find('unknown');
    foreach ($els as $e => $el) {
        if ($el->parent()->tag == 'root') {
            return $el;
        }
    }
    return NULL;
}
The loop is just there to handle any other 'unknown' elements which might be found; I'm assuming the first one directly under the root will be the doctype. You can explicitly inspect ->innertext if you want to ensure it starts with '!DOCTYPE ', though.
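The explicit check mentioned above could look something like the following. It is a sketch under assumptions: is_doctype_text is a hypothetical helper, and it accepts the text with or without a leading '<', since the exact form of the node text may depend on the library version.

```php
<?php
// Hypothetical helper: decides whether the text of an 'unknown' node
// looks like a doctype declaration. The leading '<' is optional here
// because the exact form of the node text is an assumption.
function is_doctype_text(string $text): bool
{
    return preg_match('/^\s*<?!DOCTYPE\s/i', $text) === 1;
}

var_dump(is_doctype_text('<!DOCTYPE html>'));   // bool(true)
var_dump(is_doctype_text('!DOCTYPE html'));     // bool(true)
var_dump(is_doctype_text('<![CDATA[ x ]]>'));   // bool(false)
```

In get_doctype, the condition inside the loop could then also require is_doctype_text($el->innertext) before returning the element.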