PHP DOMDocument - what's my "real" document URI?
I'm trying to do some HTML DOM parsing. The parsing I am doing is dependent on the URI of the page. The problem is that when I load an HTML file like in the following:
// Create HTML DOM
$dom_document = new DOMDocument();
@$dom_document->loadHTMLFile('http://www.google.com/');
I am sometimes redirected by the site (e.g. Google may redirect me to a country-specific domain). Questions:
- How do I prevent being redirected? I want to explicitly state which page I want to parse -- and not be sent to another page. I don't need to use DOMDocument.
- If there is no way to prevent being redirected, is there at least a way to find out which URI I was sent to?
EDIT 1:
function get_html_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // not good for 301 redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    // Check if any error occurred
    if (curl_errno($ch))
    {
        echo 'Curl error: ' . curl_error($ch);
        assert(FALSE);
        die();
    }
    curl_close($ch);
    return $data;
}
The answer is "yes" on both counts, but not using loadHTMLFile().
If you can, use curl. It provides much more detailed control over redirections.
Fetch the contents with it, and import them to your DOMDocument using loadHTML().
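As a rough sketch of that approach (not the only way to do it): the function names fetch_page() and html_to_dom() below are made up for illustration, as are the timeout and redirect limits. With CURLOPT_FOLLOWLOCATION off, cURL stays on the URL you asked for, and CURLINFO_REDIRECT_URL should report where the server wanted to send you. With it on, CURLINFO_EFFECTIVE_URL reports where you ended up.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and report how the request went.
function fetch_page(string $url, bool $follow_redirects = false): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,              // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => $follow_redirects, // false: stay on the requested URL
        CURLOPT_MAXREDIRS      => 5,                 // assumed limit, only used when following
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_ENCODING       => '',                // accept any encoding cURL supports
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return [
        'html'          => $html,
        'status'        => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        // Last URL actually requested (differs from $url only if redirects were followed).
        'effective_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        // Redirect target the server announced when redirects were NOT followed.
        'redirect_url'  => curl_getinfo($ch, CURLINFO_REDIRECT_URL),
    ];
}

// Hypothetical helper: parse an HTML string without spraying libxml warnings.
function html_to_dom(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings silently
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Example usage (performs a real HTTP request, so it is left commented out):
// $page = fetch_page('http://www.google.com/');
// if ($page['status'] >= 300 && $page['status'] < 400) {
//     echo 'Server wants to redirect to: ', $page['redirect_url'], "\n";
// } else {
//     $dom = html_to_dom($page['html']);
// }
```

Checking the status code before parsing matters because a redirect response often has an empty or near-empty body, which loadHTML() will not accept as a document.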
See e.g.
- cURL, get redirect url to a variable
- How do I CURL www.google.com - it keeps redirecting me to .co.uk