PHP DOMDocument - what's my "real" document URI?

I'm trying to do some HTML DOM parsing, and the parsing depends on the URI of the page. The problem is that when I load an HTML file like this:

// Create HTML DOM
$dom_document = new DOMDocument();
@$dom_document->loadHTMLFile('http://www.google.com/');

I am sometimes redirected by the site (e.g. Google may redirect me to a country-specific domain). Questions:

  1. How do I prevent being redirected? I want to explicitly state which page I want to parse -- and not be sent to another page. I don't need to use DOMDocument.
  2. If there is no way to prevent being redirected, is there at least a way to know which URI I was sent to?

EDIT 1:

function get_html_content($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    // Do not follow redirects: a 301/302 response is returned as-is,
    // so the body is the redirect page, not the target page.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);

    $data = curl_exec($ch);

    // Check if any error occurred
    if (curl_errno($ch))
    {
        echo 'Curl error: ' . curl_error($ch);
        assert(FALSE);
        die();
    }

    curl_close($ch);

    return $data;
}


The answer is "yes" on both counts, but not using loadHTMLFile().

If you can, use curl. It provides much more detailed control over redirections.

Fetch the contents with it, and import them to your DOMDocument using loadHTML().
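As a rough sketch of that approach (not a definitive implementation), the code below fetches the page with cURL, records where the response actually came from, and then builds the DOMDocument from the string. The function names `fetch_page` and `html_to_dom` and the returned array keys are made up for illustration. It assumes the cURL and DOM extensions are enabled and PHP 7 or later.

```php
<?php
// Hypothetical helper: fetch $url and report redirect information.
// With $follow_redirects = false the request stops at the first response,
// so you get exactly the page you asked for (or its 3xx redirect response).
function fetch_page(string $url, bool $follow_redirects = false): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_redirects);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);       // only relevant when following
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_ENCODING, '');       // accept any supported encoding

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return [
        'body'          => $body,
        'status'        => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        // Last URL actually requested; differs from $url if redirects were followed.
        'effective_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        // Where the server wanted to send us; filled only when NOT following redirects.
        'redirect_url'  => curl_getinfo($ch, CURLINFO_REDIRECT_URL),
    ];
}

// Hypothetical helper: parse an HTML string and remember which URI it came from.
function html_to_dom(string $html, string $uri): DOMDocument
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of using @
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $dom->documentURI = $uri;                     // documentURI is a writable property
    return $dom;
}

// Usage (needs network access, so it is left commented out):
// $page = fetch_page('http://www.google.com/', true);
// if ($page['body'] !== '') {
//     $dom = html_to_dom($page['body'], $page['effective_url']);
//     echo $dom->documentURI, "\n";
// }
```

With redirects disabled, a 3xx `status` together with a non-empty `redirect_url` tells you the site tried to send you elsewhere. With redirects enabled, `effective_url` is the URI the parsed markup really belongs to. Note that a redirect response may have an empty body, which `loadHTML()` rejects, hence the check in the usage example.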

See e.g.

  • cURL , get redirect url to a variable

  • How do I CURL www.google.com - it keeps redirecting me to .co.uk
