Saving files of unknown type with cURL in PHP 5.3.x
I'm trying to archive a web-based forum that has attachments posted by users. So far, I have used the PHP cURL library to fetch the individual topics and have been able to save the raw pages. However, I now need to figure out a way to archive the attachments hosted on the site.
Here is the problem: Since the file type is not consistent, I need to find a way to save the files with the correct extension. Note that I plan to rename the file when I save it so that it's organized in a way that it can be easily found later.
The link to the attached files in a page is in the format:
<a href="https://example.com/get_file?fileId=4342343212223">some file.txt</a>
I've already used preg_match() to get the URLs of the attached files. My biggest problem now is making sure each fetched file is saved in the correct format.
My question: Is there any way to get the file type efficiently? I'd rather not have to use a regular expression, but I'm not seeing any other way.
Does the server send the correct Content-Type header when serving the files? If so, you can read it by setting CURLOPT_HEADER, or by using file_get_contents() together with $http_response_header:
http://www.php.net/manual/en/reserved.variables.httpresponseheader.php
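A rough sketch of the Content-Type idea, assuming the server reports a usable type. Instead of parsing the raw header block that CURLOPT_HEADER returns, it reads the value through curl_getinfo() with CURLINFO_CONTENT_TYPE, which is a swapped-in shortcut. The function names and the $mimeToExt map are hypothetical and only cover a handful of types; the array() syntax is kept for PHP 5.3.

```php
<?php
// Hypothetical MIME-to-extension map; extend it for the attachment types the forum actually serves.
$mimeToExt = array(
    'text/plain'      => 'txt',
    'application/pdf' => 'pdf',
    'image/jpeg'      => 'jpg',
    'image/png'       => 'png',
    'application/zip' => 'zip',
);

// Turn a Content-Type value such as "text/plain; charset=UTF-8" into a file extension.
function extensionFromContentType($contentType, array $map, $fallback = 'bin')
{
    // Drop any parameters after ";" and normalise case and whitespace.
    $parts = explode(';', (string) $contentType);
    $mime  = strtolower(trim($parts[0]));

    return isset($map[$mime]) ? $map[$mime] : $fallback;
}

// Fetch a URL and return array(body, Content-Type as reported by the server).
function fetchWithContentType($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $body = curl_exec($ch);
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);

    return array($body, $type);
}
```

To use it, call list($body, $type) = fetchWithContentType($url); and then save $body under your own name plus extensionFromContentType($type, $mimeToExt). Types missing from the map fall back to "bin", so nothing is lost if the server sends something unexpected.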
I would look into the Fileinfo extension to see whether you can detect the file type automatically once you have the file:
http://www.php.net/manual/en/book.fileinfo.php
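A minimal sketch of the Fileinfo route, assuming the fileinfo extension is loaded (it is bundled with PHP 5.3). It inspects the downloaded bytes themselves, so it may help when the server only sends a generic type. The function name detectMimeType() is made up for illustration, and the detected type depends on the magic database shipped with your PHP build.

```php
<?php
// Guess the MIME type of data that has already been downloaded.
function detectMimeType($data)
{
    // FILEINFO_MIME_TYPE asks for just the type, e.g. "text/plain", without the charset.
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime  = $finfo->buffer($data);

    // Fall back to a generic binary type if detection fails.
    return $mime === false ? 'application/octet-stream' : $mime;
}
```

You would pass it the body returned by curl_exec() and then map the resulting MIME type to whatever extension you want to save under.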
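You can use DOMDocument and DOMXPath to extract the URLs and file names safely.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // forum HTML is rarely valid; keep parser warnings quiet
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
// Query examples:
// link text (the displayed file name)
foreach ($xpath->query('//a') as $node) {
    echo $node->nodeValue, "\n";
}
// href attribute (the download URL)
foreach ($xpath->query('//a/@href') as $node) {
    echo $node->nodeValue, "\n";
}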
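Building on that, here is a hedged sketch that pairs each attachment URL with the file name shown in the link and takes the extension from that name with pathinfo(). It assumes the link text really is the original file name, as in the "some file.txt" example from the question, and that attachment links contain "get_file?fileId=". The function name extractAttachments() is hypothetical.

```php
<?php
// Return url, displayed name, and extension for every attachment link in the HTML.
function extractAttachments($html)
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $found = array();

    // Only anchors whose href looks like an attachment download link.
    foreach ($xpath->query('//a[contains(@href, "get_file?fileId=")]') as $a) {
        $name = trim($a->nodeValue);
        $found[] = array(
            'url'       => $a->getAttribute('href'),
            'name'      => $name,
            // Empty string when the displayed name has no extension.
            'extension' => strtolower(pathinfo($name, PATHINFO_EXTENSION)),
        );
    }

    return $found;
}
```

If the extension comes back empty, you could fall back to the Content-Type header or Fileinfo for that particular file.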