Script to copy links from a lot of pages [closed]
I need to copy links from a lot of pages on the same site. The page URLs look like: /download.php?id=xxxxx, and I just need to add 1 to the id to get the next page. On each of those pages, I need to take a link from the source that looks like: href="http://www.site.com/xxxxxxxxxxxx" (where the x's vary).
Is it possible? Thanks
Do not use REGEX to parse HTML
Perhaps the biggest mistake people make when trying to get URLs or link text from a web page is trying to do it with regular expressions. The job can be done with regular expressions, but there is a high overhead in having preg loop over the entire document many times, and real-world HTML is rarely regular enough for a pattern to match reliably. The correct, faster, and infinitely cooler way is to use DOM. By using DOM in the getLinks function, it is simple to create an array with the matching links on a web page as keys and the link text as values. This array can then be looped over like any other array to build a list or be manipulated in any way desired. Note that error suppression is used when loading the HTML. This hides the warnings the parser raises about invalid markup, such as HTML entities that are not defined in the DOCTYPE. In a production environment the display of errors would normally be turned off anyway.
<?php
function getLinks($link){
    $ret = array();
    /*** a new dom object ***/
    $dom = new DOMDocument;
    /*** remove silly white space (a parse option, so set it before loading) ***/
    $dom->preserveWhiteSpace = false;
    /*** get the HTML via file_get_contents,
    though cURL is preferable, but that's out of scope of the question ***/
    $html = file_get_contents($link);
    if($html === false || $html === ''){
        /*** nothing could be downloaded, so there are no links ***/
        return $ret;
    }
    /*** load the HTML (@ suppresses warnings about invalid markup) ***/
    @$dom->loadHTML($html);
    /*** get the links from the HTML ***/
    $links = $dom->getElementsByTagName('a');
    /*** loop over the links ***/
    foreach ($links as $tag){
        $href = $tag->getAttribute('href');
        /*** only add download links to the return array;
        strpos returns 0 for a match at the start, so compare with !== false ***/
        if(strpos($href,'/download.php?id=') !== false){
            $ret[$href] = $tag->nodeValue;
        }
    }
    return $ret;
}
?>
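As an alternative to the @ operator, libxml can buffer its warnings so they can be inspected rather than thrown away. The following is only a sketch of that idea: loadHtmlQuietly is a hypothetical helper name, not part of the answer above, and it takes the HTML as a string so it can be tried without a network request.

```php
<?php
// Hypothetical helper: parse HTML and return the document together with
// any parser messages, instead of hiding them with the @ operator.
function loadHtmlQuietly(string $html): array {
    // Buffer libxml errors internally; remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Copy the buffered messages so the caller can log or ignore them.
    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    // Empty the buffer and restore the earlier error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($dom, $messages);
}
```

The returned array holds the DOMDocument first and the list of messages second, so getElementsByTagName('a') can be called on the first element exactly as in getLinks.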
Example Usage
<?php
/*** a link to search ***/
$link = "http://www.site.com";
/*** get the links ***/
$urls = getLinks($link);
/*** check for results ***/
if(count($urls) > 0){
    foreach($urls as $key=>$value){
        /*** print the href, the link text, and the id with the URL prefix stripped ***/
        echo $key . ' - '. $value . ' - ' . str_ireplace('http://www.site.com/download.php?id=','',$key). '<br>';
    }
}else{
    echo "No links found at $link";
}
?>
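The question also asks about walking through consecutive ids and pulling one kind of link from each page. A minimal sketch of that loop might look like the following. The function names (buildPageUrls, extractLinks, collectLinks), the id range, and the idea of passing the downloader in as a callable are all assumptions made for illustration; the callable is there so the loop can be tried with canned HTML before pointing it at a real site.

```php
<?php
// Hypothetical helper: build the page URLs for a run of consecutive ids.
function buildPageUrls(string $base, int $firstId, int $lastId): array {
    $urls = array();
    for ($id = $firstId; $id <= $lastId; $id++) {
        $urls[$id] = $base . '/download.php?id=' . $id;
    }
    return $urls;
}

// Hypothetical helper: return every href in $html that starts with $prefix.
function extractLinks(string $html, string $prefix): array {
    $found = array();
    if (trim($html) === '') {
        return $found; // nothing to parse
    }
    $dom = new DOMDocument;
    @$dom->loadHTML($html); // suppress warnings about invalid markup
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        if (strpos($href, $prefix) === 0) {
            $found[] = $href;
        }
    }
    return $found;
}

// Fetch each page with $fetch (a callable returning the HTML or false)
// and collect the matching links, keyed by page id.
function collectLinks(array $pageUrls, string $prefix, callable $fetch): array {
    $result = array();
    foreach ($pageUrls as $id => $url) {
        $html = $fetch($url);
        if ($html === false) {
            continue; // skip pages that could not be downloaded
        }
        $result[$id] = extractLinks($html, $prefix);
    }
    return $result;
}
```

Against a live site, a call such as collectLinks(buildPageUrls('http://www.site.com', 1000, 1010), 'http://www.site.com/', 'file_get_contents') would use file_get_contents as the downloader; the id range here is made up. Note that the prefix test also matches any other absolute link to the same site, so a narrower prefix may be needed depending on the real pages.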