Code to find strings in source code over many URLs
I want to enter a very long list of URLs and search for specific strings within the source code of each page, outputting a list of the URLs that contain the string. Sounds simple enough, right? I have come up with the code below; the input comes from an HTML form. You can try it at pelican-cement.com/findfrog.
It seems to work half the time, but it is thrown off by multiple URLs or by URLs in a different order. Searching for 'adsense', it correctly identifies politics1.com out of
cnn.com
politics1.com
However, if the order is reversed, the output is blank. How can I get reliable, consistent results, preferably with something I could feed thousands of URLs into?
<html>
<body>
<?
set_time_limit (0);
$urls=explode("\n", $_POST['url']);
$allurls=count($urls);
for ( $counter = 0; $counter <= $allurls; $counter++) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$urls[$counter]);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
curl_setopt ($ch, CURLOPT_HEADER, 1);
curl_exec ($ch);
$curl_scraped_page=curl_exec($ch);
$haystack=strtolower($curl_scraped_page);
$needle=$_POST['proxy'];
if (strlen(strstr($haystack,$needle))>0) {
echo $urls[$counter];
echo "<br/>";
curl_close($ch);
}
}
//$FileNameSQL = "/googleresearch" . abs(rand(0,1000000000000000)) . ".csv";
//$query = "SELECT * FROM happyturtle INTO OUTFILE '$FileNameSQL' FIELDS TERMINATED BY ','";
//$result = mysql_query($query) or die(mysql_error());
//exit;
echo '$FileNameSQL';
?>
</body>
</html>
I reorganized your code a bit. The main culprit was whitespace: you need to trim each URL string before using it (i.e. trim($url)).
Other changes:
- Set your search term outside the for loop, since it never changes.
- Set up the curl object outside the loop and reuse it by just changing the URL each time.
- Use curl_setopt_array() to set multiple curl options in one statement.
- Use a foreach loop, since you're iterating over the entire array anyway and the code is cleaner.
- Use stripos(), which is case-insensitive and more efficient than strstr() because it returns only the position of the match instead of a copy of the rest of the string.
- Use the !== comparator to prevent implied typecasting (FALSE !== 0, but FALSE == 0).
- Check the returned $html string as curl_exec() can return FALSE if it fails.
- Close the curl object at the end (i.e. outside the if statement too).
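To illustrate the whitespace point above: browsers normally submit textarea line breaks as \r\n, so explode("\n", ...) leaves a trailing \r on every URL except the last one, which would explain why only the final URL in the list ever matched. A minimal sketch of cleaning the list up front; parse_url_list() is a made-up helper name, not part of the original code:

```php
<?php
// Hypothetical helper (not in the original code): turn the raw
// textarea contents into a clean list of URLs.
function parse_url_list(string $raw): array
{
    // Split on any line-break style (\r\n, \n or \r).
    $lines = preg_split('/\R/', $raw);
    // Strip surrounding whitespace, including stray \r characters.
    $lines = array_map('trim', $lines);
    // Drop blank lines and reindex the array from 0.
    $lines = array_filter($lines, function ($line) {
        return $line !== '';
    });
    return array_values($lines);
}

// Simulated form input with Windows-style line breaks and a blank line.
$urls = parse_url_list("cnn.com\r\npolitics1.com\r\n\r\n");
print_r($urls); // two clean entries: cnn.com and politics1.com
```

With the list cleaned like this, the loop no longer needs to trim each URL, and blank lines never reach curl.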
The code below can be run on my quick mockup.
<html>
<body>
<form action="search.php" method="post">
URLs: <br/>
<textarea rows="20" cols="50" name="url"></textarea><br/>
Search Term: <br/>
<textarea rows="20" cols="50" name="proxy"></textarea><br/>
<input type="submit" />
</form>
<?
if(isset($_POST['url'])) {
set_time_limit (0);
$urls = explode("\n", $_POST['url']);
$term = trim($_POST['proxy']); // the search term comes from a textarea too, so trim it as well
$options = array( CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_CUSTOMREQUEST => 'GET',
CURLOPT_HEADER => 1,
);
$ch = curl_init();
curl_setopt_array($ch, $options);
foreach ($urls as $url) {
curl_setopt($ch, CURLOPT_URL, trim($url));
$html = curl_exec($ch);
if ($html !== FALSE && stripos($html, $term) !== FALSE) { // Found!
echo trim($url), "<br/>"; // one matching URL per line
}
}
curl_close($ch);
}
?>
</body>
</html>
Perhaps you should call
curl_close($ch);
regardless of whether it finds the string in the scraped page or not. Aside from that, I don't see anything obviously wrong with the code.
If it's not something in the code, then it's probably some difference in the scraped page. Maybe the page is dynamic and doesn't always contain the needle word on subsequent checks. Maybe the server of the page you were trying to scrape returned an error code.
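One way to tell these cases apart is to ask cURL what happened instead of looking only at the body. A rough sketch, assuming the cURL extension is loaded; fetch_page() and describe_fetch() are hypothetical names, and both the 30-second timeout and the choice to treat any status of 400 or above as a failure are my own assumptions:

```php
<?php
// Hypothetical helper: fetch one URL and report what cURL saw.
function fetch_page(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30, // assumed limit so one dead host cannot stall the run
    ));
    $body   = curl_exec($ch);                               // false on failure
    $error  = curl_error($ch);                              // '' when there was no transport error
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);  // 0 when no response arrived
    curl_close($ch);

    return array('body' => $body, 'status' => $status, 'error' => $error);
}

// Hypothetical helper: summarise a fetch_page() result in one line.
function describe_fetch(array $result): string
{
    if ($result['body'] === false) {
        return 'transport error: ' . $result['error'];
    }
    if ($result['status'] >= 400) {
        return 'HTTP error ' . $result['status'];
    }
    return 'ok';
}
```

Printing describe_fetch(fetch_page($url)) next to each URL while debugging should show whether a blank result comes from a failed request or from a page that really lacks the string.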
A couple of tweaks; I'm not sure they will help, but still:
$url_to_go = trim($urls[$counter]);
if($url_to_go){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url_to_go);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
curl_setopt ($ch, CURLOPT_HEADER, 1);
$curl_scraped_page=curl_exec($ch);
curl_close($ch);
// more code follows
}
Could it be carriage returns/whitespace around the URLs that is throwing it off? It might be worth putting in a
$urls[$counter] = trim($urls[$counter]);
at the start of your for loop.
Also:
if (strpos($haystack, $needle) !== false) {
[...]
}
is a more efficient way of checking whether one string contains another. You could also use stripos() here instead of lowercasing the whole page with strtolower() first, which would also handle a search term that contains capital letters (not sure if that would improve things).
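The strict !== comparison matters because strpos() and stripos() return the position of the match, and a match at the very start of the string is position 0, which a loose comparison treats the same as false. A small sketch; page_contains() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: case-insensitive "does $haystack contain $needle?"
function page_contains(string $haystack, string $needle): bool
{
    // Guard against an empty needle, which older PHP versions warn about.
    // Compare against false with !== so a match at offset 0 still counts.
    return $needle !== '' && stripos($haystack, $needle) !== false;
}

var_dump(page_contains('AdSense snippet goes here', 'adsense')); // bool(true)
var_dump(page_contains('<html>no ads</html>', 'adsense'));       // bool(false)

// The loose comparison gets the first case wrong: the match is at
// offset 0, and 0 == false is true, so it looks like "not found".
var_dump(stripos('AdSense snippet goes here', 'adsense') == false); // bool(true)
```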