Code to find strings in source code over many urls

2023-02-24 09:06

I want to enter a very long list of URLs and search for specific strings within their source code, outputting a list of the URLs that contain the string. Sounds simple enough, right? I have come up with the code below; the input comes from an HTML form. You can try it at pelican-cement.com/findfrog.

It seems to work about half the time, but it is thrown off by multiple URLs or by the same URLs in a different order. Searching for 'adsense', it correctly identifies politics1.com out of

cnn.com
politics1.com

However, if the order is reversed, the output is blank. How can I get reliable, consistent results, preferably with something I could feed thousands of URLs into?

<html>
<body>

<?
set_time_limit (0);

$urls=explode("\n", $_POST['url']);

$allurls=count($urls);

for ( $counter = 0; $counter <= $allurls; $counter++) {

 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL,$urls[$counter]);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
 curl_setopt ($ch, CURLOPT_HEADER, 1); 
 curl_exec ($ch); 
 $curl_scraped_page=curl_exec($ch); 

$haystack=strtolower($curl_scraped_page);
$needle=$_POST['proxy'];
if (strlen(strstr($haystack,$needle))>0) {

echo $urls[$counter];
echo "<br/>";
curl_close($ch);
}
}




//$FileNameSQL = "/googleresearch" .  abs(rand(0,1000000000000000))  .  ".csv";
//$query = "SELECT * FROM happyturtle INTO OUTFILE '$FileNameSQL' FIELDS TERMINATED BY ','";
//$result = mysql_query($query) or die(mysql_error());

//exit;

echo '$FileNameSQL';





?>

</body>
</html>

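I reorganized your code a bit. The main culprit was whitespace: a textarea submits its lines separated by "\r\n", so explode("\n", ...) leaves a trailing "\r" on every URL except the last one, which is why the result depends on the order. You need to trim each URL string before using it (i.e. trim($url)).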

Other changes:

Set your search term outside the for loop, since it never changes.
Set up the curl object outside the loop and reuse it by just changing the URL each time.
Use curl_setopt_array() to set multiple curl options in one statement.
Use a foreach loop, since you're iterating over the entire array anyway and the code is cleaner.
Use stripos() instead of strstr(): it is more efficient and case-insensitive, so the strtolower() call is no longer needed.
Use the !== comparator to prevent implied typecasting (FALSE !== 0, but FALSE == 0).
Check the returned $html string as curl_exec() can return FALSE if it fails.
Close the curl object at the end (i.e. outside the if statement too).

The code below can be run on my quick mockup.

<html>
<body>

<form action="search.php" method="post"> 
  URLs: <br/>
  <textarea rows="20" cols="50" name="url"></textarea><br/>

  Search Term: <br/>
  <textarea rows="20" cols="50" name="proxy"></textarea><br/>

  <input type="submit" /> 
</form>

<?php
  if(isset($_POST['url'])) {
    set_time_limit (0);

    $urls = explode("\n", $_POST['url']);
    $term = $_POST['proxy'];
    $options = array( CURLOPT_FOLLOWLOCATION => 1,
                      CURLOPT_RETURNTRANSFER => 1,
                      CURLOPT_CUSTOMREQUEST  => 'GET',
                      CURLOPT_HEADER         => 1,
                      );
    $ch = curl_init();
    curl_setopt_array($ch, $options);

    foreach ($urls as $url) {
      curl_setopt($ch, CURLOPT_URL, trim($url));
      $html = curl_exec($ch);

      if ($html !== FALSE && stripos($html, $term) !== FALSE) { // Found!
        echo trim($url), "<br/>";
      }
    }

    curl_close($ch);
  }
?>

</body>
</html>
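
To illustrate the point about !== from the list above, here is a minimal sketch. The helper name contains_term and the sample string are made up for the example. stripos() returns the integer position of the match, and a match at the very start of the string is position 0, which a loose comparison treats as "not found".

```php
<?php
// Hypothetical helper: true when $term occurs anywhere in $haystack,
// ignoring case. The strict !== keeps a match at position 0 from being
// mistaken for "no match".
function contains_term(string $haystack, string $term): bool
{
    return stripos($haystack, $term) !== false;
}

// Made-up sample where the term sits at the very start of the string.
$sample = 'AdSense snippet at the start of the page';
$pos    = stripos($sample, 'adsense');   // integer 0, not false

var_dump($pos == false);                        // loose comparison: 0 equals false
var_dump($pos !== false);                       // strict comparison: a match was found
var_dump(contains_term($sample, 'adsense'));    // same strict check via the helper
```

In the real script the haystack starts with the response headers (CURLOPT_HEADER is on), so a hit at position 0 is unlikely there, but the strict comparison costs nothing and removes the edge case.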

Perhaps you should call

curl_close($ch);

regardless of whether it finds the string in the scraped page or not. Aside from that, I don't see anything obviously wrong with the code.

If it's not something in the code, then it's probably some difference in the scraped page. Maybe the page is dynamic and doesn't always contain the needle word on subsequent checks. Maybe the server of the page you were trying to scrape returned an error code.
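
One way to tell these cases apart is to look at what cURL reports after each request. The sketch below is only an illustration: the helper names diagnose_fetch and fetch_with_diagnosis and the 15-second timeout are my own assumptions, not part of the original code. curl_error() and curl_getinfo() with CURLINFO_HTTP_CODE are standard cURL functions.

```php
<?php
// Hypothetical helper: turns the raw outcome of one request into a short diagnosis.
// $body is whatever curl_exec() returned (string, or false on failure).
function diagnose_fetch($body, int $status, string $error): string
{
    if ($body === false) {
        return "transfer failed: $error";   // DNS failure, timeout, malformed URL, ...
    }
    if ($status >= 400) {
        return "HTTP error $status";        // the server answered, but with an error page
    }
    return 'ok';
}

// Hypothetical helper: fetches one URL and returns array(body, diagnosis).
function fetch_with_diagnosis(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_TIMEOUT        => 15,       // assumed value, adjust as needed
    ));
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);
    curl_close($ch);

    return array($body, diagnose_fetch($body, $status, $error));
}
```

Echoing the diagnosis next to each URL while debugging should show whether a blank result means "string not found" or "page never fetched".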

A couple of tweaks; I'm not sure they will help, but still:

$url_to_go = trim($urls[$counter]);
if($url_to_go){
 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL,$url_to_go);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
 curl_setopt ($ch, CURLOPT_HEADER, 1); 
 $curl_scraped_page=curl_exec($ch); 
 curl_close($ch);

 // more code follows
}

Could it be carriage returns/whitespace around the URLs that is throwing it off? It might be worth putting in a

$urls[$counter] = trim($urls[$counter]);

at the start of your for loop.
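
To see why the order of the URLs would matter, here is a small sketch of what explode() produces from a textarea. Browsers submit textarea line breaks as "\r\n", so splitting on "\n" alone leaves a carriage return on every line except the last. The function name clean_url_list is hypothetical.

```php
<?php
// Hypothetical helper: splits the posted textarea value into URLs,
// trimming stray whitespace (including "\r") and dropping empty lines.
function clean_url_list(string $posted): array
{
    $lines = array_map('trim', explode("\n", $posted));
    return array_values(array_filter($lines, 'strlen'));
}

// Roughly what arrives in $_POST['url'] for two lines typed into the form.
$posted = "politics1.com\r\ncnn.com";

$raw = explode("\n", $posted);
var_dump($raw[0]);   // "politics1.com\r": the trailing "\r" makes this URL invalid
var_dump($raw[1]);   // "cnn.com": the last line has nothing trailing, so it works

var_dump(clean_url_list($posted));   // both entries without the "\r"
```

That matches the symptom in the question: whichever URL is on the last line is the only one that gets fetched correctly.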

Also:

if (strpos($haystack, $needle) !== false) {
    [...]
}

is a more efficient way of checking if one string contains another. You could also use stripos here instead of strtolower()'ing the whole thing first (not sure if that would improve things).
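
A related pitfall in the original approach: only the haystack is lowercased, so a search term typed with any capital letters can never match. A minimal sketch of the difference, with made-up function names and sample markup:

```php
<?php
// The question's approach: lowercase the page, leave the needle as typed.
function matches_lowercased_haystack(string $page, string $needle): bool
{
    return strpos(strtolower($page), $needle) !== false;
}

// Case-insensitive search on the untouched page.
function matches_stripos(string $page, string $needle): bool
{
    return stripos($page, $needle) !== false;
}

// Made-up sample markup and a mixed-case search term.
$page   = '<script src="https://example.com/AdSense.js"></script>';
$needle = 'AdSense';

var_dump(matches_lowercased_haystack($page, $needle)); // misses: "adsense" vs "AdSense"
var_dump(matches_stripos($page, $needle));             // finds it regardless of case
```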
