
Code to visit multiple web pages using PHP

Let me explain the scenario: I have a table with around 1 million records. The records are hyperlinks to remote web pages. I need to create a PHP page, say work.php, which will take each link from the table, visit it using functions like file_get_contents, and send the response to one email address. I need the best way to code work.php.

I tried retrieving all the links from the database and processing each row in a while loop. Inside the loop I use file_get_contents to fetch the response and then send the mail. The script times out because there are a million records.

Please suggest the best way to optimize this. Any help is appreciated. Thanks


Keep a database table with all the addresses you intend to crawl, and their status.

In your script, find the first record that is marked unprocessed and process it; after a successful crawl, send the email, mark the record as finished, and move on to the next record.

Do only a small number of records per request.

If a request for a particular URL times out, your script will fail on it repeatedly. In that case, mark the record as broken in your table. This adds a bit of complication but is probably necessary: record when you start crawling a site and when it finishes; if there are multiple starts but no finish, there is a problem with the URL.

Use a header redirect, or a refresh meta tag, or a cron job to call the same script repeatedly.

Do this until there are no more unprocessed records in the table.
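As a rough illustration of this workflow, here is a minimal sketch of work.php. It assumes a hypothetical links table with id, url, status, started_at and finished_at columns, a MySQL connection via PDO, and a recipient address of your choosing; all of these names are placeholders, not part of the original question.

    <?php
    // Minimal sketch of work.php: process a small batch of unprocessed links
    // per request, tracking start/finish so broken URLs can be spotted.
    // Table, column, and credential names below are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
    $to  = 'reports@example.com';   // destination email address
    $batchSize = 20;                // only a small number of records per request

    $rows = $pdo->query(
        "SELECT id, url FROM links WHERE status IS NULL LIMIT $batchSize"
    )->fetchAll(PDO::FETCH_ASSOC);

    if (!$rows) {
        exit('No more unprocessed records.');
    }

    foreach ($rows as $row) {
        // Record the start; repeated starts without a finish indicate a broken URL.
        $pdo->prepare("UPDATE links SET status = 'started', started_at = NOW() WHERE id = ?")
            ->execute([$row['id']]);

        $body = @file_get_contents($row['url']);

        if ($body === false) {
            $pdo->prepare("UPDATE links SET status = 'broken' WHERE id = ?")
                ->execute([$row['id']]);
            continue;
        }

        mail($to, 'Crawl result for ' . $row['url'], $body);

        $pdo->prepare("UPDATE links SET status = 'finished', finished_at = NOW() WHERE id = ?")
            ->execute([$row['id']]);
    }

    // Call the same script again; alternatively drive it from cron.
    header('Refresh: 1; url=work.php');

Each request only touches a small batch, so no single run gets close to the timeout, and the status column tells you exactly where to resume.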


Multithread it and don't use the web to trigger it:

  1. Run this using the PHP CLI so that you don't need to keep making web requests through Apache to run the script. This will speed things up dramatically and should also get around the timeouts.
  2. Track processing state in the database. Add a flag column (CHAR(1)) to the table to indicate the state, e.g. NULL, 'P', 'S', 'F', 'E' for unprocessed, Processing, Started, Finished, Error.
  3. When you request a set of URLs to process, mark them as 'P' AND request only those that are not yet marked (flag is NULL). Now you can run multiple copies of the same script at once, and each will grab different URLs to process.
  4. During processing, mark each URL with the other flags as those events occur: when an error happens, when you start fetching it, and when you finish processing it.

Your typical home box should be able to run several instances of this concurrently.
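Here is a minimal CLI sketch of that claim-and-process loop, again using hypothetical names: a links table, the CHAR(1) flag column described above, and an extra worker column so that concurrent copies never claim the same rows.

    <?php
    // Run from the shell: php work.php
    // Flags: NULL = unprocessed, P = processing, S = started, F = finished, E = error.
    // Table, column, and credential names are assumptions, not from the question.
    set_time_limit(0);

    $pdo    = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');
    $to     = 'reports@example.com';
    $worker = uniqid('w', true);   // unique id for this process

    while (true) {
        // Claim a batch: mark unclaimed rows as 'P' together with our worker id,
        // so other copies of the script will skip them.
        $claim = $pdo->prepare("UPDATE links SET flag = 'P', worker = ? WHERE flag IS NULL LIMIT 50");
        $claim->execute([$worker]);

        $stmt = $pdo->prepare("SELECT id, url FROM links WHERE flag = 'P' AND worker = ?");
        $stmt->execute([$worker]);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        if (!$rows) {
            break;   // nothing left to process
        }

        foreach ($rows as $row) {
            $pdo->prepare("UPDATE links SET flag = 'S' WHERE id = ?")->execute([$row['id']]);

            $body = @file_get_contents($row['url']);

            if ($body === false) {
                $pdo->prepare("UPDATE links SET flag = 'E' WHERE id = ?")->execute([$row['id']]);
                continue;
            }

            mail($to, 'Crawl result for ' . $row['url'], $body);
            $pdo->prepare("UPDATE links SET flag = 'F' WHERE id = ?")->execute([$row['id']]);
        }
    }

Start several copies from the shell (for example `php work.php & php work.php &`) and each one will claim its own batches.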


Try using set_time_limit(0); at the top of your code so the script does not time out.
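For example, at the very top of the script (note that set_time_limit() only lifts PHP's own limit; a web server sitting in front can still cut the request off, which is another reason to prefer the CLI approach above):

    <?php
    // Lift PHP's maximum execution time for this long-running script.
    set_time_limit(0);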
