Best way to go about reading a website?
I'm trying to create a program that grabs data from a website x number of times, and I'm looking for a way to do that without huge delays in the process.
Currently I use the following code, and it's rather slow (even though it is only grabbing 4 people's names, I'm expecting to do about 100 at a time):
$skills = array(
    "overall", "attack", "defense", "strength", "constitution", "ranged",
    "prayer", "magic", "cooking", "woodcutting", "fletching", "fishing",
    "firemaking", "crafting", "smithing", "mining", "herblore", "agility",
    "thieving", "slayer", "farming", "runecrafting", "hunter", "construction",
    "summoning", "dungeoneering"
);
$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx");//explode("\r\n", $_POST['names']);
$skill = isset($_GET['skill']) ? array_search($_GET['skill'], $skills) : 0;
display($participants, $skills, $skill);
function getAllStats($participants) {
    $stats = array();
    for ($i = 0; $i < count($participants); $i++) {
        $stats[] = getStats($participants[$i]);
    }
    return $stats;
}
function display($participants, $skills, $stat) {
    $all = getAllStats($participants);
    for ($i = 0; $i < count($participants); $i++) {
        $rank = getSkillData($all[$i], 0, $stat);
        $level = getSkillData($all[$i], 1, $stat);
        $experience = getSkillData($all[$i], 2, $stat); // each line is rank,level,xp
    }
}
function getStats($username) {
    $curl = curl_init("http://hiscore.runescape.com/index_lite.ws?player=" . $username);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5); // connect timeout in seconds
    curl_setopt($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
    curl_setopt($curl, CURLOPT_HEADER, 0); // don't include response headers in the output
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_VERBOSE, 1);
    $output = curl_exec($curl);
    $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE); // only meaningful after curl_exec()
    curl_close($curl);
    if (strstr($output, "<html><head><title>")) { // error pages come back as HTML
        return false;
    }
    return $output;
}
function getSkillData($stats, $row, $skill) {
    $stats = explode("\n", $stats);
    $levels = explode(",", $stats[$skill]);
    return $levels[$row];
}
When I benchmarked this it took about 5 seconds, which isn't too bad, but imagine if I was doing this 93 more times. I understand it won't be instant, but I'd like to shoot for under 30 seconds. I know it's possible because I've seen websites that do something similar and respond within about 30 seconds.
I've read about caching the data, but that won't work because, simply, it will be stale. I'm using a database (further on; I haven't gotten to that part yet) to store old data and retrieve new data, which will be real time (what you see below).
Is there a way to achieve doing something like this without massive delays (and possibly overloading the server I am reading from)?
P.S.: The website I am reading from is just text; it doesn't have any HTML to parse, which should reduce the loading time. Here's an example of what a page looks like (they're all the same, just with different numbers):
69,2496,1285458634 10982,99,33055154 6608,99,30955066 6978,99,40342518 12092,99,36496288 13247,99,21606979 2812,99,13977759 926,99,36988378 415,99,153324269 329,99,59553081 472,99,40595060 2703,99,28297122 281,99,36937100 1017,99,19418910 276,99,27539259 792,99,34289312 3040,99,16675156 82,99,39712827 80,99,104504543 2386,99,21236188 655,99,28714439 852,99,30069730 29,99,200000000 3366,99,15332729 2216,99,15836767 154,120,200000000 -1,-1 -1,-1 -1,-1 -1,-1 -1,-1 30086,2183 54640,1225 89164,1028 123432,1455 -1,-1 -1,-1
My previous benchmark with this method vs. curl_multi_exec:
function getTime() {
    $timer = explode(' ', microtime());
    $timer = $timer[1] + $timer[0];
    return $timer;
}
function benchmarkFunctions() {
    $start = getTime();
    old_f();
    $end = getTime();
    echo 'function old_f() took ' . round($end - $start, 4) . ' seconds to complete<br><br>';
    $startt = getTime();
    new_f();
    $endd = getTime();
    echo 'function new_f() took ' . round($endd - $startt, 4) . ' seconds to complete';
}
function old_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    getAllStats($test);
}
function new_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    $curl_arr = array();
    $master = curl_multi_init();
    $amt = count($test);
    for ($i = 0; $i < $amt; $i++) {
        $curl_arr[$i] = curl_init('http://hiscore.runescape.com/index_lite.ws?player=' . $test[$i]);
        curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($master, $curl_arr[$i]);
    }
    do {
        curl_multi_exec($master, $running);
        if ($running > 0) {
            curl_multi_select($master); // wait for activity instead of busy-looping
        }
    } while ($running > 0);
    for ($i = 0; $i < $amt; $i++) {
        // curl_multi_getcontent() returns the already-downloaded body;
        // calling curl_exec() here would fire each request again, sequentially.
        $results[$i] = curl_multi_getcontent($curl_arr[$i]);
        curl_multi_remove_handle($master, $curl_arr[$i]);
        curl_close($curl_arr[$i]);
    }
    curl_multi_close($master);
}
When you are doing a bunch of network requests like this, you are at the mercy of the network and the remote server regarding how much time they take to respond.
Because of this, the best way to make all of your requests complete in the shortest amount of time is probably to do them all at once. Spawn a new thread for each one. For the number of requests you're dealing with, it's probably quite possible to do them literally all at once, but if that's a problem then maybe try 20 or so at a time.
EDIT: I just realized you're using PHP which doesn't have threads. Well, probably a poor choice of language, for starters. But you might be able to emulate threads by forking new processes. This might be a wreck, though, if PHP is running inside the web server process, since it would clone the whole server. I'll look into whether PHP offers some sort of asynchronous web requests that could give a similar effect.
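As a rough, heavily hedged sketch of the fork-per-request idea: this assumes the pcntl extension and the PHP CLI (not a web server process), and uses temp files as a crude way to hand results back to the parent. The function name and file paths are made up for illustration, reusing the $participants list from the question.
function fetchAllForked($usernames) {
    $pids = array();
    foreach ($usernames as $i => $name) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die('could not fork');
        } elseif ($pid === 0) {
            // Child process: fetch one page and drop it in a temp file for the parent.
            $data = file_get_contents('http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($name));
            file_put_contents(sys_get_temp_dir() . "/stats_$i.txt", $data === false ? '' : $data);
            exit(0);
        }
        $pids[] = $pid; // Parent: remember the child and keep forking.
    }
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status); // Wait for every child to finish.
    }
    $results = array();
    foreach ($usernames as $i => $name) {
        $results[$name] = file_get_contents(sys_get_temp_dir() . "/stats_$i.txt");
    }
    return $results;
}
Each child fetches exactly one page and exits; the parent simply waits and then reads the files back.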
EDIT 2:
Here is a page discussing how to launch an HTTP request in the background with PHP:
http://w-shadow.com/blog/2007/10/16/how-to-run-a-php-script-in-the-background/
However, this is "fire and forget"; it doesn't let you pick up the response to your request and do something with it. One approach you could take, though, would be to use this method to fire off many requests to a different page on your own server, and have each of those pages make a single request to the remote server. (Or, each worker request could process a batch of requests if you don't want to start too many requests at once.)
You would still need a way to assemble all the results, and a way to detect when the whole procedure is complete so you can display the results. I would probably use either the database or the filesystem to coordinate between the different processes.
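The core trick that article relies on is simply writing a raw HTTP request to a socket and closing it without reading the response. A minimal sketch of that idea, assuming a hypothetical worker.php on your own server that fetches and stores one player (again using the $participants array from the question):
function fireAndForget($host, $path) {
    // Open a plain TCP connection to our own web server.
    $fp = fsockopen($host, 80, $errno, $errstr, 5);
    if (!$fp) {
        return false;
    }
    // Send the request headers, then hang up without waiting for the reply.
    fwrite($fp, "GET $path HTTP/1.1\r\nHost: $host\r\nConnection: Close\r\n\r\n");
    fclose($fp);
    return true;
}

foreach ($participants as $name) {
    fireAndForget('example.com', '/worker.php?player=' . urlencode($name));
}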
(Again, choosing a more powerful language for this task would probably be helpful. In the realm of languages similar to PHP, I know Perl would handle this problem very easily with "use threads", and I imagine Python or Ruby would as well.)
EDIT 3:
Another solution, this one using the UNIX shell to get around PHP's limitations by doing the work in separate processes. You can run a command something like this:
echo "$urlList" | xargs -P 10 -r -n1 wget
You would probably want to play with the wget options a bit, such as specifying the output file explicitly, but this is the general idea. In place of wget you could also use curl, or even just call a PHP script that's designed to be run from the command line, if you want complete control over the job of fetching the pages.
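For example (a sketch only; the file paths and wget flags here are just guesses, adjust to taste), you could build the URL list in PHP and hand it to xargs:
$urls = array();
foreach ($participants as $name) {
    $urls[] = 'http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($name);
}
file_put_contents('/tmp/urls.txt', implode("\n", $urls) . "\n");

// xargs -P 10 runs up to 10 wget processes at once;
// wget -q -P /tmp/stats/ drops each downloaded file into /tmp/stats/.
shell_exec('xargs -P 10 -n 1 wget -q -P /tmp/stats/ < /tmp/urls.txt');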
Again, with this solution you still have the problem of recognizing when the job is done so you can show the results.
I got the idea for this approach from this page:
http://www.commandlinefu.com/commands/view/3269/parallel-file-downloading-with-wget
You can reuse curl connections. Also, I changed your code to check the httpCode instead of using strstr. It should be quicker.
Also, you can set curl up to do the requests in parallel, which I've never tried. See http://www.php.net/manual/en/function.curl-multi-exec.php
An improved getStats() with a reused curl handle:
function getStats(&$curl, $username) {
    curl_setopt($curl, CURLOPT_URL, "http://hiscore.runescape.com/index_lite.ws?player=" . $username);
    $output = curl_exec($curl);
    if (curl_getinfo($curl, CURLINFO_HTTP_CODE) != '200') {
        return null;
    }
    return $output;
}
Usage:
$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx");
$curl = curl_init();
curl_setopt ($curl, CURLOPT_CONNECTTIMEOUT, 0); //dangerous! will wait indefinitely
curl_setopt ($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
curl_setopt ($curl, CURLOPT_HEADER, false);
curl_setopt ($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt ($curl, CURLOPT_VERBOSE, 1);
//try:
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
    'Connection: Keep-Alive',
    'Keep-Alive: 300'
));
header('Content-type:text/plain');
foreach ($participants as &$user) {
    $stats = getStats($curl, $user);
    if ($stats !== null) {
        echo $stats . "\r\n";
    }
}
curl_close($curl);
Since you are making multiple requests to the same host, you can re-use the curl handle, and if the site supports keep-alive requests, it could speed up your process a good bit over many requests.
You can change your function like this:
function getStats($username) {
    static $curl = null;
    if ($curl == null) {
        $curl = curl_init();
    }
    curl_setopt($curl, CURLOPT_URL, "http://hiscore.runescape.com/index_lite.ws?player=" . $username);
    curl_setopt($curl, CURLOPT_HTTPHEADER, array('Connection: Keep-Alive'));
    //...
    // remove curl_close($curl)
}
Doing this will make it so you don't have to close and re-establish the socket for every user request. It will use the same connection for all the requests.
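Usage stays exactly the same as your original, since the handle is hidden inside the function (a sketch, assuming getStats() still returns false on failure and the other curl options are set as in your original):
foreach ($participants as $username) {
    $stats = getStats($username);
    if ($stats !== false) {
        // parse with getSkillData() as before
    }
}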
curl is a very good way to read the content of the website - I suppose your problem is because of the time required to download ONE page. If you can get all 100 pages in parallel, then you would probably have it all processed in under 10 seconds.
In order to avoid working with threads, locks, semaphores, and all the other hard parts of threading, read this article and find out a way to make your application parallel almost for free.
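If you do go down the curl_multi route, something along these lines might be a reasonable middle ground (an untested sketch; the function name and the batch size of 20 are arbitrary choices, not taken from any answer above):
function fetchInBatches($usernames, $batchSize = 20) {
    $results = array();
    foreach (array_chunk($usernames, $batchSize) as $batch) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($batch as $name) {
            $ch = curl_init('http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($name));
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
            curl_multi_add_handle($mh, $ch);
            $handles[$name] = $ch;
        }
        // Drive all transfers in this batch; curl_multi_select() avoids busy-waiting.
        do {
            curl_multi_exec($mh, $running);
            if ($running > 0) {
                curl_multi_select($mh);
            }
        } while ($running > 0);
        foreach ($handles as $name => $ch) {
            $results[$name] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}
Batching like this keeps you from opening 100 sockets to the hiscores server at once, which also addresses your concern about overloading the server you're reading from.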