How to parse an HTML page using PHP?
Parsing HTML / JS code to extract information using PHP.
www.asos.com/Asos/Little-Asos-Union-Jack-T-Shirt/Prod/pgeproduct.aspx?iid=1273626
Take a look at this page; it's a clothes shop for kids. This is one of their items, and I want to point out the size section. What we need to do here is get all the sizes for this item and check whether each size is available or not. Right now the sizes for this item are:
3-4 years
4-5 years
5-6 years
7-8 years
How can you tell whether a size is available or not?
Now take a look at this other page and check the sizes again:
www.asos.com/Ralph-Lauren/Ralph-Lauren-Long-Sleeve-Big-Horse-Stripe-Rugby-Top/Prod/pgeproduct.aspx?iid=1111751
This item has the following sizes:
12 months
18 months - Not Available
24 months
As you can see, the 18 months size is not available; this is indicated by the "Not Available" text next to the size.
What we need to do is go to the page of an item, get the sizes and check the availability of each size. How can I do this in PHP?
EDIT:
Added working code and a new problem to tackle.
Working code, but it needs more work:
<?php
function getProductVariations($url) {
    //Use CURL to get the raw HTML for the page
    $ch = curl_init();
    curl_setopt_array($ch,
        array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => false,
            CURLOPT_URL => $url
        )
    );
    $raw_html = curl_exec($ch);
    //If we get an invalid response back from the server fail
    if ($raw_html === false) {
        //Grab the error message and close the handle before throwing
        $error = curl_error($ch);
        curl_close($ch);
        throw new Exception($error);
    }
    curl_close($ch);
    //Find the variation JS declarations and extract them
    $raw_variations = preg_match_all("/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[[0-9]+\].*Array\((.*)\);/", $raw_html, $raw_matches);
    //We are done with the Raw HTML now
    unset($raw_html);
    //Check that we got some results back
    if (is_array($raw_matches) && isset($raw_matches[1]) && sizeof($raw_matches[1]) == $raw_variations && $raw_variations > 0) {
        //This is where the matches will go
        $matches = array();
        //Go through the results of the bracketed expression and convert them to a PHP assoc array
        foreach ($raw_matches[1] as $match) {
            //As they are declared in javascript we can use json_decode to process them nicely, they just need wrapping
            $proc = json_decode("[$match]");
            //Skip declarations that do not decode to the expected 12 fields
            if (!is_array($proc) || count($proc) < 12) {
                continue;
            }
            //Label the fields as best we can
            $proc2 = array(
                "variation_id" => $proc[0],
                "size_desc" => $proc[1],
                "colour_desc" => $proc[2],
                "available" => (trim(strtolower($proc[3])) == "true"),
                "unknown_col1" => $proc[4],
                "price" => $proc[5],
                "unknown_col2" => $proc[6], /*Always seems to be zero*/
                "currency" => $proc[7],
                "unknown_col3" => $proc[8],
                "unknown_col4" => $proc[9], /*Negative price*/
                "unknown_col5" => $proc[10], /*Always seems to be zero*/
                "unknown_col6" => $proc[11] /*Always seems to be zero*/
            );
            //Push the processed variation onto the results array
            $matches[$proc[0]] = $proc2;
            //We are done with our proc2 array now (proc will be unset by the foreach loop)
            unset($proc2);
        }
        //Return the matches we have found
        return $matches;
    } else {
        throw new Exception("Unable to find any product variations");
    }
}
//EXAMPLE USAGE
try {
    $variations = getProductVariations("http://www.asos.com/Asos/Prod/pgeproduct.aspx?iid=803846");
    //Do something more useful here
    print_r($variations);
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>
The above code works, but there's a problem when the product requires you to select a colour before the sizes are displayed.
Like this one:
http://www.asos.com/Little-Joules/Little-Joules-Stewart-Venus-Fly-Trap-T-Shirt/Prod/pgeproduct.aspx?iid=1171006
Any idea how to go about this?
SOLUTION:
function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    //Close the handle before returning (code after a return never runs)
    curl_close($ch);
    return $result;
}
$html = curl('http://www.asos.com/pgeproduct.aspx?iid=1111751');
preg_match_all('/arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct\[(.*?)\] \= new Array\((.*?),\"(.*?)\",\"(.*?)\",\"(.*?)\"/is',$html,$bingo);
print_r($bingo);
Link: http://debconf11.com/stackoverflow.php
You are on your own now :)
EDIT2:
OK, we are close to a solution...
<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>
The sizes are not loaded via AJAX; instead, the array is stored in a JavaScript variable in the page source. You can parse this with PHP: the fourth field of the 18 months entry is "False", which means that size is not available.
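A minimal sketch of that parsing step, run against the script block quoted above rather than a live page. The helper name `parseSizeAvailability()` is made up for illustration, and the field positions (size label second, availability flag fourth) are assumed from the sample. Note that `json_decode()` expects UTF-8, so a page served in another encoding would need converting first.

```php
<?php
// Sample markup copied from the page source quoted above.
$html = <<<'HTML'
<script type="text/javascript">var arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct = new Array;
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[0] = new Array(1164,"12 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[1] = new Array(1165,"18 months","SailingOrange","False","","59.00","0.00","£","","-59.00","0.00","0");
arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct[2] = new Array(1167,"24 months","SailingOrange","True","","59.00","0.00","£","","-59.00","0.00","0");
</script>
HTML;

/**
 * Hypothetical helper: returns a map of size label => bool (true if available).
 */
function parseSizeAvailability(string $html): array
{
    // Capture the argument list of every "arrSzeCol_...[n] = new Array(...);" line.
    $pattern = '/arrSzeCol_\w+\[\d+\]\s*=\s*new Array\((.*?)\);/';
    preg_match_all($pattern, $html, $matches);

    $sizes = [];
    foreach ($matches[1] as $args) {
        // The argument list is valid JSON once wrapped in square brackets.
        $fields = json_decode('[' . $args . ']');
        if (!is_array($fields) || count($fields) < 4) {
            continue; // skip anything that does not decode as expected
        }
        // Assumed layout: index 1 = size label, index 3 = "True"/"False".
        $sizes[$fields[1]] = (strtolower(trim($fields[3])) === 'true');
    }
    return $sizes;
}

$sizes = parseSizeAvailability($html);
foreach ($sizes as $label => $available) {
    echo $label, ': ', $available ? 'available' : 'not available', PHP_EOL;
}
```

For the sample above this reports 12 months and 24 months as available and 18 months as not available.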
EDIT:
These sizes are loaded via JavaScript, so you cannot parse them out of the HTML: they are not there. This is all I can extract:
<select name="drpdwnSize" id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option>
</select>
You can inspect the page's JavaScript to check whether the sizes can be loaded based on the product ID.
First you need: http://simplehtmldom.sourceforge.net/ Forget file_get_contents(); it is about 5 times slower than cURL.
You then parse this piece of code (the HTML element with id ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize):
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize" onchange="drpdwnSizeChange(this, 'ctl00_ContentMainPage_ctlSeparateProduct', arrSzeCol_ctl00_ContentMainPage_ctlSeparateProduct);">
<option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>
You can then use preg_match(), explode(), str_replace() and similar functions to filter out the values you want. I can write it, but I don't have time right now :)
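A rough sketch of that filtering step using only string functions, applied to the `<option>` markup quoted above (the `onchange` attribute is left out for brevity). The helper name `parseSizeOptions()` is hypothetical, and the code assumes unavailable sizes always carry the " - Not Available" suffix seen in the sample.

```php
<?php
// Markup taken from the <select> element quoted above (onchange attribute omitted).
$html = <<<'HTML'
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize" name="ctl00$ContentMainPage$ctlSeparateProduct$drpdwnSize">
<option value="-1">Select Size</option><option value="1164">12 months</option><option value="1165">18 months - Not Available</option><option value="1167">24 months</option></select>
HTML;

/**
 * Hypothetical helper: returns one entry per real size option.
 */
function parseSizeOptions(string $html): array
{
    // Assumption: unavailable sizes are marked with this exact suffix.
    $suffix = ' - Not Available';

    preg_match_all('/<option value="(-?\d+)">(.*?)<\/option>/i', $html, $matches, PREG_SET_ORDER);

    $sizes = [];
    foreach ($matches as $match) {
        [, $id, $label] = $match;
        if ($id === '-1') {
            continue; // the "Select Size" placeholder is not a real size
        }
        $sizes[] = [
            'id'        => (int) $id,
            'size'      => trim(str_ireplace($suffix, '', $label)),
            'available' => stripos($label, $suffix) === false,
        ];
    }
    return $sizes;
}

$options = parseSizeOptions($html);
print_r($options);
```

Regular expressions are fragile against markup changes such as extra attributes or reordered quotes, so this only suits markup as regular as the sample.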
The simplest way to fetch the content of a URL is to rely on fopen wrappers and just use file_get_contents with the URL. You can use the tidy extension to parse the HTML and extract content. http://php.net/tidy
You can download the file using fopen() or file_get_contents(), as Raoul Duke said, but if you have experience with the JavaScript DOM model, the DOM extension might be a bit easier to use than Tidy.
I know for a fact that the DOM extension is enabled by default in PHP, but I am a bit unsure whether Tidy is (the manual page only says it's "bundled", so I suspect that it might not be enabled).
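As an illustration of the DOM extension approach, here is a small sketch that first reports whether the `dom` and `tidy` extensions are loaded on the current installation, then reads the size options with `DOMDocument` and `DOMXPath`. The markup is a shortened sample of the size dropdown, the function name `sizesFromSelect()` is invented for this example, and treating "Not Available" in the label as the availability marker is an assumption based on that sample.

```php
<?php
// Report which of the two extensions this PHP installation has loaded.
foreach (['dom', 'tidy'] as $ext) {
    echo $ext, ': ', extension_loaded($ext) ? 'loaded' : 'not loaded', PHP_EOL;
}

// Shortened sample of the size dropdown markup.
$html = <<<'HTML'
<select id="ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize">
<option value="-1">Select Size</option>
<option value="1164">12 months</option>
<option value="1165">18 months - Not Available</option>
<option value="1167">24 months</option>
</select>
HTML;

/**
 * Hypothetical helper: maps option value => label and availability.
 */
function sizesFromSelect(string $html, string $selectId): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $options = $xpath->query(sprintf('//select[@id="%s"]/option', $selectId));

    $sizes = [];
    foreach ($options as $option) {
        $value = $option->getAttribute('value');
        if ($value === '-1') {
            continue; // skip the "Select Size" placeholder
        }
        $label = trim($option->textContent);
        $sizes[$value] = [
            'label'     => $label,
            // Assumption: unavailable sizes contain this text in their label.
            'available' => stripos($label, 'Not Available') === false,
        ];
    }
    return $sizes;
}

$sizes = sizesFromSelect($html, 'ctl00_ContentMainPage_ctlSeparateProduct_drpdwnSize');
print_r($sizes);
```

The XPath query keeps the lookup tied to the element id, so unrelated `<option>` elements elsewhere on a page would be ignored.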