PHP - Processing a Screen Scraped Page
I have used previous topics to scrape a webpage successfully with cURL and PHP. That part is working fine; what I need to do now is process some information from the page that has no identifiable classes or markup I can easily use. The example code I have is:
<h3>Building details:</h3>
<p>Disabled ramp access<br />
Male, female and disabled toilets available</p>
<br/>
<p><strong>Appointment lead times:</strong></p>
<p><strong>Type 1</strong>: 8 weeks<br />
<strong>Type 2</strong>: 5 weeks<br />
<strong>Type 3</strong>: 3 weeks<br />
<strong>Type 4</strong>: 3 weeks
</p>
What I need to do is get the number of weeks of lead time for the different types of appointment, mainly Type 1. Sometimes appointment lead times are unavailable, and the page states:
<p><strong>Appointment lead times:</strong></p>
<p><strong>Type 1</strong>:&nbsp; No information available<br />
I have looked at several methods (RegEx, Simple DOM Parser, etc.) but haven't really found a solution for what I am trying to achieve.
Many thanks.
When doing this kind of thing, it can get messy. You have to find some point in the markup where you can break it apart reliably. Your sample has one spot I can see: Type 1</strong>: (note the space after the colon). So, I would do this:
$parts = explode('Type 1</strong>: ', $text);
Now, the first bit of $parts[1] will have either your timeframe or the no-information message. Let's use the <br /> at the end to chop it:
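// Only continue if the marker was found exactly once.
if (count($parts) == 2) {
    // Keep everything before the first <br /> after the marker.
    $parts = explode('<br />', $parts[1]);
    // Strip the unit so that only the number (or the message) remains.
    $parts = trim(str_replace(' weeks', '', $parts[0]));
}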
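Now, $parts has our message, or our timeframe as a number; is_numeric will show the way! Note that in your "no information" sample the colon is followed by &nbsp; rather than a plain space, so the explode() marker will not match there and $parts stays an array, which is_numeric also rejects. This is a dirty method, but scraping page data usually is. Be sure to check the results of each step before assuming you're good for the next.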
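The explode() steps above can also be folded into a single regular expression that accepts either a plain space or &nbsp; after the colon. This is only a minimal sketch under a few assumptions: the function name leadTimeWeeks is hypothetical, the label always sits inside a <strong> tag as in the sample, and a usable value always looks like "8 weeks".

```php
<?php
// Hypothetical helper: returns the lead time in weeks for one appointment
// type, or null when the label is missing or no number is given.
function leadTimeWeeks(string $html, string $type): ?int
{
    // Match "<strong>Type 1</strong>:", skip spaces and &nbsp; entities,
    // then capture the text up to the next tag.
    $pattern = '~<strong>\s*' . preg_quote($type, '~')
             . '\s*</strong>\s*:(?:\s|&nbsp;)*([^<]*)~i';

    if (!preg_match($pattern, $html, $m)) {
        return null; // label not found on the page
    }

    // Accept only values shaped like "8 weeks" or "1 week".
    if (preg_match('~^(\d+)\s*weeks?\b~i', trim($m[1]), $n)) {
        return (int) $n[1];
    }

    return null; // e.g. "No information available"
}
```

Calling leadTimeWeeks($text, 'Type 1') on the scraped page then gives an integer or null, so the caller only needs a null check instead of is_numeric.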
Use Tidy (http://php.net/manual/en/book.tidy.php) to convert the page into valid XML; then you can easily query it using XPath via SimpleXML: http://www.w3schools.com/php/php_xml_dom.asp
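As a rough illustration of the XPath idea, here is a sketch that swaps in DOMDocument::loadHTML and DOMXPath for the Tidy plus SimpleXML pipeline, since loadHTML tolerates invalid markup without a separate clean-up pass. It assumes PHP's DOM extension is installed; the function name leadTimesByXPath is hypothetical, and it relies on each value being the text node directly after the <strong> label, as in the sample.

```php
<?php
// Hypothetical helper: maps each "Type N" label to the text that follows it.
function leadTimesByXPath(string $html): array
{
    // Scraped HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $results = [];

    // Every <strong> inside a <p> whose text starts with "Type".
    $query = '//p/strong[starts-with(normalize-space(.), "Type")]';
    foreach ($xpath->query($query) as $label) {
        // The value is the text node right after the <strong> element.
        $next = $label->nextSibling;
        $value = ($next !== null && $next->nodeType === XML_TEXT_NODE)
            ? $next->nodeValue
            : '';
        // Strip the leading colon plus ordinary and non-breaking spaces.
        $value = preg_replace('~^[:\s\x{00A0}]+|[\s\x{00A0}]+$~u', '', $value);
        $results[trim($label->textContent)] = $value;
    }

    return $results;
}
```

The result is an array keyed by label, so the Type 1 entry can be checked for a number of weeks or for the "No information available" message.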