PHP - Processing a Screen Scraped Page

Following earlier topics, I have managed to scrape a web page using cURL and PHP, and that part works fine. What I need to do now is extract some information from the page that has no identifiable classes or markup I can easily hook into. The example markup I have is:

<h3>Building details:</h3>
<p>Disabled ramp access<br />
  Male, female and disabled toilets available</p>
  <br/>
  <p><strong>Appointment lead times:</strong></p>
  <p><strong>Type 1</strong>:&nbsp; 8 weeks<br />
  <strong>Type 2</strong>:&nbsp;5 weeks<br />
  <strong>Type 3</strong>:&nbsp;3 weeks<br />
  <strong>Type 4</strong>:&nbsp;3 weeks
</p>

What I need is the lead time in weeks for the different appointment types, mainly Type 1. Sometimes the lead times are unavailable, and the page states:

<p><strong>Appointment lead times:</strong></p>
<p><strong>Type 1</strong>:&nbsp; No information available<br />

I have looked at several methods (RegEx, Simple DOM Parser, etc.) but haven't found a solution for what I am trying to achieve.

Many thanks.


When doing this kind of thing, it can get messy. You have to find some point in the code to break it apart in a reliable way. Your sample there has one spot I can see: Type 1</strong>:&nbsp;. So, I would do this:

$parts = explode('Type 1</strong>:&nbsp;', $text);

Now, the first bit of $parts[1] will have either your timeframe, or the no information message. Let's use the <br /> at the end to chop it:

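// Exactly two pieces means the marker was found once.
if (count($parts) == 2) {
  // Keep only the text before the first <br />.
  $parts = explode('<br />', $parts[1]);
  // Drop the unit and surrounding whitespace, e.g. ' 8 weeks' becomes '8'.
  $parts = trim(str_replace(' weeks', '', $parts[0]));
}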

Now, $parts has our message, or our timeframe as a number. is_numeric will show the way! This is a dirty method, but scraping page data usually is. Be sure to check the results of each step before assuming you're good for the next.
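Put together, the steps above might look like the following sketch. The function name leadTimeWeeks and its $type parameter are made up for illustration, and it assumes the label is always followed by exactly `</strong>:&nbsp;` and that the value ends at a `<br />`, as in the sample markup.

```php
<?php
// Hypothetical helper wrapping the explode() steps above.
// Returns the lead time in weeks, or null when the label is missing
// or the value is not a number (e.g. "No information available").
function leadTimeWeeks(string $html, string $type = 'Type 1'): ?int
{
    // Split on the label; everything after it is in $parts[1].
    $parts = explode($type . '</strong>:&nbsp;', $html);
    if (count($parts) < 2) {
        return null; // label not found on the page
    }

    // Keep only the text before the next <br />.
    $value = explode('<br />', $parts[1])[0];

    // ' 8 weeks' becomes '8'; a message stays a non-numeric string.
    $value = trim(str_replace(' weeks', '', $value));

    return is_numeric($value) ? (int) $value : null;
}
```

Note that the last entry in the sample (Type 4) has no trailing `<br />`, so this sketch would return null for it; that is one of the places where checking each step matters.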


Use Tidy (http://php.net/manual/en/book.tidy.php) to convert the page into valid XML; you can then easily query it with XPath via SimpleXML (http://www.w3schools.com/php/php_xml_dom.asp).
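As a rough illustration of the XPath idea, here is a sketch that swaps in DOMDocument and DOMXPath rather than Tidy plus SimpleXML, since DOMDocument::loadHTML() tolerates imperfect HTML on its own. The function name leadTimeWeeksXPath is hypothetical, and the sketch assumes $type contains no double quotes and that the value is the first text node after the `<strong>` label.

```php
<?php
// Hypothetical helper: find the text after a <strong> label via XPath.
// Returns the lead time in weeks, or null if it cannot be determined.
function leadTimeWeeksXPath(string $html, string $type = 'Type 1'): ?int
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // First text node that follows the <strong> whose text is the label.
    $query = sprintf(
        '//strong[normalize-space(.)="%s"]/following-sibling::text()[1]',
        $type
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // label not found
    }

    // The text looks like ": 8 weeks"; pull out the number if present.
    if (preg_match('/(\d+)\s*weeks?/', $nodes->item(0)->nodeValue, $m)) {
        return (int) $m[1];
    }

    return null; // e.g. "No information available"
}
```

Because this looks at the text node rather than splitting on `<br />`, it should also cope with the last entry in a list, which has no trailing `<br />`.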
