PHP - Processing a Screen Scraped Page

2023-02-11 12:55

I have followed earlier topics on scraping a webpage with cURL and PHP, and that part works fine. What I need to do now is process some information from the page that has no identifiable classes or markup I can easily hook into. The example code I have is:

<h3>Building details:</h3>
<p>Disabled ramp access<br />
  Male, female and disabled toilets available</p>
  <br/>
  <p><strong>Appointment lead times:</strong></p>
  <p><strong>Type 1</strong>:&nbsp; 8 weeks<br />
  <strong>Type 2</strong>:&nbsp;5 weeks<br />
  <strong>Type 3</strong>:&nbsp;3 weeks<br />
  <strong>Type 4</strong>:&nbsp;3 weeks
</p>

What I need is the lead time in weeks for the different appointment types, mainly Type 1. Sometimes the lead times are unavailable and the page states:

<p><strong>Appointment lead times:</strong></p>
<p><strong>Type 1</strong>:&nbsp; No information available<br />

I have looked at several methods (RegEx, Simple DOM Parser, etc.) but haven't found a solution for what I am trying to achieve.

Many thanks.

When doing this kind of thing, it can get messy. You have to find some point in the markup where you can break it apart reliably. Your sample has one spot I can see: Type 1</strong>:&nbsp; (note that the raw HTML contains the &nbsp; entity, not a plain space). So, I would do this:

$parts = explode('Type 1</strong>:&nbsp;', $text);

Now, the first bit of $parts[1] will have either your timeframe, or the no information message. Let's use the <br /> at the end to chop it:

if (count($parts) == 2) {
  $parts = explode('<br />', $parts[1]);
  $parts = trim(str_replace(' weeks', '', $parts[0]));
}

Now, $parts has our message, or our timeframe as a number. is_numeric will show the way! This is a dirty method, but scraping page data usually is. Be sure to check the results of each step before assuming you're good for the next.
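The steps above can be gathered into one function. This is only a sketch under a few assumptions: the function name leadTimeWeeks and its parameters are made up for illustration, the label is always wrapped in <strong> and followed by a colon, and the value ends at the next <br> or </p>. It splits on the label, trims the unit, and uses is_numeric to tell a number apart from the "No information available" message.

```php
<?php
// Hypothetical helper: returns the lead time in weeks as an int,
// or null when the label is missing or no number is given.
function leadTimeWeeks(string $html, string $label = 'Type 1'): ?int
{
    // Split once on the label's closing tag and colon.
    $parts = explode($label . '</strong>:', $html, 2);
    if (count($parts) !== 2) {
        return null; // label not found on the page
    }

    // Keep only the text up to the next <br>, <br/>, <br /> or </p>.
    $value = preg_split('~<br\s*/?>|</p>~i', $parts[1], 2)[0];

    // Turn non-breaking spaces (entity or raw character) into plain spaces,
    // then strip the unit so only the number, or the message, is left.
    $value = str_replace(['&nbsp;', "\u{00A0}"], ' ', $value);
    $value = trim(str_ireplace(['weeks', 'week'], '', $value));

    return is_numeric($value) ? (int) $value : null;
}
```

Calling leadTimeWeeks($html) on the sample markup from the question should give the Type 1 figure, and leadTimeWeeks($html, 'Type 3') the Type 3 one. A null result means either the label was not found or the page reported no information, so the caller may want to log the two cases separately.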

Use Tidy (http://php.net/manual/en/book.tidy.php) to convert the page into valid XML; then you can easily query it using XPath via SimpleXML: http://www.w3schools.com/php/php_xml_dom.asp
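As a rough illustration of the XPath idea, here is a sketch that swaps in a different but related technique: instead of Tidy plus SimpleXML, it uses DOMDocument::loadHTML, which tolerates HTML that is not well-formed XML, together with DOMXPath. The function name leadTimeViaXPath is hypothetical. It assumes the DOM extension is enabled, the HTML is non-empty, the label contains no double quotes, and the value is the text node directly after the matching <strong> element.

```php
<?php
// Hypothetical helper: finds <strong>$label</strong> and reads the
// text node that follows it, returning the number of weeks or null.
function leadTimeViaXPath(string $html, string $label = 'Type 1'): ?int
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since scraped markup is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumes $label has no double quotes, as it is embedded in the query.
    $query = sprintf(
        '//strong[normalize-space(.)="%s"]/following-sibling::text()[1]',
        $label
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // label not found
    }

    // The text node looks like ": 8 weeks" or ": No information available";
    // take the first run of digits if there is one.
    $text = $nodes->item(0)->nodeValue;
    return preg_match('/\d+/', $text, $m) ? (int) $m[0] : null;
}
```

Compared with splitting strings, this keeps working if the whitespace or the style of the <br> tags changes, but it still depends on the label text staying the same.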
