What's the most efficient way to get this data, thousands of times?

2023-02-17 07:54 Q&A author:

What would be the best way to get the following data (the 4.0m after the </b> tag) using PHP's DOMDocument->loadHTML() system? I'm guessing some kind of CSS-style selector?

(LINE 240, always 240) <b>Current Price:</b> 4.0m

I have been looking around the documentation, but to be honest this is all completely alien to me! Furthermore, how would I be able to get this data for thousands of pages, from URLs such as:

http://site.com/q=item/viewitem.php?obj=11928

The minimum and maximum obj=# values are known (that is, how many pages I will need to scrape). I want to fetch every page in that range in sequence and write the name, description, and price (I'm not terribly concerned about the percentage rise/drop yet) to a MySQL database, so I can read it from there and display it on my site.

Here is the main block of code that I am interested in:

<div class="subsectionHeader"> 
<h2> 
Item Name
</h2> 
</div> 
<div id="item_additional" class="inner_brown_box">  
Description of item goes here.
<br> 
<br> 
<b>Current Price:</b> 4.0m
<br><br> 
<b>Change in Price:</b><br> 
<span> 
<b>30 Days:</b> <span class="rise">+2.5%</span> 
</span> 
<span class="spaced_span"> 
<b>90 Days:</b> <span class="drop">-30.4%</span> 
</span> 
<span class="spaced-span"> 
<b>180 Days:</b> <span class="drop">-33.3%</span> 
</span> 
<br class="clear"> 
</div> </div> <div class="brown_box main_page"> 
<div class="subsectionHeader">

If anyone could provide any skeletal hints on how to go about this, it would be much appreciated!

Parsing HTML with regular expressions is usually a bad idea, but in your case it may be the right, easy way. It's fast enough and arguably more flexible than chunking the page with strpos and plain-text patterns.

Try this example with source HTML given above:

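//checked with php 5.3.3
// $src is expected to hold the HTML of one item page
if (preg_match('#<h2>(?P<itemName>[^>]+)</h2>.*?<div[^>]+id=([\'"])item_additional(\2)[^>]*>\s*(?P<description>[^<]+).*?<b>\s*Current\s+Price\s?:?</b>\s*(?P<price>[^<]+)#six',$src, $matches))
{
    // the captures keep their surrounding whitespace, so trim them before use
    print_r(array_map('trim', $matches));
}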

Regular expressions might look too complex, but with documentation and nice tools like RegexBuddy or Expresso, anyone can write simple ones ;)

You could use Simple HTML DOM Parser - http://simplehtmldom.sourceforge.net/

Extract the contents using:

echo file_get_html('http://www.google.com/')->plaintext;

Then locate the 4.0m with one of PHP's string functions, such as strpos.
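
Once the page has been reduced to plain text, the price can be found by its label. This is only a minimal sketch, assuming the text contains the literal label "Current Price:" with the value on the same line. The helper name extractAfterLabel is made up for illustration and is not part of Simple HTML DOM Parser:

```php
<?php
// Hypothetical helper: returns the text that follows $label up to the end of
// that line, or null when the label does not occur in $text.
function extractAfterLabel(string $text, string $label): ?string
{
    $start = strpos($text, $label);
    if ($start === false) {
        return null;
    }
    $start += strlen($label); // move past the label itself

    // The value ends at the next line break, or at the end of the text.
    $end = strpos($text, "\n", $start);
    if ($end === false) {
        $end = strlen($text);
    }
    return trim(substr($text, $start, $end - $start));
}
```

Called as extractAfterLabel($plaintext, 'Current Price:'), it should give back the price string on pages that follow the layout in the question, and null on pages that do not.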

DOM parsing is the most robust way to do this.
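
For reference, a rough sketch of the DOM route using PHP's built-in DOMDocument and DOMXPath. It assumes the markup from the question: an h2 directly before the div with id "item_additional", the description as the first text inside that div, and the price as the text right after the "Current Price:" label. The function name parseItemPage is hypothetical:

```php
<?php
// Hypothetical helper: pulls name, description and price out of one item page.
function parseItemPage(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $box = '//div[@id="item_additional"]';

    // Nearest <h2> that appears before the item box.
    $name = $xpath->evaluate('string(' . $box . '/preceding::h2[1])');
    // First non-blank text node directly inside the box.
    $description = $xpath->evaluate(
        'string(' . $box . '/text()[normalize-space()][1])'
    );
    // Text node that immediately follows the "Current Price:" label.
    $price = $xpath->evaluate(
        'string(' . $box . '/b[normalize-space()="Current Price:"]'
        . '/following-sibling::text()[1])'
    );

    return array(
        'name' => trim($name),
        'description' => trim($description),
        'price' => trim($price),
    );
}
```

Any field that cannot be found comes back as an empty string, which makes it easy to spot pages whose layout differs.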

If you want the fastest way, and know that the HTML structure is consistent, it would probably be faster to use strpos to search for offsets. It is more likely to break if the page structure changes, though. Something like this:

$needles = array(
  'name' => "<div class=\"subsectionHeader\">\n<h2>\n",
  'description' => "<div id=\"item_additional\" class=\"inner_brown_box\">\n",
  'price' => "<b>Current Price:</b> ",
);
$buffer = file_get_contents("http://site.com/q=item/viewitem.php?obj=1234");
$result = array();
foreach ($needles as $key => $needle) {
  $index1 = strpos($buffer, $needle);
  if ($index1 === false) {
    continue; // needle not found; the page layout may differ
  }
  $start = $index1 + strlen($needle); // skip past the needle itself
  $index2 = strpos($buffer, "\n", $start);
  if ($index2 === false) {
    $index2 = strlen($buffer); // value runs to the end of the page
  }
  $result[$key] = trim(substr($buffer, $start, $index2 - $start));
}

You will need to get the needles exactly right, including any trailing whitespace.
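
None of the above covers the "thousands of pages" part of the question, so here is a hedged sketch of one way to drive the loop. Everything in it is an assumption for illustration: the function names scrapeRange and makePdoStore, the items table and its columns, and the choice of PDO with MySQL's REPLACE INTO. The fetching and parsing steps are passed in as callables so that any of the extraction approaches can be plugged in:

```php
<?php
// Hypothetical driver: visits every obj ID in [$minId, $maxId], parses each
// page and hands the result to $store. Returns the number of items stored.
function scrapeRange(int $minId, int $maxId, callable $fetch, callable $parse, callable $store): int
{
    $stored = 0;
    for ($id = $minId; $id <= $maxId; $id++) {
        $html = $fetch($id);
        if ($html === false || $html === null) {
            continue; // page missing or request failed
        }
        $item = $parse($html);
        if (!isset($item['price']) || $item['price'] === '') {
            continue; // layout not recognised, nothing worth storing
        }
        $store($id, $item);
        $stored++;
    }
    return $stored;
}

// Assumed table: items(obj_id INT PRIMARY KEY, name, description, price).
// The statement is prepared once and executed once per item.
function makePdoStore(PDO $pdo): callable
{
    $stmt = $pdo->prepare(
        'REPLACE INTO items (obj_id, name, description, price) VALUES (?, ?, ?, ?)'
    );
    return function (int $id, array $item) use ($stmt): void {
        $stmt->execute(array($id, $item['name'], $item['description'], $item['price']));
    };
}
```

In practice $fetch would probably wrap file_get_contents (or curl) around the viewitem.php?obj= URL and pause briefly between requests so the remote site is not hammered, and $store would come from makePdoStore with a PDO connection to your MySQL database.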
