
Scraping (Regex) Issues

I've been trying to build a simple scraper that would take a keyword, then go to Amazon and enter the keyword into the search box, then scrape the main results only.

The problem is that the Regex isn't working. I've tried many different ways, but it's still not working properly.

    $url = "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=dog+bed&x=0&y=0";

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $return = curl_exec($ch);
    curl_close($ch);

    preg_match_all('(<div.*class="data">.*<div class="title">.*<a.*class="title".*href="(.*?)">(.*?)</a>)', $return, $matches);

    var_dump($matches);

Now Amazon's HTML code looks like this:

<div class="title">
<a class="title" href="https://rads.stackoverflow.com/amzn/click/com/B00063KG7S" rel="nofollow noreferrer">Midwest 40236 36-By-23-Inch Quiet Time Bolster Pet Bed, Fleece</a>
        <span class="ptBrand">by Midwest Homes for Pets</span>
 <span class="bindingAndRelease">(Nov 30, 2006)</span>
        </div>

I've tried to change the Regex a million different ways, but what I've learned over the past few months just isn't working, at all. Of course, if I just change it to href="(.*?)" - I get every link on there...but not when I add in the

Any advice would be appreciated!


Parsing complex structures with a regex often fails. The regex gets complicated, and even if you put a lot of effort in, it never quite works properly. That's due to the nature of the data you want to analyse and the limitations of regexes.

Back when websites weren't that complex, I did the following, which often works well as a quick solution:

find a string that marks the beginning of the interesting part and cut everything out before it. Then find a string that marks the end and cut out everything after it.

and then parse :)
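The cut-before/cut-after trick above can be sketched with PHP's built-in string functions. The function name and marker strings here are illustrative, not from the original post:

```php
<?php
// Quick-and-dirty region extraction: cut the page down to just the
// interesting part before parsing. Marker strings are hypothetical examples.
function cut_between(string $html, string $startMarker, string $endMarker): string
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return '';                              // start marker not found
    }
    $end = strpos($html, $endMarker, $start);
    if ($end === false) {
        return substr($html, $start);           // keep everything after start
    }
    return substr($html, $start, $end - $start);
}

$page   = '<html><body><div id="results"><a href="/x">X</a></div><div id="footer">...</div></body></html>';
$region = cut_between($page, '<div id="results">', '<div id="footer">');
echo $region; // only the results block remains
```

This is fragile (markers break when the site's markup changes), but for a one-off scrape it is often good enough.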

Nowadays, if you need something flexible, write yourself a cache layer so you automatically keep a local copy of the resources you need to scrape. That way you can develop your scraper without the overhead of requesting the external data over and over again while you work out the right scraping strategy (it doesn't change that fast).
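A minimal sketch of such a cache layer, using a file per URL. The function names are illustrative, not from any particular library; a real fetcher would use curl as in the question:

```php
<?php
// Development-time cache: store each fetched URL on disk so repeated runs
// reuse the local copy instead of hitting the remote site again.
function fetch_cached(string $url, string $cacheDir, callable $fetcher): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $file = $cacheDir . '/' . md5($url) . '.html';
    if (is_file($file)) {
        return file_get_contents($file);   // cache hit: no network request
    }
    $html = $fetcher($url);                // cache miss: fetch and store
    file_put_contents($file, $html);
    return $html;
}

// Usage with a stub fetcher that counts how often it is actually called:
$calls   = 0;
$fetcher = function (string $url) use (&$calls): string {
    $calls++;
    return "<html>body for $url</html>";
};
$dir = sys_get_temp_dir() . '/scrape-cache-' . uniqid();
fetch_cached('http://example.com/a', $dir, $fetcher);
fetch_cached('http://example.com/a', $dir, $fetcher);
echo $calls; // prints 1: the second call was served from the cache
```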

Then convert the HTML into a DOM, for example with DOMDocument in PHP. That works very well once you've done it two or three times. You might run into encoding problems and syntax problems, but those can be solved. And things have gotten much better compared to a few years ago.
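Loading messy real-world HTML with DOMDocument typically looks like this. The `<?xml encoding="UTF-8">` prefix is a common workaround for pages with missing or wrong charset declarations; the sample markup is a simplified stand-in for Amazon's:

```php
<?php
// Parse tolerant-but-noisy HTML with PHP's built-in DOM extension.
$html = '<div class="title"><a class="title" href="/dp/B00063KG7S">Quiet Time Bolster Pet Bed</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect warnings about broken markup
// Hint the encoding up front; mis-declared charsets are a common source
// of garbled text when scraping.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$title = $doc->getElementsByTagName('a')->item(0)->textContent;
echo $title; // prints "Quiet Time Bolster Pet Bed"
```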

Then you can step into XPath, which is quite flexible for running expressions against the DOM.
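Applied to the structure quoted in the question, an XPath expression can select exactly the result links that the regex was fighting with (the URL below is a shortened placeholder):

```php
<?php
// Extract (href, title) pairs with DOMXPath instead of a regex.
$html = <<<HTML
<div class="title">
  <a class="title" href="http://www.amazon.com/dp/B00063KG7S">Midwest Quiet Time Bolster Pet Bed</a>
  <span class="ptBrand">by Midwest Homes for Pets</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$results = [];
// Match only anchors with class "title" inside a div with class "title".
foreach ($xpath->query('//div[@class="title"]/a[@class="title"]') as $a) {
    $results[] = [$a->getAttribute('href'), trim($a->textContent)];
}
print_r($results);
```

Note that `[@class="title"]` matches the attribute value exactly; pages that use multiple classes per element need a `contains()`-based expression instead.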

But next to that there is a PHP lib that really is super-cool: FluentDOM.

It combines the best of DOMDocument, XPath and PHP and is quite flexible.

Some examples & resources by the author of FluentDOM that I can suggest:

  • Scraping Links From HTML (FluentDOM)
  • Using PHP DOM With XPath
  • Highlight Words In HTML (FluentDOM)


It might be important to note that questions like this that request help scraping copyright protected content are in violation of SO's Terms of Use, specifically the section on Subscriber Content which states:

"Subscriber represents, warrants and agrees that it will not contribute any Subscriber Content that (a) infringes, violates or otherwise interferes with any copyright or trademark of another party"

See https://meta.stackexchange.com/questions/93698/web-scraping-intellectual-property-and-the-ethics-of-answering/93701#93701 for an ongoing discussion of this issue.


You should probably use an XML parser + XPath instead of a regexp to do that… XML + RE = bad idea

Plus, isn't what you intend to do against Amazon's Terms of Use?


I've not done this in PHP, but I've done similar things in Python. I suspect the correct approach is to use an HTML DOM parser like http://simplehtmldom.sourceforge.net/, which parses the HTML and turns it into objects for you to use.
