Scraping (Regex) Issues
I've been trying to build a simple scraper that takes a keyword, goes to Amazon, enters the keyword into the search box, and then scrapes only the main results.
The problem is that the regex isn't working. I've tried many different approaches, but it still isn't matching properly.
$url = "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=dog+bed&x=0&y=0";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$return = curl_exec($ch);
curl_close($ch);
preg_match_all('(<div.*class="data">.*<div class="title">.*<a.*class="title".*href="(.*?)">(.*?)</a>)', $return, $matches);
var_dump($matches);
Now Amazon's HTML code looks like this:
<div class="title">
<a class="title" href="https://rads.stackoverflow.com/amzn/click/com/B00063KG7S" rel="nofollow noreferrer">Midwest 40236 36-By-23-Inch Quiet Time Bolster Pet Bed, Fleece</a>
<span class="ptBrand">by Midwest Homes for Pets</span>
<span class="bindingAndRelease">(Nov 30, 2006)</span>
</div>
I've tried changing the regex a million different ways, but what I've learned over the past few months just isn't working at all. Of course, if I change it to just href="(.*?)" I get every link on the page... but not when I add the rest of the pattern back in.
Any advice would be appreciated!
Parsing complex structures with a regex often fails. The regex gets complicated, and even if you put a lot of effort into it, it never works properly. That's down to the nature of the data you want to analyse and the limitations of regexes.
Back when websites weren't that complex, I used to do the following, which often works well for a quick solution:
find a string that marks the beginning of the interesting part and cut everything before it. Then find a string that marks the end and cut everything after it.
And then parse what's left :)
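A minimal sketch of that quick-and-dirty approach, assuming $return holds the HTML fetched with cURL as in the question; the marker strings here are hypothetical and would need to be taken from the actual page source:

// Cut out only the block between two marker strings before parsing it further.
$startMarker = '<div id="atfResults"';                  // assumed marker for the start of the results
$endMarker   = '<div id="btfResults"';                  // assumed marker for the end of the results
$start = strpos($return, $startMarker);
$end   = ($start !== false) ? strpos($return, $endMarker, $start) : false;
if ($start !== false && $end !== false) {
    $chunk = substr($return, $start, $end - $start);    // only the interesting part remains
    // now run a much simpler regex (or a DOM parser) on $chunk
}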
Nowadays, if you need something more flexible, write yourself a cache layer so you automatically keep a copy of the resources you need to scrape. That way you can develop your scraper without the overhead of requesting the external data over and over again while you work out the right scraping strategy (it doesn't change that fast).
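As a rough illustration of such a cache layer (the file location and lifetime are arbitrary choices), fetching through a helper like this keeps a local copy between runs:

// Fetch a URL, reusing a cached copy on disk while it is younger than $ttl seconds.
function fetch_cached($url, $ttl = 86400) {
    $file = sys_get_temp_dir() . '/scrape_' . md5($url) . '.html';
    if (file_exists($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);                // serve the local copy
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html !== false) {
        file_put_contents($file, $html);                // store it for the next run
    }
    return $html;
}

// Usage: $return = fetch_cached($url);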
Then convert the HTML into a DOM tree, for example with DOMDocument in PHP. That works very well once you've done it two or three times. You might run into encoding problems and syntax problems, but those can be solved, and things have gotten much better compared to a few years ago.
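For instance, loading messy real-world HTML into DOMDocument without being flooded by syntax warnings could look roughly like this:

libxml_use_internal_errors(true);                       // swallow warnings about sloppy markup
$doc = new DOMDocument();
// If the encoding comes out wrong, prepending a hint such as
// '<?xml encoding="UTF-8">' . $return  often helps.
$doc->loadHTML($return);                                // $return is the HTML fetched above
libxml_clear_errors();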
Then you can step into XPath, which is quite flexible for running expressions against the document.
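Continuing from the $doc above, an XPath query matching the markup shown in the question might look like this (the class names are taken from the snippet in the question and could differ on the live page):

$xpath = new DOMXPath($doc);
// Every result link: an <a class="title"> inside a <div class="title">
foreach ($xpath->query('//div[@class="title"]/a[@class="title"]') as $a) {
    $href  = $a->getAttribute('href');
    $title = trim($a->textContent);
    echo $title . ' => ' . $href . "\n";
}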
But besides that, there is a PHP lib that is really super-cool: FluentDOM.
It combines the best of DOMDocument, XPath, and PHP, and it is quite flexible.
Some examples and resources by the author of FluentDOM I can suggest:
- Scraping Links From HTML (FluentDOM)
- Using PHP DOM With XPath
- Highlight Words In HTML (FluentDOM)
It might be important to note that questions like this, which request help scraping copyright-protected content, are in violation of SO's Terms of Use, specifically the section on Subscriber Content, which states:
"Subscriber represents, warrants and agrees that it will not contribute any Subscriber Content that (a) infringes, violates or otherwise interferes with any copyright or trademark of another party"
See https://meta.stackexchange.com/questions/93698/web-scraping-intellectual-property-and-the-ethics-of-answering/93701#93701 for an ongoing discussion of this issue.
You should probably use an XML parser plus XPath instead of a regexp to do that… XML + RE = bad idea.
Plus, isn't doing what you intend to do against Amazon's Terms of Use?
I haven't done this in PHP, but I've done similar things in Python. I suspect the correct approach is to use an HTML DOM parser like http://simplehtmldom.sourceforge.net/, which parses the HTML and turns it into objects for you to use.
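For what it's worth, with that library the markup from the question could be handled roughly like this (a sketch based on its str_get_html()/find() API; the selector is an assumption based on the snippet above):

include 'simple_html_dom.php';                          // the parser from the link above

$html = str_get_html($return);                          // $return is the HTML fetched with cURL
foreach ($html->find('div.title a.title') as $a) {
    echo $a->plaintext . ' => ' . $a->href . "\n";      // link text and target URL
}
$html->clear();                                         // free the memory the parser holds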