Having trouble parsing a website with regular expressions

I'm trying to parse search results for WorldCat.org in order to fetch basic information about books and articles.

A typical search result (and the one I'm using for testing) can be found here: http://www.worldcat.org/search?q=ti%3Aorganizations&fq=dt%3Abks&qt=advanced&dblist=638

The html for that page is here: http://pastebin.com/w2U91F1i

Here is the regular expression I'm using with PHP preg_match_all to capture basic details about each entry:

$data = file_get_contents($url);
preg_match_all('/<div class="oclc_number">(.*?)<\/div>\n.*?<div class="name">\n.*?<a href="(.*?)"><strong>(.*?)<\/strong><\/a>\n.*?\n\n<div class="author">by\s(.*?)<\/div><div class="type">.*?<span class=\'itemType\'>(.*?)<\/span>.*?\n.*?<span class="itemLanguage">(.*?)<\/span>.*?<div class="type">Publication:\s*?(.*?)<\/div>/', $data, $topics, PREG_SET_ORDER);

When I use this expression with the regexr tool (http://gskinner.com/RegExr/) it works just fine (except I use \r instead of \n -- usually \r doesn't work for me). But preg_match_all gives me an empty array each time.

Any clues as to what I'm doing wrong?


Whenever I need to scrape HTML, I tend to use the Simple HTML DOM Parser library. It parses the HTML into a traversable tree of PHP objects, which you can query with selectors much as you would in jQuery.
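
To give a rough idea of what DOM-based scraping looks like, here is a minimal sketch. Note that it uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM Parser, so that it runs without any extra files. The sample markup is an assumption reconstructed from the class names in the question's regex, and the wrapping `div class="result"` and the helper `firstText` are hypothetical; the real WorldCat page may well be structured differently.

```php
<?php
// Hypothetical sample markup, modelled on the class names in the question's
// regex. The real WorldCat HTML may differ.
$html = <<<'HTML'
<html><body>
<div class="result">
  <div class="oclc_number">12345</div>
  <div class="name"><a href="/title/example/oclc/12345"><strong>Organizations</strong></a></div>
  <div class="author">by Jane Doe</div>
  <div class="type"><span class="itemType">Book</span> <span class="itemLanguage">English</span></div>
</div>
</body></html>
HTML;

// Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of the first node matching $query
// relative to $context, or null if nothing matches.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$books = [];
foreach ($xpath->query('//div[@class="oclc_number"]') as $oclcNode) {
    // Assumption: all fields of one search result share the same parent element.
    $entry = $oclcNode->parentNode;

    $author = firstText($xpath, './/div[@class="author"]', $entry) ?? '';

    $books[] = [
        'oclc'     => trim($oclcNode->textContent),
        'url'      => $xpath->evaluate('string(.//div[@class="name"]//a/@href)', $entry),
        'title'    => firstText($xpath, './/div[@class="name"]//strong', $entry),
        // Strip the leading "by " from the already-extracted plain text.
        'author'   => preg_replace('/^by\s+/', '', $author),
        'type'     => firstText($xpath, './/span[@class="itemType"]', $entry),
        'language' => firstText($xpath, './/span[@class="itemLanguage"]', $entry),
    ];
}

print_r($books);
```

Because the queries look at element structure rather than the exact characters between tags, differences such as `\n` versus `\r\n` line endings or single versus double quotes around attribute values do not affect the result.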


HTML is not a regular language, so don't try to parse it with regular expressions!

Read the first answer here:

RegEx match open tags except XHTML self-contained tags
