开发者

How do I screen scrape a website and get data within div?

How can I screen scrape a 开发者_JAVA技巧website using cURL and show the data within a specific div?


Download the page using cURL (There are a lot of examples in the documentation). Then use a DOM Parser, for example Simple HTML DOM or PHPs DOM to extract the value from the div element.


After downloading with cURL use XPath to select the div and extract the content.


A possible alternative.

# We will store the web page in a string variable.
var string page

# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page

# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page

This code is in biterscripting. I am using the 3 as sample to extract 3rd div. If you want to extract the div that has say string "ABC", then use this command syntax.

stex -r -c "^<div&ABC&</div\>^" $page

Take a look at this script http://www.biterscripting.com/helppages/SS_ExtractTable.html . It shows how to extract an element (div, table, frame, etc.) when the elements are nested.


Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.

Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to do some reading up on regular expressions to be able to build the pattern you need. As Yacoby mentioned which I hadn't thought of, a better idea may be to examine the DOM of the HTML page using PHP's Simple XML or DOM parser.

Output the information you've found from the regex/parser in the HTML of your page (within the required div.)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜